parse-latin
Natural language parser, for Latin-script languages, that produces nlcst.
What is this?
This package exposes a parser that takes Latin-script natural language and
produces a syntax tree.
When should I use this?
If you want to handle natural language as syntax trees manually, use this.
Alternatively, you can use the retext plugin retext-latin, which wraps this
project to also parse natural language at a higher-level (easier) abstraction.
Whether the text is Old English (“þā gewearþ þǣm hlāforde and þǣm hȳrigmannum
wiþ ānum penninge”), Icelandic (“Hvað er að frétta”), or French (“Où sont les
toilettes?”), this project does a good job of tokenizing it.
For English and Dutch, you can instead use parse-english and parse-dutch.
It also works to some extent for Latin-like scripts, such as Cyrillic
(“привет”), Georgian (“გამარჯობა”), Armenian (“Բարեւ”), and similar.
Install
This package is ESM only.
In Node.js (version 16+), install with npm:
npm install parse-latin
In Deno with esm.sh:
import {ParseLatin} from 'https://esm.sh/parse-latin@7'
In browsers with esm.sh:
<script type="module">
import {ParseLatin} from 'https://esm.sh/parse-latin@7?bundle'
</script>
Use
import {ParseLatin} from 'parse-latin'
import {inspect} from 'unist-util-inspect'
const tree = new ParseLatin().parse('A simple sentence.')
console.log(inspect(tree))
Yields:
RootNode[1] (1:1-1:19, 0-18)
└─0 ParagraphNode[1] (1:1-1:19, 0-18)
    └─0 SentenceNode[6] (1:1-1:19, 0-18)
        ├─0 WordNode[1] (1:1-1:2, 0-1)
        │   └─0 TextNode "A" (1:1-1:2, 0-1)
        ├─1 WhiteSpaceNode " " (1:2-1:3, 1-2)
        ├─2 WordNode[1] (1:3-1:9, 2-8)
        │   └─0 TextNode "simple" (1:3-1:9, 2-8)
        ├─3 WhiteSpaceNode " " (1:9-1:10, 8-9)
        ├─4 WordNode[1] (1:10-1:18, 9-17)
        │   └─0 TextNode "sentence" (1:10-1:18, 9-17)
        └─5 PunctuationNode "." (1:18-1:19, 17-18)
API
This package exports the identifier ParseLatin.
There is no default export.
ParseLatin()
Create a new parser.
ParseLatin#parse(value)
Turn natural language into a syntax tree.
Parameters
value (string, optional): value to parse
Returns
Tree (RootNode).
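To illustrate how the returned tree can be consumed, here is a minimal sketch that collects the text of every word. The helpers collectWords and textOf are hypothetical names, not part of parse-latin, and the tree literal is hand-written to mirror the shape shown under Use (positions omitted). It assumes that literal nodes carry a value field and parent nodes carry a children array, as in nlcst.

```javascript
// Hypothetical helper: concatenate the text of a node and its descendants.
function textOf(node) {
  return 'value' in node ? node.value : node.children.map(textOf).join('')
}

// Hypothetical helper: recursively collect the text of every WordNode.
function collectWords(node, words = []) {
  if (node.type === 'WordNode') {
    words.push(textOf(node))
  } else if (node.children) {
    for (const child of node.children) collectWords(child, words)
  }
  return words
}

// Hand-written stand-in for the tree of 'A simple sentence.' (no positions).
const word = (value) => ({
  type: 'WordNode',
  children: [{type: 'TextNode', value}]
})
const space = {type: 'WhiteSpaceNode', value: ' '}

const tree = {
  type: 'RootNode',
  children: [
    {
      type: 'ParagraphNode',
      children: [
        {
          type: 'SentenceNode',
          children: [
            word('A'),
            space,
            word('simple'),
            space,
            word('sentence'),
            {type: 'PunctuationNode', value: '.'}
          ]
        }
      ]
    }
  ]
}

console.log(collectWords(tree)) // → ['A', 'simple', 'sentence']
```

With the real parser, the same helper should accept the result of new ParseLatin().parse('A simple sentence.') in place of the hand-written tree.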
Algorithm
👉 Note: The easiest way to see how parse-latin parses is by using the
online parser demo, which shows the syntax tree corresponding to
the typed text.
parse-latin splits text into white space, punctuation, symbol, and word
tokens:
- “word” is one or more Unicode letters or numbers
- “white space” is one or more Unicode white space characters
- “punctuation” is one or more Unicode punctuation characters
- “symbol” is one or more of anything else
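The four token kinds above can be approximated with Unicode property escapes. This is only a rough sketch of the first step, not the actual implementation in parse-latin: classify and tokenize are hypothetical names, and the exact character classes the library uses may differ.

```javascript
// Hypothetical helper: map one character to one of the four token kinds.
function classify(character) {
  if (/[\p{L}\p{N}]/u.test(character)) return 'word'
  if (/\p{White_Space}/u.test(character)) return 'whitespace'
  if (/\p{P}/u.test(character)) return 'punctuation'
  return 'symbol' // anything else
}

// Hypothetical helper: group consecutive characters of the same kind.
function tokenize(value) {
  const tokens = []
  for (const character of value) {
    const kind = classify(character)
    const last = tokens[tokens.length - 1]
    if (last && last.kind === kind) {
      last.value += character // extend the current run
    } else {
      tokens.push({kind, value: character}) // start a new run
    }
  }
  return tokens
}

console.log(tokenize('Hello, world!'))
```

For 'Hello, world!' this yields a word, a punctuation mark, white space, another word, and a final punctuation mark. Iterating with for...of keeps characters outside the Basic Multilingual Plane intact.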
Then, it manipulates and merges those tokens into a syntax tree, adding
sentences and paragraphs where needed.
- some punctuation marks are part of the word they occur in, such as
  “non-profit”, “she’s”, “G.I.”, “11:00”, “N/A”, “&c”, “nineteenth- and…”
- some periods do not mark a sentence end, such as “1.”, “e.g.”, “id.”
- although periods, question marks, and exclamation marks (sometimes) end a
  sentence, that end might not occur directly after the mark, such as
  “.)”, “."”
- …and many more exceptions
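As an illustration of the first exception only, the sketch below merges a single punctuation mark that sits between two words into one word token. It is a simplified stand-in rather than the merging logic in parse-latin: the function name mergeInnerWordPunctuation, the token shape {kind, value}, and the set of marks are all assumptions made for this example, and cases such as “G.I.” or “&c” are not handled.

```javascript
// Assumed set of marks that may join two word parts (illustration only).
const innerWordMarks = new Set(['-', '’', "'", ':', '/'])

// Hypothetical helper: merge word + mark + word into a single word token.
function mergeInnerWordPunctuation(tokens) {
  const result = []
  let index = 0
  while (index < tokens.length) {
    const token = tokens[index]
    const previous = result[result.length - 1]
    const next = tokens[index + 1]
    if (
      token.kind === 'punctuation' &&
      innerWordMarks.has(token.value) &&
      previous &&
      previous.kind === 'word' &&
      next &&
      next.kind === 'word'
    ) {
      // Fold the mark and the following word into the previous word.
      previous.value += token.value + next.value
      index += 2
    } else {
      result.push({...token}) // copy, so the input is not mutated
      index++
    }
  }
  return result
}

const input = [
  {kind: 'word', value: 'she'},
  {kind: 'punctuation', value: '’'},
  {kind: 'word', value: 's'},
  {kind: 'whitespace', value: ' '},
  {kind: 'word', value: 'non'},
  {kind: 'punctuation', value: '-'},
  {kind: 'word', value: 'profit'},
  {kind: 'punctuation', value: '.'}
]

console.log(mergeInnerWordPunctuation(input).map((token) => token.value))
// → ['she’s', ' ', 'non-profit', '.']
```

The trailing period stays a separate token because it is not followed by a word, which leaves it available for a later sentence-splitting step.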
Types
This package is fully typed with TypeScript.
It exports no additional types.
Compatibility
Projects maintained by me are compatible with maintained versions of Node.js.
When I cut a new major release, I drop support for unmaintained versions of
Node.
This means I try to keep the current release line, parse-latin@^7, compatible
with Node.js 16.
Security
This package is safe.
Contribute
Yes please!
See How to Contribute to Open Source.
License
MIT © Titus Wormer