What is parse-latin?
The parse-latin npm package is a JavaScript library used to parse Latin-script natural language into a syntax tree. It is particularly useful for text processing tasks such as tokenization, sentence splitting, and word segmentation.
What are parse-latin's main functionalities?
Tokenization
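Split text into individual tokens (words, white space, punctuation, and so on). The package is ESM only and documents a single method, parse, so this sample parses a sentence and reads the tokens from the resulting sentence node.
import {ParseLatin} from 'parse-latin'
const parser = new ParseLatin()
// Root > paragraph > sentence; the sentence's children are the tokens.
const tree = parser.parse('This is a sentence.')
const tokens = tree.children[0].children[0].children
console.log(tokens.map((node) => node.type))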
Sentence Splitting
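Split a paragraph into individual sentences. This sample parses a paragraph, keeps its SentenceNode children, and uses each node's position offsets to slice the sentence text out of the input.
import {ParseLatin} from 'parse-latin'
const text = 'This is a sentence. This is another sentence.'
const paragraph = new ParseLatin().parse(text).children[0]
// White space between sentences is a sibling node, so filter by type.
const sentences = paragraph.children.filter((node) => node.type === 'SentenceNode')
console.log(sentences.map((s) => text.slice(s.position.start.offset, s.position.end.offset)))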
Word Segmentation
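Segment a sentence into individual words. This sample parses a sentence, keeps its WordNode children, and uses each node's position offsets to slice the word out of the input.
import {ParseLatin} from 'parse-latin'
const text = 'This is a sentence.'
const sentence = new ParseLatin().parse(text).children[0].children[0]
// Keep only the words; white space and punctuation are separate nodes.
const words = sentence.children.filter((node) => node.type === 'WordNode')
console.log(words.map((w) => text.slice(w.position.start.offset, w.position.end.offset)))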
Other packages similar to parse-latin
compromise
Compromise is a natural language processing library for JavaScript that provides a wide range of text processing functionalities, including tokenization, part-of-speech tagging, and named entity recognition. Compared to parse-latin, Compromise offers more advanced NLP features and is more versatile.
natural
Natural is a general natural language processing library for JavaScript. It includes functionalities such as tokenization, stemming, classification, and phonetics. Natural is more feature-rich compared to parse-latin and is suitable for a wide range of NLP tasks.
parse-latin
A natural language parser, for Latin-script languages, that produces nlcst.
What is this?
This package exposes a parser that takes Latin-script natural language and
produces a syntax tree.
When should I use this?
If you want to handle natural language as syntax trees manually, use this.
Alternatively, you can use the retext plugin retext-latin, which wraps this
project to also parse natural language at a higher-level (easier) abstraction.
Whether it is Old English (“þā gewearþ þǣm hlāforde and þǣm hȳrigmannum wiþ ānum
penninge”), Icelandic (“Hvað er að frétta”), or French (“Où sont les toilettes?”),
this project does a good job of tokenizing it.
For English and Dutch, you can instead use parse-english and parse-dutch.
You can somewhat use this for Latin-like scripts, such as Cyrillic
(“Добро пожаловать!”), Georgian (“როგორა ხარ?”), Armenian (“Շատ հաճելի է”),
and such.
Install
This package is ESM only.
In Node.js (version 14.14+ and 16.0+), install with npm:
npm install parse-latin
In Deno with esm.sh:
import {ParseLatin} from 'https://esm.sh/parse-latin@6'
In browsers with esm.sh:
<script type="module">
import {ParseLatin} from 'https://esm.sh/parse-latin@6?bundle'
</script>
Use
import {inspect} from 'unist-util-inspect'
import {ParseLatin} from 'parse-latin'
const tree = new ParseLatin().parse('A simple sentence.')
console.log(inspect(tree))
Yields:
RootNode[1] (1:1-1:19, 0-18)
└─0 ParagraphNode[1] (1:1-1:19, 0-18)
    └─0 SentenceNode[6] (1:1-1:19, 0-18)
        ├─0 WordNode[1] (1:1-1:2, 0-1)
        │   └─0 TextNode "A" (1:1-1:2, 0-1)
        ├─1 WhiteSpaceNode " " (1:2-1:3, 1-2)
        ├─2 WordNode[1] (1:3-1:9, 2-8)
        │   └─0 TextNode "simple" (1:3-1:9, 2-8)
        ├─3 WhiteSpaceNode " " (1:9-1:10, 8-9)
        ├─4 WordNode[1] (1:10-1:18, 9-17)
        │   └─0 TextNode "sentence" (1:10-1:18, 9-17)
        └─5 PunctuationNode "." (1:18-1:19, 17-18)
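Once you have such a tree, handling it manually mostly means walking `children` arrays. The sketch below is one way to do that, under a few assumptions: the tree is written out by hand to mirror the output above (positions omitted) so the example runs without installing anything, and `textOf` and `collect` are hypothetical helper names, not part of this package.

```javascript
// A hand-written tree mirroring the shape of the output above (positions
// omitted for brevity); in practice it would come from `parse()`.
const tree = {
  type: 'RootNode',
  children: [
    {
      type: 'ParagraphNode',
      children: [
        {
          type: 'SentenceNode',
          children: [
            {type: 'WordNode', children: [{type: 'TextNode', value: 'A'}]},
            {type: 'WhiteSpaceNode', value: ' '},
            {type: 'WordNode', children: [{type: 'TextNode', value: 'simple'}]},
            {type: 'WhiteSpaceNode', value: ' '},
            {type: 'WordNode', children: [{type: 'TextNode', value: 'sentence'}]},
            {type: 'PunctuationNode', value: '.'}
          ]
        }
      ]
    }
  ]
}

// Hypothetical helper: concatenate the text of a node and its descendants.
function textOf(node) {
  return 'children' in node ? node.children.map(textOf).join('') : node.value
}

// Hypothetical helper: collect all nodes of a given type, depth first.
function collect(node, type, found = []) {
  if (node.type === type) found.push(node)
  if ('children' in node) {
    for (const child of node.children) collect(child, type, found)
  }
  return found
}

console.log(collect(tree, 'WordNode').map(textOf)) // the words: A, simple, sentence
console.log(textOf(tree)) // the original text: A simple sentence.
```

The same two helpers should work on a tree returned by `parse()`, since they rely only on the `type`, `children`, and `value` fields visible in the output above.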
API
This package exports the identifier ParseLatin.
There is no default export.
ParseLatin()
Create a new parser.
ParseLatin#parse(value)
Turn natural language into a syntax tree.
Parameters
value (string, optional): value to parse
Returns
Tree (RootNode).
Algorithm
👉 Note: The easiest way to see how parse-latin parses is by using the
online parser demo, which shows the syntax tree corresponding to
the typed text.
parse-latin splits text into white space, punctuation, symbol, and word
tokens:
- “word” is one or more Unicode letters or numbers
- “white space” is one or more Unicode white space characters
- “punctuation” is one or more Unicode punctuation characters
- “symbol” is one or more of anything else
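The four classes above can be approximated with Unicode property escapes in a regular expression. This is a rough sketch for illustration, not the code parse-latin actually uses: the character classes here are an assumption (for example, combining marks end up as “symbol”), and `classify` is a hypothetical helper name.

```javascript
// Illustrative only: approximates the four token classes listed above.
// Groups, in order: word, white space, punctuation, symbol (anything else).
const tokenPattern = /([\p{L}\p{N}]+)|(\s+)|(\p{P}+)|([^\p{L}\p{N}\s\p{P}]+)/gu

// Hypothetical helper: split text into {type, value} tokens.
function classify(text) {
  const tokens = []
  for (const match of text.matchAll(tokenPattern)) {
    const [value, word, whiteSpace, punctuation] = match
    let type = 'symbol'
    if (word !== undefined) type = 'word'
    else if (whiteSpace !== undefined) type = 'white space'
    else if (punctuation !== undefined) type = 'punctuation'
    tokens.push({type, value})
  }
  return tokens
}

console.log(classify('Total: 5 + 3.'))
```

Here “Total”, “5”, and “3” come out as words, “:” and “.” as punctuation, and “+” as a symbol, because Unicode classifies it as a math symbol and not as punctuation.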
Then, it manipulates and merges those tokens into a syntax tree, adding
sentences and paragraphs where needed.
- some punctuation marks are part of the word they occur in, such as
“non-profit”, “she’s”, “G.I.”, “11:00”, “N/A”, “&c”, “nineteenth- and…”
- some periods do not mark a sentence end, such as “1.”, “e.g.”, “id.”
- although periods, question marks, and exclamation marks (sometimes) end a
sentence, that end might not occur directly after the mark, such as “.)”, “."”
- …and many more exceptions
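To make the first exception concrete, here is a simplified sketch of one merge rule: a punctuation mark that sits directly between two word tokens is folded into a single word. It is an illustration under stated assumptions, not parse-latin's implementation: the set of joining marks is assumed, `mergeInnerWordPunctuation` is a hypothetical helper name, and cases such as “G.I.” or “&c” are not handled.

```javascript
// Illustrative only: one simplified merge rule, not parse-latin's actual code.
// Input tokens use the four classes described earlier in this section.
const tokens = [
  {type: 'word', value: 'non'},
  {type: 'punctuation', value: '-'},
  {type: 'word', value: 'profit'},
  {type: 'white space', value: ' '},
  {type: 'word', value: 'work'},
  {type: 'punctuation', value: '.'}
]

// Assumed set of marks that may join two word tokens.
const innerWordMarks = new Set(['-', '’', "'", ':', '/'])

// Hypothetical helper: merge `word mark word` runs into one word token.
function mergeInnerWordPunctuation(input) {
  const output = []
  let index = 0
  while (index < input.length) {
    const token = input[index]
    const previous = output[output.length - 1]
    const next = input[index + 1]
    const joins =
      token.type === 'punctuation' &&
      innerWordMarks.has(token.value) &&
      previous !== undefined &&
      previous.type === 'word' &&
      next !== undefined &&
      next.type === 'word'
    if (joins) {
      // Fold the mark and the following word into the previous word.
      previous.value += token.value + next.value
      index += 2
    } else {
      // Copy the token so the input array is left untouched.
      output.push({...token})
      index += 1
    }
  }
  return output
}

console.log(mergeInnerWordPunctuation(tokens).map((token) => token.value))
```

The hyphen in “non-profit” is absorbed into the word, while the final period stays a separate punctuation token because no word follows it.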
Types
This package is fully typed with TypeScript.
It exports no additional types.
Compatibility
This package is at least compatible with all maintained versions of Node.js.
As of now, that is Node.js 14.14+ and 16.0+.
It also works in Deno and modern browsers.
Related
Contribute
Yes please!
See How to Contribute to Open Source.
Security
This package is safe.
License
MIT © Titus Wormer