What is parse-latin?
The parse-latin npm package is a JavaScript library used to parse Latin-script natural language into a syntax tree. It is particularly useful for text processing tasks such as tokenization, sentence splitting, and word segmentation.
What are parse-latin's main functionalities?
Tokenization
This feature allows you to tokenize a given text into individual tokens (words, punctuation, etc.). The code sample demonstrates how to tokenize a simple sentence.
import {ParseLatin} from 'parse-latin';
const parser = new ParseLatin();
// Returns a flat list of word, white space, and punctuation nodes.
const tokens = parser.tokenize('This is a sentence.');
console.log(tokens);
Sentence Splitting
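This feature enables you to split a paragraph into individual sentences. The code sample parses the text into a tree and keeps the sentence nodes of the first paragraph.
import {ParseLatin} from 'parse-latin';
const parser = new ParseLatin();
const tree = parser.parse('This is a sentence. This is another sentence.');
// The root holds paragraphs; each paragraph holds sentences and white space.
const sentences = tree.children[0].children.filter((node) => node.type === 'SentenceNode');
console.log(sentences);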
Word Segmentation
This feature allows you to segment a sentence into individual words. The code sample tokenizes a sentence and keeps only the word nodes.
import {ParseLatin} from 'parse-latin';
const parser = new ParseLatin();
// Keep only the word nodes from the flat token list.
const words = parser.tokenize('This is a sentence.').filter((node) => node.type === 'WordNode');
console.log(words);
Other packages similar to parse-latin
compromise
Compromise is a natural language processing library for JavaScript that provides a wide range of text processing functionalities, including tokenization, part-of-speech tagging, and named entity recognition. Compared to parse-latin, Compromise offers more advanced NLP features and is more versatile.
natural
Natural is a general natural language processing library for JavaScript. It includes functionalities such as tokenization, stemming, classification, and phonetics. Natural is more feature-rich compared to parse-latin and is suitable for a wide range of NLP tasks.
parse-latin
A Latin-script language parser for retext producing nlcst nodes.
Whether Old-English (“þā gewearþ þǣm hlāforde and þǣm hȳrigmannum wiþ ānum
penninge”), Icelandic (“Hvað er að frétta”), or French (“Où sont les
toilettes?”), parse-latin does a good job at tokenizing it.
Note also that parse-latin does a decent job at tokenizing Latin-like scripts,
such as Cyrillic (“Добро пожаловать!”), Georgian (“როგორა ხარ?”), and Armenian
(“Շատ հաճելի է”).
Install
This package is ESM only: Node 12+ is needed to use it and it must be imported
instead of required.
npm:
npm install parse-latin
Use
import {inspect} from 'unist-util-inspect'
import {ParseLatin} from 'parse-latin'
const tree = new ParseLatin().parse('A simple sentence.')
console.log(inspect(tree))
Which, when inspecting, yields:
RootNode[1] (1:1-1:19, 0-18)
└─0 ParagraphNode[1] (1:1-1:19, 0-18)
    └─0 SentenceNode[6] (1:1-1:19, 0-18)
        ├─0 WordNode[1] (1:1-1:2, 0-1)
        │   └─0 TextNode "A" (1:1-1:2, 0-1)
        ├─1 WhiteSpaceNode " " (1:2-1:3, 1-2)
        ├─2 WordNode[1] (1:3-1:9, 2-8)
        │   └─0 TextNode "simple" (1:3-1:9, 2-8)
        ├─3 WhiteSpaceNode " " (1:9-1:10, 8-9)
        ├─4 WordNode[1] (1:10-1:18, 9-17)
        │   └─0 TextNode "sentence" (1:10-1:18, 9-17)
        └─5 PunctuationNode "." (1:18-1:19, 17-18)
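To illustrate how such a tree can be consumed, here is a minimal sketch that collects the words from it. The tree is written out by hand (positions omitted) so the snippet runs without any dependency; in practice it would come from new ParseLatin().parse(). The helpers textOf and collect are hypothetical names for this example, not part of parse-latin.

```javascript
// Hand-written stand-in for the tree printed above (positions omitted).
const tree = {
  type: 'RootNode',
  children: [
    {
      type: 'ParagraphNode',
      children: [
        {
          type: 'SentenceNode',
          children: [
            {type: 'WordNode', children: [{type: 'TextNode', value: 'A'}]},
            {type: 'WhiteSpaceNode', value: ' '},
            {type: 'WordNode', children: [{type: 'TextNode', value: 'simple'}]},
            {type: 'WhiteSpaceNode', value: ' '},
            {type: 'WordNode', children: [{type: 'TextNode', value: 'sentence'}]},
            {type: 'PunctuationNode', value: '.'}
          ]
        }
      ]
    }
  ]
}

// Hypothetical helper: concatenate the text of a node and its descendants.
function textOf(node) {
  if (typeof node.value === 'string') return node.value
  return (node.children || []).map(textOf).join('')
}

// Hypothetical helper: collect the text of every node of the given type.
function collect(node, type, found = []) {
  if (node.type === type) {
    found.push(textOf(node))
  } else {
    for (const child of node.children || []) collect(child, type, found)
  }
  return found
}

const words = collect(tree, 'WordNode')
console.log(words) // [ 'A', 'simple', 'sentence' ]
console.log(textOf(tree)) // A simple sentence.
```

Because every leaf carries its own value, joining the leaves in order reproduces the original input exactly, white space and punctuation included.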
API
This package exports the following identifiers: ParseLatin.
There is no default export.
ParseLatin(value)
Exposes the functionality needed to tokenize natural Latin-script languages into
a syntax tree.
If value is passed here, it does not need to be passed to #parse() again.
ParseLatin#tokenize(value)
Tokenize value (string) into letters and numbers (words), white space, and
everything else (punctuation).
The returned nodes are a flat list without paragraphs or sentences.
Returns
Array.<Node>: the tokenized nodes.
ParseLatin#parse(value)
Tokenize value (string) into an NLCST tree.
The returned node is a RootNode containing paragraphs and sentences.
Returns
Node: the root node.
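As a rough sketch of working with that nesting, the snippet below pulls the sentences out of a root node as plain strings. It assumes the layout described above (root, then paragraphs, then sentences, with white space between sentences as a sibling node). The tree is built by hand so the snippet runs on its own; word, space, punct, sentence, and textOf are hypothetical helpers for this example only.

```javascript
// Hypothetical builders for a small hand-written tree (positions omitted).
const word = (value) => ({type: 'WordNode', children: [{type: 'TextNode', value}]})
const space = (value = ' ') => ({type: 'WhiteSpaceNode', value})
const punct = (value) => ({type: 'PunctuationNode', value})
const sentence = (...children) => ({type: 'SentenceNode', children})

// Stand-in for what parsing 'One sentence. Another one.' is assumed to return.
const tree = {
  type: 'RootNode',
  children: [
    {
      type: 'ParagraphNode',
      children: [
        sentence(word('One'), space(), word('sentence'), punct('.')),
        space(),
        sentence(word('Another'), space(), word('one'), punct('.'))
      ]
    }
  ]
}

// Hypothetical helper: concatenate the text of a node and its descendants.
function textOf(node) {
  if (typeof node.value === 'string') return node.value
  return (node.children || []).map(textOf).join('')
}

// Walk root -> paragraphs -> sentences and serialize each sentence.
const sentences = tree.children
  .filter((node) => node.type === 'ParagraphNode')
  .flatMap((paragraph) => paragraph.children)
  .filter((node) => node.type === 'SentenceNode')
  .map(textOf)

console.log(sentences) // [ 'One sentence.', 'Another one.' ]
```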
Algorithm
Note: The easiest way to see how parse-latin tokenizes and parses is by
using the online parser demo, which
shows the syntax tree corresponding to the typed text.
parse-latin splits text into white space, word, and punctuation tokens.
parse-latin starts out with a pretty easy definition, one that most other
tokenizers use:
- A “word” is one or more letter or number characters
- A “white space” is one or more white space characters
- A “punctuation” is one or more of anything else
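The three definitions above can be sketched with a single regular expression. This is a simplified illustration, not parse-latin's own code: the name naiveTokenize and the token shape are assumptions for the example, and "letter or number" is approximated with the Unicode properties \p{L} and \p{N}.

```javascript
// One alternative per definition: word, white space, anything else.
const pattern = /([\p{L}\p{N}]+)|(\s+)|([^\p{L}\p{N}\s]+)/gu

// Hypothetical helper: split a string into a flat list of typed tokens.
function naiveTokenize(value) {
  const tokens = []
  for (const match of value.matchAll(pattern)) {
    let type = 'punctuation'
    if (match[1] !== undefined) type = 'word'
    else if (match[2] !== undefined) type = 'whiteSpace'
    tokens.push({type, value: match[0]})
  }
  return tokens
}

console.log(naiveTokenize('She’s here.'))
```

With this naive split, the apostrophe in “She’s” becomes a punctuation token of its own between “She” and “s”, which is the kind of case a later merging step has to repair.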
Then, it manipulates and merges those tokens into a (nlcst) syntax tree,
adding sentences and paragraphs where needed.
- Some punctuation marks are part of the word they occur in, such as
“non-profit”, “she’s”, “G.I.”, “11:00”, “N/A”, “&c”, “nineteenth- and…”
- Some full-stops do not mark a sentence end, such as “1.”, “e.g.”, “id.”
- Although full-stops, question marks, and exclamation marks (sometimes) end a
sentence, that end might not occur directly after the mark, such as “.)”, “."”
- And many more exceptions
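To make the first kind of exception concrete, here is a minimal sketch of one possible merging pass: when a single inner-word mark sits directly between two words, the three tokens are folded into one word. The set of marks, the token shape, and the name mergeInnerWordPunctuation are assumptions for illustration; parse-latin's actual rules are more extensive and are not reproduced here.

```javascript
// Assumed set of marks that may join two words; illustrative only.
const innerWordMarks = new Set(['-', '’', "'", '/', ':'])

// Hypothetical helper: fold word + mark + word into a single word token.
function mergeInnerWordPunctuation(tokens) {
  const result = []
  let index = 0
  while (index < tokens.length) {
    const token = tokens[index]
    const previous = result[result.length - 1]
    const next = tokens[index + 1]
    if (
      token.type === 'punctuation' &&
      innerWordMarks.has(token.value) &&
      previous !== undefined &&
      previous.type === 'word' &&
      next !== undefined &&
      next.type === 'word'
    ) {
      // Append the mark and the following word to the previous word.
      previous.value += token.value + next.value
      index += 2
    } else {
      // Copy the token so the input list is left untouched.
      result.push({...token})
      index++
    }
  }
  return result
}

const tokens = [
  {type: 'word', value: 'non'},
  {type: 'punctuation', value: '-'},
  {type: 'word', value: 'profit'},
  {type: 'punctuation', value: '.'}
]

console.log(mergeInnerWordPunctuation(tokens))
```

Here “non”, “-”, and “profit” end up as the single word “non-profit”, while the final full-stop stays a punctuation token because no word follows it.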
License
MIT © Titus Wormer