Security News
JavaScript Leaders Demand Oracle Release the JavaScript Trademark
In an open letter, JavaScript community leaders urge Oracle to give up the JavaScript trademark, arguing that it has been effectively abandoned through nonuse.
parse-latin
The parse-latin npm package is a JavaScript library used to parse Latin-script natural language into a syntax tree. It is particularly useful for text processing tasks such as tokenization, sentence splitting, and word segmentation.
Tokenization
Tokenize text into a flat list of word, white space, and punctuation nodes. The sample tokenizes a simple sentence; because the package is ESM only, it uses import rather than require.
import {ParseLatin} from 'parse-latin';
const parser = new ParseLatin();
const tokens = parser.tokenize('This is a sentence.');
console.log(tokens);
Sentence Splitting
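Split a paragraph into individual sentences. The API documented below covers only tokenize() and parse(), so this sample parses the text and collects the SentenceNode children of each paragraph.
import {ParseLatin} from 'parse-latin';
const parser = new ParseLatin();
// parse() returns a RootNode; sentences sit inside its ParagraphNode children.
const tree = parser.parse('This is a sentence. This is another sentence.');
const sentences = tree.children.flatMap((node) => node.children || []).filter((node) => node.type === 'SentenceNode');
console.log(sentences);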
Word Segmentation
Segment a sentence into individual words. The API documented below covers only tokenize() and parse(), so this sample tokenizes the text and keeps the word nodes.
import {ParseLatin} from 'parse-latin';
const parser = new ParseLatin();
// tokenize() returns a flat list; keep only the WordNode entries.
const words = parser.tokenize('This is a sentence.').filter((node) => node.type === 'WordNode');
console.log(words);
Compromise is a natural language processing library for JavaScript that provides tokenization, part-of-speech tagging, and named entity recognition. Compared to parse-latin, Compromise offers more advanced NLP features and is more versatile.
Natural is a general natural language processing library for JavaScript. It includes tokenization, stemming, classification, and phonetics. Natural is more feature-rich than parse-latin and suits a wide range of NLP tasks.
A Latin-script language parser for retext producing nlcst nodes.
Whether Old-English (“þā gewearþ þǣm hlāforde and þǣm hȳrigmannum wiþ ānum penninge”), Icelandic (“Hvað er að frétta”), or French (“Où sont les toilettes?”), parse-latin does a good job at tokenizing it.
Note also that parse-latin does a decent job at tokenizing Latin-like scripts: Cyrillic (“Добро пожаловать!”), Georgian (“როგორა ხარ?”), Armenian (“Շատ հաճելի է”), and such.
This package is ESM only: Node 12+ is needed to use it and it must be imported instead of required.
npm:
npm install parse-latin
import {inspect} from 'unist-util-inspect'
import {ParseLatin} from 'parse-latin'
const tree = new ParseLatin().parse('A simple sentence.')
console.log(inspect(tree))
Which, when inspecting, yields:
RootNode[1] (1:1-1:19, 0-18)
└─0 ParagraphNode[1] (1:1-1:19, 0-18)
    └─0 SentenceNode[6] (1:1-1:19, 0-18)
        ├─0 WordNode[1] (1:1-1:2, 0-1)
        │   └─0 TextNode "A" (1:1-1:2, 0-1)
        ├─1 WhiteSpaceNode " " (1:2-1:3, 1-2)
        ├─2 WordNode[1] (1:3-1:9, 2-8)
        │   └─0 TextNode "simple" (1:3-1:9, 2-8)
        ├─3 WhiteSpaceNode " " (1:9-1:10, 8-9)
        ├─4 WordNode[1] (1:10-1:18, 9-17)
        │   └─0 TextNode "sentence" (1:10-1:18, 9-17)
        └─5 PunctuationNode "." (1:18-1:19, 17-18)
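Once a tree like this is available, pulling out the words or the plain text is a short recursive walk. The following is a minimal sketch, not part of parse-latin: the tree literal is hand-written to mirror the inspected output (positions omitted), and the helper names toText and collect are hypothetical.

```javascript
// Hand-written tree mirroring the inspected output (positions omitted).
const tree = {
  type: 'RootNode',
  children: [
    {
      type: 'ParagraphNode',
      children: [
        {
          type: 'SentenceNode',
          children: [
            {type: 'WordNode', children: [{type: 'TextNode', value: 'A'}]},
            {type: 'WhiteSpaceNode', value: ' '},
            {type: 'WordNode', children: [{type: 'TextNode', value: 'simple'}]},
            {type: 'WhiteSpaceNode', value: ' '},
            {type: 'WordNode', children: [{type: 'TextNode', value: 'sentence'}]},
            {type: 'PunctuationNode', value: '.'}
          ]
        }
      ]
    }
  ]
}

// Hypothetical helper: concatenate the text of a node and its descendants.
function toText(node) {
  return 'children' in node ? node.children.map(toText).join('') : node.value
}

// Hypothetical helper: collect all nodes of a given type, in document order.
function collect(node, type, found = []) {
  if (node.type === type) found.push(node)
  for (const child of node.children || []) collect(child, type, found)
  return found
}

const words = collect(tree, 'WordNode').map(toText)
console.log(words) // [ 'A', 'simple', 'sentence' ]
console.log(toText(tree)) // A simple sentence.
```

The same helpers should work on a tree returned by parse(), assuming it has the node shapes shown in the inspected output.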
This package exports the following identifier: ParseLatin.
There is no default export.
ParseLatin(value)
Exposes the functionality needed to tokenize natural Latin-script languages into a syntax tree.
If value is passed here, it does not need to be passed to #parse().
ParseLatin#tokenize(value)
Tokenize value (string) into letters and numbers (words), white space, and everything else (punctuation).
The returned nodes are a flat list without paragraphs or sentences.
Returns: Array.<Node>, the nodes.
ParseLatin#parse(value)
Tokenize value (string) into an nlcst tree.
The returned node is a RootNode containing paragraphs and sentences.
Returns: Node, the root node.
Note: The easiest way to see how parse-latin tokenizes and parses is to use the online parser demo, which shows the syntax tree corresponding to the typed text.
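parse-latin splits text into white space, word, and punctuation tokens. It starts out with a pretty easy definition, one that most other tokenizers use: a word is one or more letters or numbers, white space is one or more white space characters, and punctuation is anything else.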
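The basic three-way split (letters and numbers, white space, everything else) can be sketched with a single regular expression. This is a simplified illustration and not parse-latin's actual implementation: the function name naiveTokenize, the kind labels, and the choice to count combining marks as word characters are all assumptions made for the example.

```javascript
// Simplified three-way split. NOT parse-latin's implementation.
// Word characters here are letters, combining marks, and numbers (an assumption).
const tokenPattern = /[\p{L}\p{M}\p{N}]+|\s+|[^\p{L}\p{M}\p{N}\s]+/gu

// Hypothetical helper: return a flat list of {kind, value} tokens.
function naiveTokenize(value) {
  return Array.from(value.matchAll(tokenPattern), (match) => {
    const text = match[0]
    let kind = 'punctuation'
    if (/^[\p{L}\p{M}\p{N}]/u.test(text)) kind = 'word'
    else if (/^\s/.test(text)) kind = 'whiteSpace'
    return {kind, value: text}
  })
}

const tokens = naiveTokenize('A simple sentence.')
console.log(tokens.map((token) => token.kind))
// [ 'word', 'whiteSpace', 'word', 'whiteSpace', 'word', 'punctuation' ]
```

A split this naive breaks a string such as "11:00" into three tokens, which is why a later merging step is needed.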
Then, it manipulates and merges those tokens into a (nlcst) syntax tree, adding sentences and paragraphs where needed.
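Some punctuation marks are part of the word they occur in, such as “non-profit”, “she’s”, “G.I.”, “11:00”, “N/A”, “&c”, and “nineteenth- and…”.
Some periods do not mark the end of a sentence, such as “1.”, “e.g.”, and “id.”.
Although periods, question marks, and exclamation marks (sometimes) end a sentence, that end might not occur directly after the mark, such as “.)” and “."”.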
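One of these merging rules, punctuation that belongs inside a word, can be illustrated with a small pass over a flat token list. This is a sketch under stated assumptions, not parse-latin's algorithm: tokens are plain {kind, value} objects invented for the illustration, and the set of word-internal marks and the function name mergeInnerWordPunctuation are hypothetical.

```javascript
// Marks that, in this sketch, glue two words together (hypothetical set).
const innerWordMarks = new Set(['-', '’', "'", ':', '/'])

// Hypothetical helper: merge word + mark + word into a single word token.
function mergeInnerWordPunctuation(tokens) {
  const result = []
  for (let index = 0; index < tokens.length; index++) {
    const token = tokens[index]
    const previous = result[result.length - 1]
    const next = tokens[index + 1]
    if (
      token.kind === 'punctuation' &&
      innerWordMarks.has(token.value) &&
      previous && previous.kind === 'word' &&
      next && next.kind === 'word'
    ) {
      // Fold the mark and the following word into the previous word.
      previous.value += token.value + next.value
      index++ // Skip the word that was just merged.
    } else {
      result.push({...token}) // Copy so the input list is left untouched.
    }
  }
  return result
}

const merged = mergeInnerWordPunctuation([
  {kind: 'word', value: 'non'},
  {kind: 'punctuation', value: '-'},
  {kind: 'word', value: 'profit'},
  {kind: 'whiteSpace', value: ' '},
  {kind: 'word', value: '11'},
  {kind: 'punctuation', value: ':'},
  {kind: 'word', value: '00'},
  {kind: 'punctuation', value: '.'}
])
console.log(merged.map((token) => token.value))
// [ 'non-profit', ' ', '11:00', '.' ]
```

The trailing period is left alone because no word follows it; cases such as “G.I.” or “e.g.” need further rules that this sketch does not attempt.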
FAQs
Latin-script (natural language) parser
The npm package parse-latin receives 347,827 weekly downloads, so its popularity is classified as popular.
parse-latin's release cadence and project activity were rated as not healthy because the last version was released a year ago. One open source maintainer works on the project.