parse-latin

Latin-script (natural language) parser

Version: 0.1.0-rc.4

Source: npm

Version published: 10 years ago

Weekly downloads: 468K; increased by 2.2%

Maintainers: 1

Created: 10 years ago

What is parse-latin?

The parse-latin npm package is a JavaScript library used to parse Latin-script natural language into a syntax tree. It is particularly useful for text processing tasks such as tokenization, sentence splitting, and word segmentation.

What are parse-latin's main functionalities?

Tokenization

This feature allows you to tokenize a given text into individual tokens (words, punctuation, etc.). The code sample demonstrates how to tokenize a simple sentence.

const ParseLatin = require('parse-latin');
const parser = new ParseLatin();
// tokenizeRoot is the documented entry point; it returns a RootNode tree.
const tree = parser.tokenizeRoot('This is a sentence.');
console.log(tree);

Sentence Splitting

This feature enables you to split a paragraph into individual sentences. The code sample shows how to split a paragraph into separate sentences.

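const ParseLatin = require('parse-latin');
const parser = new ParseLatin();
const paragraph = parser.tokenizeParagraph('This is a sentence. This is another sentence.');
// The sentences are the ParagraphNode's children of type SentenceNode.
const sentences = paragraph.children.filter(node => node.type === 'SentenceNode');
console.log(sentences);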

Word Segmentation

This feature allows you to segment a sentence into individual words. The code sample demonstrates how to segment a sentence into words.

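const ParseLatin = require('parse-latin');
const parser = new ParseLatin();
const sentence = parser.tokenizeSentence('This is a sentence.');
// The words are the SentenceNode's children of type WordNode.
const words = sentence.children.filter(node => node.type === 'WordNode');
console.log(words);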

Other packages similar to parse-latin

parse-latin

See Browser Support for more information (a.k.a. don’t worry about those grey icons above).

parse-latin is a Latin-script language parser in JavaScript, for Node.js and the browser. It has lots of tests (330+) with 630+ assertions and 100% coverage.

Note: This project is not an object model for natural languages, nor an extensible system for analysing and manipulating natural language; it's an algorithm that transforms plain-text natural language into an AST. If you need either of those, use the following projects.

For a pluggable system for analysing and manipulating natural language, see retext.
For an object model, see TextOM.

Whether it is Old English (“þā gewearþ þǣm hlāforde and þǣm hȳrigmannum wiþ ānum penninge”), Icelandic (“Hvað er að frétta”), or French (“Où sont les toilettes?”), this parser does a pretty good job of tokenising it.

Note also that it seems to parse other scripts, such as Cyrillic (“Добро пожаловать!”), Georgian (“როგორა ხარ?”), Armenian (“Շատ հաճելի է”), pretty well!

Installation

NPM:

$ npm install parse-latin

Component.js:

$ component install wooorm/parse-latin

Usage

var Parser = require('parse-latin'),
    parser = new Parser(),
    root;

/* Simple sentence: */
parser.tokenizeRoot('A simple sentence.');
/*
 * ˅ Object
 *    ˃ children: Array[1]
 *      type: "RootNode"
 *    ˃ __proto__: Object
 */

/* Unicode filled sentence: */
parser.tokenizeRoot('The \xC5 symbol invented by A. J. A\u030Angstro\u0308m (1814, Lo\u0308gdo\u0308, \u2013 1874) denotes the length 10\u207B\xB9\u2070 m.');
/*
 * ˅ Object
 *    ˃ children: Array[1]
 *      type: "RootNode"
 *    ˃ __proto__: Object
 */

API

parseLatin.tokenizeRoot(source?)

var Parser = require('parse-latin');

new Parser().tokenizeRoot('A simple sentence.');
/*
 * Object
 * ├─ type: "RootNode"
 * └─ children: Array[1]
 *    └─ 0: Object
 *          ├─ type: "ParagraphNode"
 *          └─ children: Array[1]
 *             └─ 0: Object
 *                   ├─ type: "SentenceNode"
 *                   └─ children: Array[6]
 *                      | ...
 */

Tokenise a given document into paragraphs, sentences, words, white space, and punctuation.

source (null, undefined, or String): The Latin-script document to parse.
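
To get a feel for the tree that tokenizeRoot returns, the following sketch tallies node types depth-first. It runs on a hand-written object in the shape documented above rather than on real parser output, and leaf and countTypes are hypothetical helpers, not part of parse-latin.

```javascript
// Builds a node with a single TextNode child, as in the documented output.
// (Illustrative helper, not part of parse-latin.)
function leaf(type, value) {
  return { type: type, children: [{ type: 'TextNode', value: value }] };
}

// Hand-written stand-in for new Parser().tokenizeRoot('A simple sentence.').
const root = {
  type: 'RootNode',
  children: [{
    type: 'ParagraphNode',
    children: [{
      type: 'SentenceNode',
      children: [
        leaf('WordNode', 'A'),
        leaf('WhiteSpaceNode', ' '),
        leaf('WordNode', 'simple'),
        leaf('WhiteSpaceNode', ' '),
        leaf('WordNode', 'sentence'),
        leaf('PunctuationNode', '.')
      ]
    }]
  }]
};

// Hypothetical helper: counts how often each node type occurs in the tree.
function countTypes(node, counts) {
  counts = counts || {};
  counts[node.type] = (counts[node.type] || 0) + 1;
  (node.children || []).forEach(function (child) {
    countTypes(child, counts);
  });
  return counts;
}

const counts = countTypes(root);
console.log(counts);
```

With the real package installed, the same helper should accept the object returned by tokenizeRoot, assuming the node shape matches the output shown above.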

parseLatin.tokenizeParagraph(source?)

var Parser = require('parse-latin');

new Parser().tokenizeParagraph('A simple sentence.');
/*
 * Object
 * ├─ type: "ParagraphNode"
 * └─ children: Array[1]
 *    └─ 0: Object
 *          ├─ type: "SentenceNode"
 *          └─ children: Array[6]
 *             | ...
 */

Tokenise a given paragraph into sentences, words, white space, and punctuation.

source (null, undefined, or String): The Latin-script paragraph to parse.
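
A common next step is to recover the text of each sentence from a ParagraphNode. The sketch below does that on a hand-written two-sentence paragraph. The WhiteSpaceNode placed between the two sentences is an assumption about the parser's output, and leaf, sentence, textOf and sentenceTexts are hypothetical helpers.

```javascript
// Small constructors for nodes in the documented shape (illustrative helpers).
function leaf(type, value) {
  return { type: type, children: [{ type: 'TextNode', value: value }] };
}
function sentence(children) {
  return { type: 'SentenceNode', children: children };
}

// Hand-written stand-in for tokenizeParagraph('A simple sentence. Another one.').
// The WhiteSpaceNode between the two sentences is an assumption.
const paragraph = {
  type: 'ParagraphNode',
  children: [
    sentence([
      leaf('WordNode', 'A'), leaf('WhiteSpaceNode', ' '),
      leaf('WordNode', 'simple'), leaf('WhiteSpaceNode', ' '),
      leaf('WordNode', 'sentence'), leaf('PunctuationNode', '.')
    ]),
    leaf('WhiteSpaceNode', ' '),
    sentence([
      leaf('WordNode', 'Another'), leaf('WhiteSpaceNode', ' '),
      leaf('WordNode', 'one'), leaf('PunctuationNode', '.')
    ])
  ]
};

// Hypothetical helper: concatenates every TextNode value below a node.
function textOf(node) {
  if (node.type === 'TextNode') {
    return node.value;
  }
  return (node.children || []).map(textOf).join('');
}

// Hypothetical helper: one string per SentenceNode child of the paragraph.
function sentenceTexts(paragraphNode) {
  return paragraphNode.children
    .filter(function (child) { return child.type === 'SentenceNode'; })
    .map(textOf);
}

const texts = sentenceTexts(paragraph);
console.log(texts);
```

Because textOf concatenates every TextNode in order, calling it on the whole paragraph gives back the original string for this stand-in tree.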

parseLatin.tokenizeSentence(source?)

var Parser = require('parse-latin');

new Parser().tokenizeSentence('A simple sentence.');
/*
 * Object
 * ├─ type: "SentenceNode"
 * └─ children: Array[6]
 *    ├─ 0: Object
 *    |     ├─ type: "WordNode"
 *    |     └─ children: Array[1]
 *    |        └─ 0: Object
 *    |              ├─ type: "TextNode"
 *    |              └─ value: "A"
 *    ├─ 1: Object
 *    |     ├─ type: "WhiteSpaceNode"
 *    |     └─ children: Array[1]
 *    |        └─ 0: Object
 *    |              ├─ type: "TextNode"
 *    |              └─ value: " "
 *    ├─ 2: Object
 *    |     ├─ type: "WordNode"
 *    |     └─ children: Array[1]
 *    |        └─ 0: Object
 *    |              ├─ type: "TextNode"
 *    |              └─ value: "simple"
 *    ├─ 3: Object
 *    |     ├─ type: "WhiteSpaceNode"
 *    |     └─ children: Array[1]
 *    |        └─ 0: Object
 *    |              ├─ type: "TextNode"
 *    |              └─ value: " "
 *    ├─ 4: Object
 *    |     ├─ type: "WordNode"
 *    |     └─ children: Array[1]
 *    |        └─ 0: Object
 *    |              ├─ type: "TextNode"
 *    |              └─ value: "sentence"
 *    └─ 5: Object
 *          ├─ type: "PunctuationNode"
 *          └─ children: Array[1]
 *             └─ 0: Object
 *                   ├─ type: "TextNode"
 *                   └─ value: "."
 */

Tokenise a given sentence into words, white space, and punctuation.

source (null, undefined, or String): The Latin-script sentence to parse.
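
Given the SentenceNode shape shown above, pulling out just the words is a filter over the node's children. This is a minimal sketch on a hand-written copy of that output; leaf and wordsOf are hypothetical helpers, not part of parse-latin.

```javascript
// Builds a node with a single TextNode child, as in the documented output.
// (Illustrative helper, not part of parse-latin.)
function leaf(type, value) {
  return { type: type, children: [{ type: 'TextNode', value: value }] };
}

// Hand-written copy of the documented result of tokenizeSentence('A simple sentence.').
const sentenceNode = {
  type: 'SentenceNode',
  children: [
    leaf('WordNode', 'A'),
    leaf('WhiteSpaceNode', ' '),
    leaf('WordNode', 'simple'),
    leaf('WhiteSpaceNode', ' '),
    leaf('WordNode', 'sentence'),
    leaf('PunctuationNode', '.')
  ]
};

// Hypothetical helper: returns the text of every WordNode directly under a sentence,
// skipping white space and punctuation.
function wordsOf(node) {
  return node.children
    .filter(function (child) { return child.type === 'WordNode'; })
    .map(function (word) {
      return word.children.map(function (text) { return text.value; }).join('');
    });
}

const words = wordsOf(sentenceNode);
console.log(words);
```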

Browser Support

Pretty much every browser (available through BrowserStack) runs all parse-latin unit tests.

Benchmark

Run the benchmark yourself:

$ npm run benchmark

On a MacBook Air, it parses about 3 large books, 70 big articles, or 7,803 paragraphs per second.

              parser.tokenizeSentence(source);
  50,117 op/s » A sentence (20 words)

              parser.tokenizeParagraph(source);
  36,559 op/s » A sentence (20 words)
   8,067 op/s » A paragraph (5 sentences, 100 words)

              parser.tokenizeRoot(source);
   7,803 op/s » A paragraph (5 sentences, 100 words)
     764 op/s » A section (10 paragraphs, 50 sentences, 1,000 words)
      70 op/s » An article (100 paragraphs, 500 sentences, 10,000 words)
       3 op/s » A (large) book (1,000 paragraphs, 5,000 sentences, 100,000 words)
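
To compare the rows above on a common scale, each op/s figure can be multiplied by the word count of its input. The sketch below only does that arithmetic on the numbers listed; it does not run the parser, and the figure for the last row is coarse because 3 op/s is itself rounded.

```javascript
// Figures copied from the benchmark table above: operations per second
// and the number of words in each input.
const results = [
  { method: 'tokenizeSentence', opsPerSec: 50117, words: 20 },
  { method: 'tokenizeParagraph', opsPerSec: 36559, words: 20 },
  { method: 'tokenizeParagraph', opsPerSec: 8067, words: 100 },
  { method: 'tokenizeRoot', opsPerSec: 7803, words: 100 },
  { method: 'tokenizeRoot', opsPerSec: 764, words: 1000 },
  { method: 'tokenizeRoot', opsPerSec: 70, words: 10000 },
  { method: 'tokenizeRoot', opsPerSec: 3, words: 100000 }
];

// Words processed per second = operations per second * words per operation.
const wordsPerSec = results.map(function (row) {
  return row.opsPerSec * row.words;
});

results.forEach(function (row, i) {
  console.log(row.method + ' (' + row.words + ' words): ' + wordsPerSec[i] + ' words/s');
});
```

On these numbers, throughput stays in the range of roughly 700,000 to 1,000,000 words per second for inputs up to an article, and drops to roughly 300,000 for the book-sized input.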

License

MIT

Package last updated on 06 Jul 2014