cldr-segmentation
Text segmentation library for JavaScript.
This library provides CLDR-based text segmentation capabilities in JavaScript. Text segmentation is the process of identifying word, sentence, and other boundaries in a text. The segmentation rules are published by the Unicode consortium as part of the Common Locale Data Repository, or CLDR, and made freely available to the public.
Why not just split on whitespace and punctuation? Most of the time, that will probably work fine. However, it's not always obvious where words or sentences should start or end. Consider this sentence:
I like Mrs. Murphy. She's nice.
Splitting only on periods will give you ["I like Mrs. ", "Murphy. ", "She's nice."], which probably isn't what you wanted: the period after Mrs doesn't indicate the end of the sentence.
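The naive approach can be reproduced with a plain regular expression. This is only a sketch of period-based splitting for comparison; it does not use cldr-segmentation, and the variable names are illustrative.

```javascript
// Naive sentence splitting: take everything up to a period, plus trailing whitespace.
const text = "I like Mrs. Murphy. She's nice.";
const naiveSentences = text.match(/[^.]+\.\s*/g);

console.log(naiveSentences);
// => ["I like Mrs. ", "Murphy. ", "She's nice."]
```

The abbreviation "Mrs." is indistinguishable from a sentence ending when the period is the only signal.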
In addition, other languages use different segmentation rules than English. For example, identifying sentence boundaries in Japanese is a little more difficult because sentences tend to end with \u3002 (the ideographic full stop) rather than a period. The CLDR supports hundreds of languages, so you don't have to write rules for every language yourself when dealing with international text.
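As a small illustration of the Japanese case, the sketch below shows that splitting on an ASCII period finds no boundaries, while splitting on \u3002 does. It uses only built-in string methods, not cldr-segmentation, and the sample text is made up for the example.

```javascript
// Two Japanese sentences, each ending with the ideographic full stop (U+3002).
const japanese = "私は猫が好きです。犬も好きです。";

// Splitting on an ASCII period finds nothing to split on.
const byPeriod = japanese.split(".");

// Matching up to each ideographic full stop finds both sentences.
const byIdeographicStop = japanese.match(/[^\u3002]+\u3002/g);

console.log(byPeriod.length);          // 1
console.log(byIdeographicStop.length); // 2
```

Hand-written rules like this have to be repeated for every script, which is the work the CLDR rules are meant to save.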
Cldr-segmentation is published as a UMD module, meaning it should work in both Node via require and the browser via a <script> tag. In the browser, use window.cldrSegmentation to access the library's functionality.
cldrSegmentation.sentenceSplit("I like Mrs. Murphy. She's nice.");
// => ["I like Mrs. ", "Murphy. ", "She's nice."]
You'll notice that Mrs. was treated as the end of a sentence. To avoid this, use the suppressions for the language you care about. Suppressions are essentially arrays of strings. Each string represents a series of characters after which there should not be a break. Using the English suppressions for the example above yields better results:
var supp = cldrSegmentation.suppressions.en;
cldrSegmentation.sentenceSplit("I like Mrs. Murphy. She's nice.", supp);
// => ["I like Mrs. Murphy. ", "She's nice."]
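To make the idea of "a series of characters after which there should not be a break" concrete, here is a minimal sketch that applies a suppression list to already-split segments. The `mergeSuppressed` helper and the short abbreviation list are hypothetical and written for this illustration; they are not part of cldr-segmentation and do not reflect how the library implements suppressions internally.

```javascript
// Hypothetical helper (not part of cldr-segmentation): re-joins segments
// whose text ends with one of the suppressed strings.
function mergeSuppressed(segments, suppressions) {
  const merged = [];
  let pending = "";

  for (const segment of segments) {
    pending += segment;
    // Ignore trailing whitespace when checking for a suppressed ending.
    const trimmed = pending.trimEnd();
    const suppressed = suppressions.some((s) => trimmed.endsWith(s));
    if (!suppressed) {
      merged.push(pending);
      pending = "";
    }
  }

  // Keep any leftover text if the input ended on a suppressed string.
  if (pending !== "") merged.push(pending);
  return merged;
}

const naive = ["I like Mrs. ", "Murphy. ", "She's nice."];
const result = mergeSuppressed(naive, ["Mrs.", "Mr.", "Dr."]);

console.log(result);
// => ["I like Mrs. Murphy. ", "She's nice."]
```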
If you'd like to iterate over each sentence instead of splitting, use a BreakIterator:
var breakIter = new cldrSegmentation.BreakIterator(supp);
var str = "I like Mrs. Murphy, she's nice.";
breakIter.eachSentence(str, (sentence, start, stop) => {
  // do something
});
Suppressions for all languages are available via cldrSegmentation.suppressions.all.
Word, line, and grapheme cluster segmentation are supported:
cldrSegmentation.wordSplit("I like Mrs. Murphy. She's nice.");
// => ["I", " ", "like", " ", "Mrs", ".", " ", "Murphy", ".", " ", "She's", " ", "nice", "."]
Also available are the lineSplit and graphemeSplit functions.
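Grapheme clusters are worth a separate function because a single user-perceived character can span several code points, and a code point can span several UTF-16 code units. The sketch below uses only built-in JavaScript to show the mismatch; it does not call the library, and the sample strings are illustrative.

```javascript
// One visible character each, built from multiple code points.
const accented = "e\u0301";            // "e" + combining acute accent
const thumbsUp = "\u{1F44D}\u{1F3FD}"; // thumbs up + skin tone modifier

// String length counts UTF-16 code units.
const codeUnits = thumbsUp.length;

// Array.from iterates by code point, which is still not one per visible character.
const codePoints = Array.from(thumbsUp).length;

console.log(accented.length); // 2
console.log(codeUnits);       // 4
console.log(codePoints);      // 2
```

Neither count matches the single character a reader sees, which is the gap grapheme cluster segmentation is meant to close.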
When using a break iterator:
var breakIter = new cldrSegmentation.BreakIterator(supp);
var str = "I like Mrs. Murphy, she's nice.";
breakIter.eachWord(str, (word, start, stop) => {
  // do something
});
Also available are the eachLine and eachGraphemeCluster functions.
Suppressions are just objects with a single shouldBreak function that returns a boolean. The function is passed a cursor object positioned at the index of the proposed break. Cursors deal exclusively with Unicode codepoints, meaning your custom suppression logic will need to be implemented in those terms. For example, let's create a custom suppression function that doesn't allow breaks after sentences that end with the letter 't'.
class TeeSuppression {
  shouldBreak(cursor) {
    let position = cursor.logicalPosition;
    // declared outside the loop so the while condition can see it
    let cp;

    // skip backwards past spaces (32) and periods (46)
    do {
      cp = cursor.getCodePoint(position);
      position--;
    } while (cp === 32 || cp === 46);

    // we skipped one too many in the loop
    position++;

    // if the ending character is 't' (116), return false;
    // otherwise return true
    return cursor.getCodePoint(position) !== 116;
  }
}
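Suppression logic like this can be exercised without the library by passing in a stand-in cursor. In the sketch below, `makeCursor` is a hypothetical helper that mimics only the two members used above (logicalPosition and getCodePoint); the real cursor may behave differently. Exactly where the real cursor sits relative to trailing whitespace is not specified here, so the demo assumes it is on the space before the next sentence.

```javascript
// Hypothetical stand-in for the library's cursor; only mimics the two
// members the suppression uses.
function makeCursor(text, logicalPosition) {
  const codePoints = Array.from(text, (ch) => ch.codePointAt(0));
  return {
    logicalPosition,
    getCodePoint: (index) => codePoints[index],
  };
}

const SPACE = 32;
const PERIOD = 46;
const LOWER_T = 116;

const teeSuppression = {
  shouldBreak(cursor) {
    let position = cursor.logicalPosition;
    let cp;
    // Walk backwards past spaces and periods.
    do {
      cp = cursor.getCodePoint(position);
      position--;
    } while (cp === SPACE || cp === PERIOD);
    // Step forward again onto the last character of the sentence.
    position++;
    return cursor.getCodePoint(position) !== LOWER_T;
  },
};

// Index 10 is the space after "it." and index 12 is the space after "cats."
// (assumed cursor placement).
const endsWithT = teeSuppression.shouldBreak(makeCursor("I like it. She's nice.", 10));
const endsWithS = teeSuppression.shouldBreak(makeCursor("I like cats. She's nice.", 12));

console.log(endsWithT); // false
console.log(endsWithS); // true
```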
Note that you don't have to use ES6 classes. It's equally valid to create a simple object:
let teeSuppression = {
  shouldBreak: (cursor) => {
    // logic here
  }
};
Tests are written in Jasmine and can be executed via jasmine-node:
npm install -g jasmine-node
jasmine-node spec
Written and maintained by Cameron C. Dutro (@camertron).
Copyright 2017 Cameron Dutro, licensed under the MIT license.
2.0.2