
compromise-stats

plugin for nlp-compromise

Version: 0.1.0 (latest)

Version published: 3 years ago

Weekly downloads: 1.8K; decreased by 30.78%

Maintainers: 1

Created: 3 years ago

nlp statistics plugin for compromise

npm install compromise-stats

TFIDF

tf-idf is a type of word-analysis that can discover the most characteristic, or unique, words in a text. It combines how rare a word is in general with how often it appears in the document. This plugin comes pre-built with a standard English model, so you can fingerprint an arbitrary text with .tfidf()

.tfidf(opts, model?) - score the words in a document by tf-idf, using the built-in English model unless you pass your own

alternatively, you can build your own model, from a compromise document:

.buildIDF() - build a model from a compromise document

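import nlp from 'compromise'
import plg from 'compromise-stats'
nlp.extend(plg) // register the plugin
// shakespeareWords is your own reference text
let model = nlp(shakespeareWords).buildIDF()
let doc = nlp('thou art so sus.')
doc.tfidf({}, model)
// [ [ 'sus', 5.78 ], [ 'thou', 2.3 ], [ 'art', 1.75 ], [ 'so', 0.44 ] ]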
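
to make the idea concrete, here is a minimal sketch of textbook tf-idf in plain javascript. It is not the plugin's source code: the helper names (tokenize, countWords, buildModel, tfidf), the three reference sentences, and the fallback weight for words missing from the model are all assumptions made for illustration, and the plugin may scale its numbers differently.

```javascript
// split text into lowercase words (a crude tokenizer, for illustration only)
const tokenize = (text) => text.toLowerCase().match(/[a-z']+/g) || []

// count how many times each word appears in a list
const countWords = (words) => {
  const counts = new Map()
  words.forEach((w) => counts.set(w, (counts.get(w) || 0) + 1))
  return counts
}

// hypothetical model builder: idf = log(total texts / texts containing the word)
const buildModel = (docs) => {
  // each word is counted once per reference text
  const df = countWords(docs.flatMap((text) => [...new Set(tokenize(text))]))
  const idf = new Map()
  df.forEach((n, w) => idf.set(w, Math.log(docs.length / n)))
  return { idf, size: docs.length }
}

// score = (occurrences in this text) * (idf weight from the model)
const tfidf = (text, model) => {
  // assumption: words the model has never seen are treated as very rare
  const unknown = Math.log(model.size + 1)
  const scores = []
  countWords(tokenize(text)).forEach((count, w) => {
    const idf = model.idf.has(w) ? model.idf.get(w) : unknown
    scores.push([w, count * idf])
  })
  // highest score first
  return scores.sort((a, b) => b[1] - a[1])
}

// made-up reference texts standing in for a real corpus
const model = buildModel(['thou art a villain', 'thou art so kind', 'so it goes'])
const scores = tfidf('thou art so sus', model)
console.log(scores)
```

'sus' never appears in the reference texts, so it gets the highest weight and lands at the top of the list, which mirrors the ordering in the example above.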

if you want to combine tf-idf with other analysis, you can attach the score to each individual term, like this:

let doc = nlp('no, my son is also named Bort')
doc.compute('tfidf')
let json = doc.json()
json[0].terms[6]
// {"text":"Bort", "tags":[], "tfidf":5.78, ... }

TF-IDF values are scaled, but have an unbounded maximum. The result for 'foo foo foo foo' would increase with every repetition.
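
a rough sketch of why there is no ceiling, assuming the score is simply a word's count multiplied by a fixed weight. The weight 2.5 and the score helper are made up for this example; the plugin's exact scaling is not described here.

```javascript
// assumed idf weight for the word 'foo' (a made-up number)
const idf = 2.5

// score = number of occurrences * weight, with no upper bound
const score = (text, word) => text.split(/\s+/).filter((w) => w === word).length * idf

// repeat 'foo' 1, 2, 4 and 8 times
const results = [1, 2, 4, 8].map((n) => score(Array(n).fill('foo').join(' '), 'foo'))
console.log(results)
// [ 2.5, 5, 10, 20 ]
```

if you need a bounded value, one option is to divide every score in a document by that document's highest score, which maps the results into the 0 to 1 range.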

Ngrams

.ngrams({}) - list all repeating sub-phrases, by word-count
.unigrams() - n-grams with one word
.bigrams() - n-grams with two words
.trigrams() - n-grams with three words
.startgrams() - n-grams including the first term of a phrase
.endgrams() - n-grams including the last term of a phrase
.edgegrams() - n-grams including the first or last term of a phrase
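
the difference between the start, end and edge variants can be sketched on a plain array of words. This is only an illustration of the definitions above: startGrams, endGrams and edgeGrams are hypothetical helpers that return bare strings, not the plugin's methods or its result format.

```javascript
// every n-gram that begins at the first word
const startGrams = (words) => words.map((_, i) => words.slice(0, i + 1).join(' '))

// every n-gram that finishes at the last word
const endGrams = (words) => words.map((_, i) => words.slice(i).join(' '))

// n-grams touching either edge, with duplicates removed
const edgeGrams = (words) => [...new Set([...startGrams(words), ...endGrams(words)])]

const words = ['one', 'two', 'three']
console.log(startGrams(words)) // [ 'one', 'one two', 'one two three' ]
console.log(endGrams(words)) // [ 'one two three', 'two three', 'three' ]
console.log(edgeGrams(words)) // [ 'one', 'one two', 'one two three', 'two three', 'three' ]
```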

all methods support the same option params:

let doc = nlp('one two three. one two foo.')
doc.ngrams({ size: 2 }) // only two-word grams
/*[
  { size: 2, count: 2, normal: 'one two' },
  { size: 2, count: 1, normal: 'two three' },
  { size: 2, count: 1, normal: 'two foo' }
]
*/

or all gram-sizes under/over a limit:

let doc = nlp('one two three. one two foo.')
let res = doc.ngrams({ min: 3 }) // or max:2
/*[
  { size: 3, count: 1, normal: 'one two three' },
  { size: 3, count: 1, normal: 'one two foo' }
]
*/
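
for intuition, the counting behind the two examples above can be sketched without the library. This is a simplified stand-in, not the plugin's implementation: the sentences and ngrams helpers are hypothetical, the tokenizer is crude, and it assumes grams never cross a sentence boundary (which is what the example output suggests).

```javascript
// split into sentences, then into words (a crude tokenizer, for illustration only)
const sentences = (text) =>
  text
    .toLowerCase()
    .split(/[.?!]+/)
    .map((s) => s.trim().split(/\s+/).filter(Boolean))
    .filter((list) => list.length > 0)

// hypothetical stand-in for .ngrams(), supporting { size }, { min } and { max }
const ngrams = (text, opts = {}) => {
  const min = opts.size || opts.min || 1
  const max = opts.size || opts.max || Infinity
  const counts = new Map()
  sentences(text).forEach((list) => {
    for (let start = 0; start < list.length; start += 1) {
      // grow the gram one word at a time, staying inside the sentence
      for (let size = min; size <= max && start + size <= list.length; size += 1) {
        const normal = list.slice(start, start + size).join(' ')
        counts.set(normal, (counts.get(normal) || 0) + 1)
      }
    }
  })
  // most frequent grams first
  return [...counts.entries()]
    .map(([normal, count]) => ({ size: normal.split(' ').length, count, normal }))
    .sort((a, b) => b.count - a.count)
}

const pairs = ngrams('one two three. one two foo.', { size: 2 })
const triples = ngrams('one two three. one two foo.', { min: 3 })
console.log(pairs)
console.log(triples)
```

with the same input as the examples above, pairs holds 'one two' (count 2), 'two three' and 'two foo', and triples holds the two three-word sentences.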

MIT

10.1.0

fix return format of .isPlural(), so it acts like a match filter
less-greedy date tagging & ambiguous month fixes

v10

cleanup & rename some .value() methods
change lumping behaviour of lexicon terms with multiple words
keep more former tags after a term replace method
new .random() method
new .lessThan(), .greaterThan(), .equalTo() methods
new prefix/suffix/infix matches with _ffix syntax
tag() supports a sequence of tags for a sequence of terms
.match 'range' queries now use a real match - #Adverb{2,4}
new .before() and .after() match methods
removes .lexicon() method for many-lexicons concept
changes params of .replaceWith() method to a 'keyTags' boolean
improved .debug() and logging on client-side

Package last updated on 01 Jun 2022
