TFIDF
TF-IDF is a type of word analysis that finds the most characteristic, or unique, words in a text.
It combines how unique each word is with how frequently it appears in the document.
This plugin comes pre-built with a standard English model, so you can fingerprint an arbitrary text with .tfidf()
Alternatively, you can build your own model from a compromise document:
let model = nlp(shakespeareWords)
let doc = nlp('thou art so sus.')
doc.tfidf()
If you want to combine TF-IDF with other analysis, you can add the numbers to individual terms, like this:
let doc = nlp('no, my son is also named Bort')
doc.compute('tfidf')
let json = doc.json()
json[0].terms[6]
TF-IDF values are scaled, but have an unbounded maximum. The result for 'foo foo foo foo' would increase with every repetition.
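To make the unbounded maximum concrete, here is a minimal plain-JavaScript sketch of a textbook TF-IDF score. It is not the plugin's internal formula or scaling: the function name `tfidfSketch`, the `docFreq` table, and `totalDocs` are all hypothetical stand-ins for a trained model.

```javascript
// Illustrative only: a textbook tf-idf, not the plugin's own scaling.
// `docFreq` and `totalDocs` are made-up numbers standing in for a model.
function tfidfSketch(words, docFreq, totalDocs) {
  // term frequency: raw count of each word in this text
  const counts = new Map()
  for (const w of words) {
    counts.set(w, (counts.get(w) || 0) + 1)
  }
  // inverse document frequency: words that are rare in the model score higher
  const scores = []
  for (const [word, tf] of counts) {
    const df = docFreq[word] || 0
    const idf = Math.log((1 + totalDocs) / (1 + df))
    scores.push([word, tf * idf])
  }
  // most characteristic words first
  return scores.sort((a, b) => b[1] - a[1])
}

const docFreq = { the: 990, is: 950, foo: 3 } // hypothetical document counts
const totalDocs = 1000

const once = tfidfSketch(['the', 'foo', 'is'], docFreq, totalDocs)
const four = tfidfSketch(['foo', 'foo', 'foo', 'foo'], docFreq, totalDocs)
console.log(once[0][0], four[0][1] > once[0][1])
```

In this sketch the rare word 'foo' ranks first, and because the raw count multiplies the score, every extra repetition of 'foo' pushes its value higher with no ceiling.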
Ngrams
All methods support the same option params:
let doc = nlp('one two three. one two foo.')
doc.ngrams({ size: 2 })
Or get all gram sizes above or below a limit:
let doc = nlp('one two three. one two foo.')
let res = doc.ngrams({ min: 3 })
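For intuition about what the size, min, and max options select, here is a rough plain-JavaScript sketch of n-gram counting. It is not the plugin's implementation: `ngramSketch`, the result field names, the default upper limit of 4, and the choice to never let a gram cross a sentence boundary are all assumptions made for illustration.

```javascript
// Illustrative only: plain n-gram counting, not the plugin's implementation.
// Each sentence is an array of words; grams never cross sentences here.
function ngramSketch(sentences, opts = {}) {
  // `size` pins a single gram length; otherwise use the min/max range
  const min = opts.size || opts.min || 1
  const max = opts.size || opts.max || Math.max(min, 4) // hypothetical default
  const counts = new Map()
  for (const words of sentences) {
    for (let size = min; size <= max; size++) {
      // slide a window of `size` words across the sentence
      for (let i = 0; i + size <= words.length; i++) {
        const gram = words.slice(i, i + size).join(' ')
        const entry = counts.get(gram) || { gram, size, count: 0 }
        entry.count += 1
        counts.set(gram, entry)
      }
    }
  }
  // most frequent grams first
  return Array.from(counts.values()).sort((a, b) => b.count - a.count)
}

const sentences = [
  ['one', 'two', 'three'],
  ['one', 'two', 'foo'],
]
const bigrams = ngramSketch(sentences, { size: 2 })
const trigramsUp = ngramSketch(sentences, { min: 3 })
console.log(bigrams, trigramsUp)
```

With `size: 2` only two-word grams are counted, so 'one two' appears twice; with `min: 3` the sketch keeps only grams of three or more words.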
MIT