wink-nlp-utils
Easily tokenize, stem, phonetize, remove stop words, manage elisions, create ngrams, bag of words and more
Prepare raw text for Natural Language Processing (NLP) using wink-nlp-utils
.It is a part of wink — a growing family of high quality packages for Statistical Analysis, Natural Language Processing and Machine Learning in NodeJS.
It offers a set of APIs to work on strings such as names, sentences, paragraphs and tokens represented as an array of strings/words. They perform the required pre-processing for many simple ML tasks such as semantic search, and classification.
Installation
Use npm to install:
npm install wink-nlp-utils --save
Usage
var nlp = require( 'wink-nlp-utils' );
var name = nlp.string.extractPersonsName( 'Dr. Sarah Connor M. Tech., PhD. - AI' );
var str = '[I] [am having|have] [a] [problem|question]';
console.log( prepare.string.composeCorpus( str ) );
var t = nlp.tokens.removeWords( [ 'mary', 'had', 'a', 'little', 'lamb' ] );
APIs
string
lowerCase( s )
Converts the input string s
to lower case.
upperCase( s )
Converts the input sting s
to upper case.
trim( s )
Trims leading and trailing spaces from the input string s
.
Removes leading & trailing white spaces along with any extra spaces appearing in between from the input
string s
.
retainAlphaNums( s )
Retains only alpha-numerals and spaces and removes all other characters, including leading/trailing/extra spaces from
the input string s
.
Attempts to extract person's name from input string s
in formats like
Dr. Eugine Cyron B. Tech., M. Tech., PhD. - Electrical by dropping
the titles and degrees. It assumes the following name format:
[<salutations>] <name part in FN, MN, LN> [<degrees>]
Returns an array of words appearing as Title Case or in ALL CAPS in the input string s
.
removePunctuations( s )
Removes each punctuation mark by a space. It looks for .,;!?:"!'... - () [] {}
from the input string s
and replaces it by a space. Use removeExtraSpaces( s )
in order to remove the spaces in the string.
removeSplChars( s )
Removes the special characters like ~@#%^*+=
from the input string 's' and replaces it by a space. These can be removed using removeExtraSpaces( s )
.
removeHTMLTags( s )
Removes HTML tags, escape sequences from the input string s
and replaces it by space character. These can be removed using removeExtraSpaces( s )
.
removeElisions( s )
Removes basic elisions found in the input string s
. An I'll
becomes I
, Isn't
becomes Is
. An apostrophe found in the string s
remains as is.
splitElisions( s )
Splits elisions from the input string s
by inserting a space. Elisions like we're
or I'm
are split as we re
or I m
.
amplifyNotElision( s )
Amplifies the not elision by replacing it by the word not
in the input string s
; it must be used before calling the removeElisions()
. Can't
, Isn't
, Haven't
are amplified as Ca not
, Is not
, Have not
.
marker( s )
Generates a marker
for the input string s
as 1-gram, sorted and joined back as a string again; useful input in determining a quick and aggressive way to detect similarity in short strings. Its aggression leads to more false positives such as Meter
and Metre
or no melon
and no lemon
.
soc( s, ifn, idx )
Creates a set of characters (soc)
from the input string s
. This is useful in even more aggressive string matching using Jaccard or Tversky Indexes as compared to marker()
.
ngram( s, size )
Generates the ngram of the size
from the input string s
. Default value of size
is 2. The function returns an array of ngrams. In case, 0
is given as size
parameter, ngrams
of size 2
will be returned.
bong( s, size, ifn, idx )
Generates a bag of ngrams of the size
from the input string s
. Default value of size
is 2. This function returns an object containing ngram (key) and their frequency (value) of occurrence. While ngram()
preserves the sequence and has no frequency information of each ngram, bong()
on the other hand captures the frequency of each ngram and has no sequence information. Input arguments ifn
and idx
are optional. For special cases, where index is required, please refer to the helper function index()
.
song( s, size, ifn, idx )
Generates a set of ngrams of the size
from the input string s
. Default value of size
is 2. This function returns an object containing ngram (key) and their frequency (value) of occurrence. While ngram()
preserves the sequence and has no frequency information of each ngram, song()
on the other hand captures the frequency of each ngram and has no sequence information. Input arguments ifn
and idx
are optional. For special cases, where index is required, please refer to the helper function index()
.
stem( s )
The input string s
is stemmed using the Porter2 English Stemming Algorithm
sentences( s, splChar )
Splits the text contained in the input string s
into sentences returned in the form of an array. Punctuation marks found at the end of a sentence are retained. The function can handle sentences beginning with numbers as well, though it is not a good english practice. It uses ~
as the splChar
for splitting and therefore it must not be present in the input string; else you may give another splChar
as the second argument.
composeCorpus( s )
Generates all possible sentences from the input argument string — s
. The string s
must follow a special syntax as illustrated in the example below:
'[I] [am having|have] [a] [problem|question]'
Each phrase must be quoted between []
and each possible option of phrases (if any) must be separated by a |
character. The corpus is composed by computing the cartesian product of all the phrases as highlighted in the usage section. It
returns an array of sentences (i.e. strings).
tokenize0( s )
Tokenizes by splitting the input string s
on non-words. This means tokens would consists of only alphas, numerals and underscores; all other characters will be stripped as they are treated as separators. However negations are retained and amplified but all other elisions are removed. Tokenize0
is useful when the text strings are clean and do not require pre-processing like removing punctuations,extra spaces, handling elisions etc.
tokenize( s )
The function follows set of rules given below to remove and preserve punctuation/special characters in the input string s
. The Extra/leading/trailing spaces are removed and finally split on space to tokenize.
- First,
single quotes
are processed as they may be a part of elisions in the string; and …
are converted to ellipses. Not
elisions are amplified and then split on elisions. Thus words with elisions get tokenized- The word
cannot
is split in to can not
. . , -
punctuations commonly embedded in numbers are left intact, All other punctuations are tokenized.- The
currency symbols
are padded by space i.e. become separate tokens. Underscore (_)
embedded in the word is preserved.Special characters
are preserved and may/may not become separate tokens.- Finally after removing
extra/leading/trailing spaces
, split on space
to tokenize.
phonetize( s )
Phonetizes the input string s
using an algorithmic adaptation of Metaphone.
soundex( s, maxLength )
Generates the soundex code from the input s
. The maxLength
determines the
maximum length of soundex code to be returned; it is optional and has a default value of 4.
tokens
Tokens are created by splitting a string into words, keywords, symbols. These tokens are used as an input to various activities during text analysis.
stem( t )
Each element of input array of tokens t
is stemmed using Porter2 English Stemming Algorithm. Not to be confused with the stem() under string as it performs stemming on the input string s
, whereas this function requires an token array t
as an input.
bow( t, logcounts )
Creates Bag of Words from the input array of tokens t
. Specifying the logCounts
parameter flags the use of log2
( word counts ) instead of counts directly. The idea behind using log2
is to ensure that a word’s importance does not increase linearly with its count. It is required as an input for computing similarity using bow.cosine()
.
sow( t, ifn, idx )
Creates a Set of tokens from the input array t
. It is required as an input for computing similarity using Jaccard or Tversky Indexes. Input arguments ifn
and idx
are optional, please refer to the function index()
.
phonetize( t )
An array of tokens t
are phonetized using an algorithmic adaptation of Metaphone. This is not to be confused with phonetize( s )
for string only phonetization.
set( t )
Creates a Set of tokens
from the input array t
. It is required as an input for computing similarity using Jaccard
or Tversky
Indexes. This is not to be confused with set( s ) of string sets for computing similarity.
removeWords( t, givenStopWords )
Removes the givenStopWords
from the input array of tokens t
.
If the givenStopWords
parameter is not specified then the default stop words are used.
The list of default stop words are loaded from stop_words.json
located under
the lib/dictionaries/
directory.
The givenStopWords
are constructed using prepare.words()
as outlined below:
words( w, givenMappers )
Creates stop words for removeWords()
from an array of words (i.e. strings) and
serially applies all the mapper functions supplied in the optional givenMappers
array. A mapper should
take a string as an input and return a transformed string. Typical example of a mapper is
prepare.string.phonetize()
.
propagateNegation( t, upto )
It looks for negation tokens in the input array of tokens t
and propagates negation to subsequent upto
tokens by prefixing them by a !
: useful in handling text containing negations
during similarity detection or classification.
bigrams( t )
Generates the bigram from the input tokens t
.
appendBigrams( t )
Generates bigrams from the input t
tokens and returns them by appending to t
.
helper
helper
name space contains functions which returns function(s). They can be used for generating input arguments by the calling function.
words( w, givenMappers )
Returns an object contains the following functions
(a) set()
returns a set of words given in the input array w
.
(b) exclude()
that is suitable for filtering operations.
If the second argument givenMappers
is passed as an array of mapper functions then these are applied on the input array before converting into a set. Typical example of mapper functions are prepare.string.stem()
and prepare.string.phonetize()
.
It has an alias returnWordsFilter()
.
index()
Builds an index and returns 2 functions as follows:
(a) build()
is useful with bag & set creation functions where, by passing the build function, an index of each key/member can be built.
(b) result()
can be probed anytime to access the output of build()
.
Probing the result() returns ifn
and idx
values for the calling function as in n soc()
, song()
, bong()
, bow()
, and sow()
. Note: usage of ifn
are limited by the developer’s imagination!
It has an alias returnIndexer()
.
It returns a function that extracts all occurrences of every quoted text between the leftQuote
and the rightQuote
characters from its argument s
- a string. Both the default quote characters are same — "
. Usage is illustrated below:
var extractQuotedText = prepare.helper.returnQuotedTextExtractor();
console.log( extractQuotedText( 'Raise 2 issues - "fix a bug" & "run tests"' ) );
Need Help?
If you spot a bug and the same has not yet been reported, raise a new issue or consider fixing it and sending a pull request.
Copyright & License
wink-nlp-utils is copyright 2017 GRAYPE Systems Private Limited.
It is licensed under the under the terms of the GNU Affero General Public License as published by the Free
Software Foundation, version 3 of the License.