wink-nlp-utils
Comparing version 1.0.1 to 1.0.2
{ | ||
"name": "wink-nlp-utils", | ||
"version": "1.0.1", | ||
"version": "1.0.2", | ||
"description": "Natural Language Processing Utilities that let you tokenize, stem, phonetize, create ngrams, bag of words and more.", | ||
@@ -5,0 +5,0 @@ "keywords": [ |
README.md
@@ -1,15 +0,20 @@ | ||
# wink-nlp-utils [![Build Status](https://api.travis-ci.org/decisively/wink-nlp-utils.svg?branch=master)](https://travis-ci.org/decisively/wink-nlp-utils) [![Coverage Status](https://coveralls.io/repos/github/decisively/wink-nlp-utils/badge.svg?branch=master)](https://coveralls.io/github/decisively/wink-nlp-utils?branch=master) | ||
> Natural Language Processing Utilities that let you tokenize, stem, phonetize, create ngrams, bag of words and more. | ||
# wink-nlp-utils | ||
**wink-nlp-utils** is a part of **wink**, which is a collection of Machine Learning utilities. | ||
> Easily tokenize, stem, phonetize, remove stop words, manage elisions, create ngrams, bag of words and more. | ||
Prepares raw text for NLP. It operates on **[strings](#string)** such as names, sentences, paragraphs and **[tokens](#tokens)** represented as an array of strings. | ||
The following code snippet illustrates how to use the *wink-nlp-utils* module | ||
### [![Build Status](https://api.travis-ci.org/decisively/wink-nlp-utils.svg?branch=master)](https://travis-ci.org/decisively/wink-nlp-utils) [![Coverage Status](https://coveralls.io/repos/github/decisively/wink-nlp-utils/badge.svg?branch=master)](https://coveralls.io/github/decisively/wink-nlp-utils?branch=master) | ||
<img align="right" src="https://decisively.github.io/wink-logos/logo-title.png" width="100px" > | ||
**wink-nlp-utils** is a part of **[wink](https://www.npmjs.com/~sanjaya)**, which is a family of Machine Learning NPM packages. They consist of simple and/or higher order functions that can be combined with NodeJS `stream` and `child processes` to create recipes for analytics driven business solutions. | ||
Prepares raw text for Natural Language Processing (NLP). It offers a set of **[APIs](#apis)** to work on **[strings](#string)** such as names, sentences, paragraphs and **[tokens](#tokens)** represented as an array of strings/words. They perform the required pre-processing for ML tasks such as **similarity detection**, **classification**, and **semantic search**. | ||
## Installation | ||
Use **npm** to install: | ||
Use **[npm](https://www.npmjs.com/package/wink-nlp-utils)** to install: | ||
``` | ||
npm install wink-nlp-utils | ||
npm install wink-nlp-utils --save | ||
``` | ||
@@ -23,8 +28,8 @@ | ||
// Load Prepare Text | ||
var prepare = require( 'wink-nlp-utils' ); | ||
// Load wink-nlp-utils | ||
var nlp = require( 'wink-nlp-utils' ); | ||
// Use a string Function | ||
// Input argument is a string | ||
var name = prepare.string.extractPersonsName( 'Dr. Sarah Connor M. Tech., PhD. - AI' ); | ||
var name = nlp.string.extractPersonsName( 'Dr. Sarah Connor M. Tech., PhD. - AI' ); | ||
// name -> 'Sarah Connor' | ||
@@ -34,3 +39,3 @@ | ||
// Input argument is an array of tokens; remove stop words. | ||
var t = prepare.tokens.removeWords( [ 'mary', 'had', 'a', 'little', 'lamb' ] ); | ||
var t = nlp.tokens.removeWords( [ 'mary', 'had', 'a', 'little', 'lamb' ] ); | ||
// t -> [ 'mary', 'little', 'lamb' ] | ||
@@ -40,14 +45,16 @@ | ||
## string | ||
## APIs | ||
### lowerCase( s ) | ||
### string | ||
#### lowerCase( s ) | ||
Converts the input string `s` to lower case. | ||
### upperCase( s ) | ||
#### upperCase( s ) | ||
Converts the input string `s` to upper case. | ||
### trim( s ) | ||
#### trim( s ) | ||
Trims leading and trailing spaces from the input string `s`. | ||
### removeExtraSpaces( s ) | ||
#### removeExtraSpaces( s ) | ||
@@ -57,3 +64,3 @@ Removes leading & trailing white spaces along with any extra spaces appearing in between from the input | ||
### retainAlphaNums( s ) | ||
#### retainAlphaNums( s ) | ||
@@ -63,3 +70,3 @@ Retains only alpha-numerals and spaces and removes all other characters, including leading/trailing/extra spaces from | ||
### extractPersonsName( s ) | ||
#### extractPersonsName( s ) | ||
@@ -72,15 +79,15 @@ Attempts to extract a person's name from the input string `s` in formats like | ||
### extractRunOfCapitalWords( s ) | ||
#### extractRunOfCapitalWords( s ) | ||
Returns an array of words appearing as Title Case or in ALL CAPS in the input string `s`. | ||
### removePunctuations( s ) | ||
#### removePunctuations( s ) | ||
Replaces each punctuation mark in the input string `s` with a space. It looks for `.,;!?:"!'... - () [] {}` and replaces each occurrence with a space. Use `removeExtraSpaces( s )` afterwards to remove the resulting extra spaces. | ||
### removeSplChars( s ) | ||
#### removeSplChars( s ) | ||
Replaces special characters such as `~@#%^*+=` in the input string `s` with a space. The resulting extra spaces can be removed using `removeExtraSpaces( s )`. | ||
### removeHTMLTags( s ) | ||
#### removeHTMLTags( s ) | ||
@@ -90,3 +97,3 @@ Replaces HTML tags and escape sequences in the input string `s` with a space character. The resulting extra spaces can be removed using `removeExtraSpaces( s )`. | ||
### removeElisions( s ) | ||
#### removeElisions( s ) | ||
@@ -96,39 +103,39 @@ Removes basic elisions found in the input string `s`. An `I'll` becomes `I`, `Isn't` becomes `Is`. An apostrophe found in the string `s` remains as is. | ||
### splitElisions( s ) | ||
#### splitElisions( s ) | ||
Splits elisions from the input string `s` by inserting a space. Elisions like `we're` or `I'm` are split as `we re` or `I m`. | ||
### amplifyNotElision( s ) | ||
#### amplifyNotElision( s ) | ||
Amplifies the not elision by replacing it with the word `not` in the input string `s`; it must be used before calling `removeElisions()`. `Can't`, `Isn't`, and `Haven't` are amplified as `Ca not`, `Is not`, and `Have not`. | ||
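As a rough illustration, the amplification described above can be sketched with a single regex (an assumption for illustration, not the library's actual rule set):

```javascript
// Sketch of not-elision amplification: expand the "n't" contraction into
// the full word "not" so negation survives later token processing.
// This simple regex is an assumption, not wink-nlp-utils' actual rules.
const amplifyNot = ( s ) => s.replace( /n't\b/gi, ' not' );

console.log( amplifyNot( "Isn't it" ) );   // -> 'Is not it'
console.log( amplifyNot( "Can't stop" ) ); // -> 'Ca not stop'
```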
### marker( s ) | ||
#### marker( s ) | ||
Generates a `marker` for the input string `s` as a 1-gram, sorted and joined back into a string again; useful as a quick and aggressive way to detect similarity in short strings. Its aggressiveness leads to more false positives, such as `Meter` and `Metre` or `no melon` and `no lemon`. | ||
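A minimal sketch of the marker idea (`makeMarker` is a hypothetical stand-in, not the package's source): sorting the characters of a lower-cased string collapses anagram-like strings to the same key, which is exactly why `Meter`/`Metre` collide:

```javascript
// Hypothetical marker-style helper: 1-gram the string, sort the
// characters, and join, so strings made of the same characters map to
// the same marker (the source of the documented false positives).
const makeMarker = ( s ) => s.toLowerCase().split( '' ).sort().join( '' );

console.log( makeMarker( 'Meter' ) );                          // -> 'eemrt'
console.log( makeMarker( 'Meter' ) === makeMarker( 'Metre' ) ); // -> true
```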
### soc( s, ifn, idx ) | ||
#### soc( s, ifn, idx ) | ||
Creates a `set of characters (soc)` from the input string `s`. This is useful in even more aggressive string matching using **Jaccard** or **Tversky** Indexes as compared to `marker()`. | ||
### ngram( s, size ) | ||
#### ngram( s, size ) | ||
Generates ngrams of the given `size` from the input string `s`. The default value of `size` is 2. The function returns an array of ngrams. If `0` is given as the `size` parameter, ngrams of size 2 are returned. | ||
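The sliding-window mechanics behind character ngrams can be sketched as follows (a simplified stand-in named `charNgrams`, not the package's implementation):

```javascript
// Slide a window of `size` characters across the string to build ngrams;
// the resulting array preserves the order in which ngrams appear.
function charNgrams( s, size = 2 ) {
  const grams = [];
  for ( let i = 0; i + size <= s.length; i += 1 ) {
    grams.push( s.slice( i, i + size ) );
  }
  return grams;
}

console.log( charNgrams( 'mary' ) );    // -> [ 'ma', 'ar', 'ry' ]
console.log( charNgrams( 'mary', 3 ) ); // -> [ 'mar', 'ary' ]
```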
### bong( s, size, ifn, idx ) | ||
#### bong( s, size, ifn, idx ) | ||
Generates a **b**ag **o**f **ng**rams of the given `size` from the input string `s`. The default value of `size` is 2. This function returns an object containing each ngram (key) and its frequency of occurrence (value). While `ngram()` preserves the sequence and has no frequency information for each ngram, `bong()` captures the frequency of each ngram and has no sequence information. Input arguments `ifn` and `idx` are optional. For special cases where an index is required, please refer to the helper function `index()`. | ||
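The frequency-counting trade-off can be illustrated with a small stand-in (`countNgrams` is a hypothetical name, not the library's function):

```javascript
// Count each ngram's occurrences: sequence information is lost while
// frequency information is kept -- the opposite trade-off to ngram().
function countNgrams( s, size = 2 ) {
  const bag = Object.create( null );
  for ( let i = 0; i + size <= s.length; i += 1 ) {
    const g = s.slice( i, i + size );
    bag[ g ] = ( bag[ g ] || 0 ) + 1;
  }
  return bag;
}

console.log( countNgrams( 'mama' ) ); // -> { ma: 2, am: 1 }
```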
### song( s, size, ifn, idx ) | ||
#### song( s, size, ifn, idx ) | ||
Generates a **s**et **o**f **ng**rams of the given `size` from the input string `s`. The default value of `size` is 2. This function returns a set containing the unique ngrams. While `ngram()` preserves the sequence and `bong()` captures the frequency of each ngram, `song()` captures only the distinct ngrams and has neither sequence nor frequency information. Input arguments `ifn` and `idx` are optional. For special cases where an index is required, please refer to the helper function `index()`. | ||
### stem( s ) | ||
#### stem( s ) | ||
The input string `s` is stemmed using the [Porter2 English Stemming Algorithm](http://snowballstem.org/algorithms/english/stemmer.html). | ||
### sentences( s, splChar ) | ||
#### sentences( s, splChar ) | ||
Splits the text contained in the input string `s` into sentences, returned in the form of an array. Punctuation marks found at the end of a sentence are retained. The function can handle sentences beginning with numbers as well, though that is not good English practice. It uses `~` as the `splChar` for splitting, so `~` must not be present in the input string; otherwise, pass another `splChar` as the second argument. | ||
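The split-character trick can be illustrated with a much-simplified stand-in (the boundary regex below is an assumption and ignores abbreviations and other cases the real function may handle):

```javascript
// Mark likely sentence boundaries with the split character, then split
// on it so end-of-sentence punctuation is retained in each sentence.
function splitSentences( text, splChar = '~' ) {
  return text
    .replace( /([.!?])\s+(?=[A-Z0-9])/g, '$1' + splChar )
    .split( splChar );
}

console.log( splitSentences( 'AI is fun! It is the future.' ) );
// -> [ 'AI is fun!', 'It is the future.' ]
```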
### tokenize0( s ) | ||
#### tokenize0( s ) | ||
@@ -138,3 +145,3 @@ Tokenizes by splitting the input string `s` on non-words. This means tokens consist of only alphas, numerals, and underscores; all other characters are stripped as they are treated as separators. However, negations are retained and amplified while all other elisions are removed. `tokenize0` is useful when the text strings are clean and do not require pre-processing like removing punctuation, extra spaces, or handling elisions. | ||
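The split-on-non-words behaviour (minus the negation handling) can be approximated in one line; `roughTokenize0` is a hypothetical stand-in, not the package's function:

```javascript
// Split on runs of non-word characters; only alphas, numerals, and
// underscores survive. Unlike tokenize0(), this sketch does not amplify
// negations before splitting.
const roughTokenize0 = ( s ) => s.split( /\W+/ ).filter( Boolean );

console.log( roughTokenize0( 'A small-world, after all!' ) );
// -> [ 'A', 'small', 'world', 'after', 'all' ]
```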
### tokenize( s ) | ||
#### tokenize( s ) | ||
@@ -152,3 +159,3 @@ The function follows the set of rules given below to remove or preserve punctuation/special characters in the input string `s`. Extra/leading/trailing spaces are removed and the string is finally split on spaces to tokenize. | ||
### phonetize( s ) | ||
#### phonetize( s ) | ||
@@ -160,3 +167,3 @@ Phonetizes the input string `s` using an algorithmic adaptation of [Metaphone](https://en.wikipedia.org/wiki/Metaphone). | ||
## tokens | ||
### tokens | ||
@@ -166,3 +173,3 @@ Tokens are created by splitting a string into words, keywords, symbols. These tokens are used as an input to various activities during text analysis. | ||
### stem( t ) | ||
#### stem( t ) | ||
@@ -172,16 +179,16 @@ Each element of the input array of tokens `t` is stemmed using the [Porter2 English Stemming Algorithm](http://snowballstem.org/algorithms/english/stemmer.html). Not to be confused with the string `stem( s )`, which stems an input string `s`, whereas this function requires a token array `t` as input. | ||
### bow( t, logcounts ) | ||
#### bow( t, logcounts ) | ||
Creates Bag of Words from the input array of tokens `t`. Specifying the `logCounts` parameter flags the use of `log2`( word counts ) instead of counts directly. The idea behind using `log2` is to ensure that a word’s importance does not increase linearly with its count. It is required as an input for computing similarity using `bow.cosine()`. | ||
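A sketch of the counting logic described above; the `log2( 1 + count )` damping shown here is one common choice and an assumption about the exact formula, not the library's source:

```javascript
// Build a bag of words; optionally damp raw counts with log2 so a
// word's weight does not grow linearly with its frequency.
// The log2( 1 + count ) formula is an illustrative assumption.
function bagOfWords( tokens, logCounts = false ) {
  const bow = Object.create( null );
  tokens.forEach( ( t ) => { bow[ t ] = ( bow[ t ] || 0 ) + 1; } );
  if ( logCounts ) {
    Object.keys( bow ).forEach( ( t ) => { bow[ t ] = Math.log2( 1 + bow[ t ] ); } );
  }
  return bow;
}

console.log( bagOfWords( [ 'rain', 'rain', 'go', 'away' ] ) );
// -> { rain: 2, go: 1, away: 1 }
```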
### sow( t, ifn, idx ) | ||
#### sow( t, ifn, idx ) | ||
Creates a Set of tokens from the input array `t`. It is required as an input for computing similarity using `Jaccard` or `Tversky` Indexes. Input arguments `ifn` and `idx` are optional, please refer to the function `index()`. | ||
Creates a Set of tokens from the input array `t`. It is required as an input for computing similarity using **Jaccard** or **Tversky** Indexes. Input arguments `ifn` and `idx` are optional, please refer to the function `index()`. | ||
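For context, the Jaccard index over two such token sets is simply intersection over union; this standalone sketch shows the computation the sets feed into (the helper name is hypothetical):

```javascript
// Jaccard index: |A ∩ B| / |A ∪ B| over two token sets.
function jaccard( a, b ) {
  const intersection = [ ...a ].filter( ( x ) => b.has( x ) ).length;
  const union = new Set( [ ...a, ...b ] ).size;
  return union === 0 ? 0 : intersection / union;
}

const s1 = new Set( [ 'mary', 'little', 'lamb' ] );
const s2 = new Set( [ 'mary', 'big', 'lamb' ] );
console.log( jaccard( s1, s2 ) ); // -> 0.5 (2 shared / 4 total)
```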
### phonetize( t ) | ||
#### phonetize( t ) | ||
The tokens in the input array `t` are phonetized using an algorithmic adaptation of [Metaphone](https://en.wikipedia.org/wiki/Metaphone). This is not to be confused with `phonetize( s )`, which phonetizes a string only. | ||
### set( t ) | ||
#### set( t ) | ||
@@ -191,3 +198,3 @@ Creates a `Set of tokens` from the input array `t`. It is required as an input for computing similarity using `Jaccard` or `Tversky` Indexes. This is not to be confused with set( s ) of string sets for computing similarity. | ||
### removeWords( t, givenStopWords ) | ||
#### removeWords( t, givenStopWords ) | ||
@@ -201,3 +208,3 @@ Removes the `givenStopWords` from the input array of tokens `t`. | ||
> #### words( w, givenMappers ) | ||
> ##### words( w, givenMappers ) | ||
@@ -210,3 +217,3 @@ > Creates stop words for `removeWords()` from an array of words (i.e. strings) and | ||
### propagateNegation( t, upto ) | ||
#### propagateNegation( t, upto ) | ||
@@ -218,6 +225,6 @@ It looks for negation tokens in the input array of tokens `t` and propagates negation to subsequent `upto` | ||
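The negation propagation described above can be sketched as follows; both the negation word list and the `!` prefix convention are assumptions for illustration:

```javascript
// Hypothetical sketch: prefix the `upto` tokens that follow a negation
// token with '!' so negated words become distinct features downstream.
// The negation list and the '!' convention are assumptions here.
function propagateNeg( tokens, upto = 2 ) {
  const negations = new Set( [ 'not', 'no', 'never' ] );
  const out = tokens.slice();
  for ( let i = 0; i < out.length; i += 1 ) {
    if ( negations.has( out[ i ] ) ) {
      const end = Math.min( i + upto, out.length - 1 );
      for ( let j = i + 1; j <= end; j += 1 ) out[ j ] = '!' + out[ j ];
    }
  }
  return out;
}

console.log( propagateNeg( [ 'not', 'a', 'good', 'movie' ] ) );
// -> [ 'not', '!a', '!good', 'movie' ]
```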
## helper | ||
### helper | ||
The `helper` name space contains functions that return function(s). They can be used to generate input arguments for the calling function. | ||
### words( w, givenMappers ) | ||
#### words( w, givenMappers ) | ||
@@ -230,3 +237,3 @@ Returns an object containing the following functions | ||
### index() | ||
#### index() | ||
Builds an index and returns 2 functions as follows: | ||
@@ -237,3 +244,3 @@ | ||
Probing the result() returns `ifn` and `idx` values for the calling function as in n `soc()`,s ong()`, `bong()`, `bow()`, and `sow(). Note: usage of `ifn` are limited by the developer’s imagination! | ||
Probing the result() returns `ifn` and `idx` values for the calling function, as in `soc()`, `song()`, `bong()`, `bow()`, and `sow()`. Note: usage of `ifn` is limited only by the developer’s imagination! | ||
@@ -240,0 +247,0 @@ |
License Policy Violation
License: This package is not allowed per your license policy. Review the package's license to ensure compliance.
Found 1 instance in 1 package