wink-nlp-utils
Comparing version 1.0.1 to 1.0.2
{ | ||
"name": "wink-nlp-utils", | ||
"version": "1.0.1", | ||
"version": "1.0.2", | ||
"description": "Natural Language Processing Utilities that let you tokenize, stem, phonetize, create ngrams, bag of words and more.", | ||
@@ -5,0 +5,0 @@ "keywords": [ |
README.md
@@ -1,15 +0,20 @@ | ||
# wink-nlp-utils [![Build Status](https://api.travis-ci.org/decisively/wink-nlp-utils.svg?branch=master)](https://travis-ci.org/decisively/wink-nlp-utils) [![Coverage Status](https://coveralls.io/repos/github/decisively/wink-nlp-utils/badge.svg?branch=master)](https://coveralls.io/github/decisively/wink-nlp-utils?branch=master) | ||
> Natural Language Processing Utilities that let you tokenize, stem, phonetize, create ngrams, bag of words and more. | ||
# wink-nlp-utils | ||
**wink-nlp-utils** is a part of **wink**, which is a collection of Machine Learning utilities. | ||
> Easily tokenize, stem, phonetize, remove stop words, manage elisions, create ngrams, bag of words and more. | ||
Prepares raw text for NLP. It operates on **[strings](#string)** such as names, sentences, paragraphs and **[tokens](#tokens)** represented as an array of strings. | ||
The following code snippet illustrates how to use the *wink-nlp-utils* module | ||
### [![Build Status](https://api.travis-ci.org/decisively/wink-nlp-utils.svg?branch=master)](https://travis-ci.org/decisively/wink-nlp-utils) [![Coverage Status](https://coveralls.io/repos/github/decisively/wink-nlp-utils/badge.svg?branch=master)](https://coveralls.io/github/decisively/wink-nlp-utils?branch=master) | ||
<img align="right" src="https://decisively.github.io/wink-logos/logo-title.png" width="100px" > | ||
**wink-nlp-utils** is a part of **[wink](https://www.npmjs.com/~sanjaya)**, which is a family of Machine Learning NPM packages. They consist of simple and/or higher order functions that can be combined with NodeJS `stream` and `child processes` to create recipes for analytics driven business solutions. | ||
Prepares raw text for Natural Language Processing (NLP). It offers a set of **[APIs](#apis)** to work on **[strings](#string)** such as names, sentences, paragraphs and **[tokens](#tokens)** represented as an array of strings/words. They perform the required pre-processing for ML tasks such as **similarity detection**, **classification**, and **semantic search**. | ||
## Installation | ||
Use **npm** to install: | ||
Use **[npm](https://www.npmjs.com/package/wink-nlp-utils)** to install: | ||
``` | ||
npm install wink-nlp-utils | ||
npm install wink-nlp-utils --save | ||
``` | ||
@@ -23,8 +28,8 @@ | ||
// Load Prepare Text | ||
var prepare = require( 'wink-nlp-utils' ); | ||
// Load wink-nlp-utils | ||
var nlp = require( 'wink-nlp-utils' ); | ||
// Use a string Function | ||
// Input argument is a string | ||
var name = prepare.string.extractPersonsName( 'Dr. Sarah Connor M. Tech., PhD. - AI' ); | ||
var name = nlp.string.extractPersonsName( 'Dr. Sarah Connor M. Tech., PhD. - AI' ); | ||
// name -> 'Sarah Connor' | ||
@@ -34,3 +39,3 @@ | ||
// Input argument is an array of tokens; remove stop words. | ||
var t = prepare.tokens.removeWords( [ 'mary', 'had', 'a', 'little', 'lamb' ] ); | ||
var t = nlp.tokens.removeWords( [ 'mary', 'had', 'a', 'little', 'lamb' ] ); | ||
// t -> [ 'mary', 'little', 'lamb' ] | ||
@@ -40,14 +45,16 @@ | ||
## string | ||
## APIs | ||
### lowerCase( s ) | ||
### string | ||
#### lowerCase( s ) | ||
Converts the input string `s` to lower case. | ||
### upperCase( s ) | ||
#### upperCase( s ) | ||
Converts the input string `s` to upper case. | ||
### trim( s ) | ||
#### trim( s ) | ||
Trims leading and trailing spaces from the input string `s`. | ||
### removeExtraSpaces( s ) | ||
#### removeExtraSpaces( s ) | ||
@@ -57,3 +64,3 @@ Removes leading & trailing white spaces along with any extra spaces appearing in between from the input | ||
### retainAlphaNums( s ) | ||
#### retainAlphaNums( s ) | ||
@@ -63,3 +70,3 @@ Retains only alpha-numerals and spaces and removes all other characters, including leading/trailing/extra spaces from | ||
### extractPersonsName( s ) | ||
#### extractPersonsName( s ) | ||
@@ -72,15 +79,15 @@ Attempts to extract a person's name from the input string `s` in formats like | ||
### extractRunOfCapitalWords( s ) | ||
#### extractRunOfCapitalWords( s ) | ||
Returns an array of words appearing as Title Case or in ALL CAPS in the input string `s`. | ||
### removePunctuations( s ) | ||
#### removePunctuations( s ) | ||
Replaces each punctuation mark in the input string `s` with a space. It looks for `.,;!?:"!'... - () [] {}` and replaces each occurrence with a space. Use `removeExtraSpaces( s )` afterwards to remove the resulting extra spaces. | ||
### removeSplChars( s ) | ||
#### removeSplChars( s ) | ||
Replaces special characters such as `~@#%^*+=` in the input string `s` with a space. The resulting extra spaces can be removed using `removeExtraSpaces( s )`. | ||
### removeHTMLTags( s ) | ||
#### removeHTMLTags( s ) | ||
@@ -90,3 +97,3 @@ Replaces HTML tags and escape sequences in the input string `s` with a space character. The resulting extra spaces can be removed using `removeExtraSpaces( s )`. | ||
### removeElisions( s ) | ||
#### removeElisions( s ) | ||
@@ -96,39 +103,39 @@ Removes basic elisions found in the input string `s`. An `I'll` becomes `I`, `Isn't` becomes `Is`. An apostrophe found in the string `s` remains as is. | ||
### splitElisions( s ) | ||
#### splitElisions( s ) | ||
Splits elisions from the input string `s` by inserting a space. Elisions like `we're` or `I'm` are split as `we re` or `I m`. | ||
### amplifyNotElision( s ) | ||
#### amplifyNotElision( s ) | ||
Amplifies the not elision by replacing it with the word `not` in the input string `s`; it must be used before calling `removeElisions()`. `Can't`, `Isn't`, and `Haven't` are amplified as `Ca not`, `Is not`, and `Have not`. | ||
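As a rough illustration, the amplification described above can be sketched with a single regex (an assumption for illustration, not the library's actual rule set):

```javascript
// Sketch of not-elision amplification: expand the "n't" contraction into
// the full word "not" so negation survives later token processing.
// This simple regex is an assumption, not wink-nlp-utils' actual rules.
const amplifyNot = ( s ) => s.replace( /n't\b/gi, ' not' );

console.log( amplifyNot( "Isn't it" ) );   // -> 'Is not it'
console.log( amplifyNot( "Can't stop" ) ); // -> 'Ca not stop'
```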
### marker( s ) | ||
#### marker( s ) | ||
Generates a `marker` for the input string `s` as a 1-gram, sorted and joined back into a string again; useful as a quick and aggressive way to detect similarity in short strings. Its aggressiveness leads to more false positives, such as `Meter` and `Metre` or `no melon` and `no lemon`. | ||
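A minimal sketch of the marker idea (`makeMarker` is a hypothetical stand-in, not the package's source): sorting the characters of a lower-cased string collapses anagram-like strings to the same key, which is exactly why `Meter`/`Metre` collide:

```javascript
// Hypothetical marker-style helper: 1-gram the string, sort the
// characters, and join, so strings made of the same characters map to
// the same marker (the source of the documented false positives).
const makeMarker = ( s ) => s.toLowerCase().split( '' ).sort().join( '' );

console.log( makeMarker( 'Meter' ) );                          // -> 'eemrt'
console.log( makeMarker( 'Meter' ) === makeMarker( 'Metre' ) ); // -> true
```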
### soc( s, ifn, idx ) | ||
#### soc( s, ifn, idx ) | ||
Creates a `set of characters (soc)` from the input string `s`. This is useful in even more aggressive string matching using **Jaccard** or **Tversky** Indexes as compared to `marker()`. | ||
### ngram( s, size ) | ||
#### ngram( s, size ) | ||
Generates ngrams of the given `size` from the input string `s`. The default value of `size` is 2. The function returns an array of ngrams. If `0` is given as the `size` parameter, ngrams of size 2 are returned. | ||
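The sliding-window mechanics behind character ngrams can be sketched as follows (a simplified stand-in named `charNgrams`, not the package's implementation):

```javascript
// Slide a window of `size` characters across the string to build ngrams;
// the resulting array preserves the order in which ngrams appear.
function charNgrams( s, size = 2 ) {
  const grams = [];
  for ( let i = 0; i + size <= s.length; i += 1 ) {
    grams.push( s.slice( i, i + size ) );
  }
  return grams;
}

console.log( charNgrams( 'mary' ) );    // -> [ 'ma', 'ar', 'ry' ]
console.log( charNgrams( 'mary', 3 ) ); // -> [ 'mar', 'ary' ]
```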
### bong( s, size, ifn, idx ) | ||
#### bong( s, size, ifn, idx ) | ||
Generates a **b**ag **o**f **ng**rams of the given `size` from the input string `s`. The default value of `size` is 2. This function returns an object containing each ngram (key) and its frequency of occurrence (value). While `ngram()` preserves the sequence and has no frequency information for each ngram, `bong()` captures the frequency of each ngram and has no sequence information. Input arguments `ifn` and `idx` are optional. For special cases where an index is required, please refer to the helper function `index()`. | ||
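The frequency-counting trade-off can be illustrated with a small stand-in (`countNgrams` is a hypothetical name, not the library's function):

```javascript
// Count each ngram's occurrences: sequence information is lost while
// frequency information is kept -- the opposite trade-off to ngram().
function countNgrams( s, size = 2 ) {
  const bag = Object.create( null );
  for ( let i = 0; i + size <= s.length; i += 1 ) {
    const g = s.slice( i, i + size );
    bag[ g ] = ( bag[ g ] || 0 ) + 1;
  }
  return bag;
}

console.log( countNgrams( 'mama' ) ); // -> { ma: 2, am: 1 }
```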
### song( s, size, ifn, idx ) | ||
#### song( s, size, ifn, idx ) | ||
Generates a **s**et **o**f **ng**rams of the given `size` from the input string `s`. The default value of `size` is 2. This function returns a set containing the unique ngrams. While `ngram()` preserves the sequence and `bong()` captures the frequency of each ngram, `song()` captures only the distinct ngrams and has neither sequence nor frequency information. Input arguments `ifn` and `idx` are optional. For special cases where an index is required, please refer to the helper function `index()`. | ||
### stem( s ) | ||
#### stem( s ) | ||
The input string `s` is stemmed using the [Porter2 English Stemming Algorithm](http://snowballstem.org/algorithms/english/stemmer.html). | ||
### sentences( s, splChar ) | ||
#### sentences( s, splChar ) | ||
Splits the text contained in the input string `s` into sentences, returned in the form of an array. Punctuation marks found at the end of a sentence are retained. The function can handle sentences beginning with numbers as well, though that is not good English practice. It uses `~` as the `splChar` for splitting, so `~` must not be present in the input string; otherwise, pass another `splChar` as the second argument. | ||
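The split-character trick can be illustrated with a much-simplified stand-in (the boundary regex below is an assumption and ignores abbreviations and other cases the real function may handle):

```javascript
// Mark likely sentence boundaries with the split character, then split
// on it so end-of-sentence punctuation is retained in each sentence.
function splitSentences( text, splChar = '~' ) {
  return text
    .replace( /([.!?])\s+(?=[A-Z0-9])/g, '$1' + splChar )
    .split( splChar );
}

console.log( splitSentences( 'AI is fun! It is the future.' ) );
// -> [ 'AI is fun!', 'It is the future.' ]
```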
### tokenize0( s ) | ||
#### tokenize0( s ) | ||
@@ -138,3 +145,3 @@ Tokenizes by splitting the input string `s` on non-words. This means tokens consist of only alphas, numerals, and underscores; all other characters are stripped as they are treated as separators. However, negations are retained and amplified while all other elisions are removed. `tokenize0` is useful when the text strings are clean and do not require pre-processing like removing punctuation, extra spaces, or handling elisions. | ||
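The split-on-non-words behaviour (minus the negation handling) can be approximated in one line; `roughTokenize0` is a hypothetical stand-in, not the package's function:

```javascript
// Split on runs of non-word characters; only alphas, numerals, and
// underscores survive. Unlike tokenize0(), this sketch does not amplify
// negations before splitting.
const roughTokenize0 = ( s ) => s.split( /\W+/ ).filter( Boolean );

console.log( roughTokenize0( 'A small-world, after all!' ) );
// -> [ 'A', 'small', 'world', 'after', 'all' ]
```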
### tokenize( s ) | ||
#### tokenize( s ) | ||
@@ -152,3 +159,3 @@ The function follows the set of rules given below to remove or preserve punctuation/special characters in the input string `s`. Extra/leading/trailing spaces are removed and the string is finally split on spaces to tokenize. | ||
### phonetize( s ) | ||
#### phonetize( s ) | ||
@@ -160,3 +167,3 @@ Phonetizes the input string `s` using an algorithmic adaptation of [Metaphone](https://en.wikipedia.org/wiki/Metaphone). | ||
## tokens | ||
### tokens | ||
@@ -166,3 +173,3 @@ Tokens are created by splitting a string into words, keywords, symbols. These tokens are used as an input to various activities during text analysis. | ||
### stem( t ) | ||
#### stem( t ) | ||
@@ -172,16 +179,16 @@ Each element of the input array of tokens `t` is stemmed using the [Porter2 English Stemming Algorithm](http://snowballstem.org/algorithms/english/stemmer.html). Not to be confused with the string `stem( s )`, which stems an input string `s`, whereas this function requires a token array `t` as input. | ||
### bow( t, logcounts ) | ||
#### bow( t, logcounts ) | ||
Creates Bag of Words from the input array of tokens `t`. Specifying the `logCounts` parameter flags the use of `log2`( word counts ) instead of counts directly. The idea behind using `log2` is to ensure that a word’s importance does not increase linearly with its count. It is required as an input for computing similarity using `bow.cosine()`. | ||
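A sketch of the counting logic described above; the `log2( 1 + count )` damping shown here is one common choice and an assumption about the exact formula, not the library's source:

```javascript
// Build a bag of words; optionally damp raw counts with log2 so a
// word's weight does not grow linearly with its frequency.
// The log2( 1 + count ) formula is an illustrative assumption.
function bagOfWords( tokens, logCounts = false ) {
  const bow = Object.create( null );
  tokens.forEach( ( t ) => { bow[ t ] = ( bow[ t ] || 0 ) + 1; } );
  if ( logCounts ) {
    Object.keys( bow ).forEach( ( t ) => { bow[ t ] = Math.log2( 1 + bow[ t ] ); } );
  }
  return bow;
}

console.log( bagOfWords( [ 'rain', 'rain', 'go', 'away' ] ) );
// -> { rain: 2, go: 1, away: 1 }
```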
### sow( t, ifn, idx ) | ||
#### sow( t, ifn, idx ) | ||
Creates a Set of tokens from the input array `t`. It is required as an input for computing similarity using `Jaccard` or `Tversky` Indexes. Input arguments `ifn` and `idx` are optional, please refer to the function `index()`. | ||
Creates a Set of tokens from the input array `t`. It is required as an input for computing similarity using **Jaccard** or **Tversky** Indexes. Input arguments `ifn` and `idx` are optional, please refer to the function `index()`. | ||
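For context, the Jaccard index over two such token sets is simply intersection over union; this standalone sketch shows the computation the sets feed into (the helper name is hypothetical):

```javascript
// Jaccard index: |A ∩ B| / |A ∪ B| over two token sets.
function jaccard( a, b ) {
  const intersection = [ ...a ].filter( ( x ) => b.has( x ) ).length;
  const union = new Set( [ ...a, ...b ] ).size;
  return union === 0 ? 0 : intersection / union;
}

const s1 = new Set( [ 'mary', 'little', 'lamb' ] );
const s2 = new Set( [ 'mary', 'big', 'lamb' ] );
console.log( jaccard( s1, s2 ) ); // -> 0.5 (2 shared / 4 total)
```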
### phonetize( t ) | ||
#### phonetize( t ) | ||
The tokens in the input array `t` are phonetized using an algorithmic adaptation of [Metaphone](https://en.wikipedia.org/wiki/Metaphone). This is not to be confused with `phonetize( s )`, which phonetizes a string only. | ||
### set( t ) | ||
#### set( t ) | ||
@@ -191,3 +198,3 @@ Creates a `Set of tokens` from the input array `t`. It is required as an input for computing similarity using `Jaccard` or `Tversky` Indexes. This is not to be confused with set( s ) of string sets for computing similarity. | ||
### removeWords( t, givenStopWords ) | ||
#### removeWords( t, givenStopWords ) | ||
@@ -201,3 +208,3 @@ Removes the `givenStopWords` from the input array of tokens `t`. | ||
> #### words( w, givenMappers ) | ||
> ##### words( w, givenMappers ) | ||
@@ -210,3 +217,3 @@ > Creates stop words for `removeWords()` from an array of words (i.e. strings) and | ||
### propagateNegation( t, upto ) | ||
#### propagateNegation( t, upto ) | ||
@@ -218,6 +225,6 @@ It looks for negation tokens in the input array of tokens `t` and propagates negation to subsequent `upto` | ||
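The negation propagation described above can be sketched as follows; both the negation word list and the `!` prefix convention are assumptions for illustration:

```javascript
// Hypothetical sketch: prefix the `upto` tokens that follow a negation
// token with '!' so negated words become distinct features downstream.
// The negation list and the '!' convention are assumptions here.
function propagateNeg( tokens, upto = 2 ) {
  const negations = new Set( [ 'not', 'no', 'never' ] );
  const out = tokens.slice();
  for ( let i = 0; i < out.length; i += 1 ) {
    if ( negations.has( out[ i ] ) ) {
      const end = Math.min( i + upto, out.length - 1 );
      for ( let j = i + 1; j <= end; j += 1 ) out[ j ] = '!' + out[ j ];
    }
  }
  return out;
}

console.log( propagateNeg( [ 'not', 'a', 'good', 'movie' ] ) );
// -> [ 'not', '!a', '!good', 'movie' ]
```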
## helper | ||
### helper | ||
The `helper` name space contains functions that return function(s). They can be used to generate input arguments for the calling function. | ||
### words( w, givenMappers ) | ||
#### words( w, givenMappers ) | ||
@@ -230,3 +237,3 @@ Returns an object containing the following functions | ||
### index() | ||
#### index() | ||
Builds an index and returns 2 functions as follows: | ||
@@ -237,3 +244,3 @@ | ||
Probing the result() returns `ifn` and `idx` values for the calling function as in n `soc()`,s ong()`, `bong()`, `bow()`, and `sow(). Note: usage of `ifn` are limited by the developer’s imagination! | ||
Probing the result() returns `ifn` and `idx` values for the calling function, as in `soc()`, `song()`, `bong()`, `bow()`, and `sow()`. Note: usage of `ifn` is limited only by the developer’s imagination! | ||
@@ -240,0 +247,0 @@ |
License Policy Violation
License: This package is not allowed per your license policy. Review the package's license to ensure compliance.
Found 1 instance in 1 package