wink-nlp-utils - npm Package Compare versions

Comparing version 1.0.1 to 1.0.2


package.json
{
"name": "wink-nlp-utils",
"version": "1.0.1",
"version": "1.0.2",
"description": "Natural Language Processing Utilities that let you tokenize, stem, phonetize, create ngrams, bag of words and more.",

@@ -5,0 +5,0 @@ "keywords": [

@@ -1,15 +0,20 @@

# wink-nlp-utils [![Build Status](https://api.travis-ci.org/decisively/wink-nlp-utils.svg?branch=master)](https://travis-ci.org/decisively/wink-nlp-utils) [![Coverage Status](https://coveralls.io/repos/github/decisively/wink-nlp-utils/badge.svg?branch=master)](https://coveralls.io/github/decisively/wink-nlp-utils?branch=master)
> Natural Language Processing Utilities that let you tokenize, stem, phonetize, create ngrams, bag of words and more.
# wink-nlp-utils
**wink-nlp-utils** is a part of **wink**, which is a collection of Machine Learning utilities.
> Easily tokenize, stem, phonetize, remove stop words, manage elisions, create ngrams, bag of words and more.
Prepares raw text for NLP. It operates on **[strings](#string)** such as names, sentences, paragraphs and **[tokens](#tokens)** represented as an array of strings.
The following code snippet illustrates how to use the *wink-nlp-utils* module
### [![Build Status](https://api.travis-ci.org/decisively/wink-nlp-utils.svg?branch=master)](https://travis-ci.org/decisively/wink-nlp-utils) [![Coverage Status](https://coveralls.io/repos/github/decisively/wink-nlp-utils/badge.svg?branch=master)](https://coveralls.io/github/decisively/wink-nlp-utils?branch=master)
<img align="right" src="https://decisively.github.io/wink-logos/logo-title.png" width="100px" >
**wink-nlp-utils** is a part of **[wink](https://www.npmjs.com/~sanjaya)**, which is a family of Machine Learning NPM packages. They consist of simple and/or higher order functions that can be combined with NodeJS `stream` and `child processes` to create recipes for analytics driven business solutions.
Prepares raw text for Natural Language Processing (NLP). It offers a set of **[APIs](#apis)** to work on **[strings](#string)** such as names, sentences, paragraphs and **[tokens](#tokens)** represented as an array of strings/words. They perform the required pre-processing for ML tasks such as **similarity detection**, **classification**, and **semantic search**.
## Installation
Use **npm** to install:
Use **[npm](https://www.npmjs.com/package/wink-nlp-utils)** to install:
```
npm install wink-nlp-utils
npm install wink-nlp-utils --save
```

@@ -23,8 +28,8 @@

// Load Prepare Text
var prepare = require( 'wink-nlp-utils' );
// Load wink-nlp-utils
var nlp = require( 'wink-nlp-utils' );
// Use a string Function
// Input argument is a string
var name = prepare.string.extractPersonsName( 'Dr. Sarah Connor M. Tech., PhD. - AI' );
var name = nlp.string.extractPersonsName( 'Dr. Sarah Connor M. Tech., PhD. - AI' );
// name -> 'Sarah Connor'

@@ -34,3 +39,3 @@

// Input argument is an array of tokens; remove stop words.
var t = prepare.tokens.removeWords( [ 'mary', 'had', 'a', 'little', 'lamb' ] );
var t = nlp.tokens.removeWords( [ 'mary', 'had', 'a', 'little', 'lamb' ] );
// t -> [ 'mary', 'little', 'lamb' ]

@@ -40,14 +45,16 @@

## string
## APIs
### lowerCase( s )
### string
#### lowerCase( s )
Converts the input string `s` to lower case.
### upperCase( s )
#### upperCase( s )
Converts the input string `s` to upper case.
### trim( s )
#### trim( s )
Trims leading and trailing spaces from the input string `s`.
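For illustration, a minimal sketch of these three helpers in the quick-start style above (not part of the package's own examples); the commented outputs follow directly from the descriptions.

```
// Load wink-nlp-utils
var nlp = require( 'wink-nlp-utils' );

// Basic case and whitespace handling on strings
var lc = nlp.string.lowerCase( 'Sarah Connor' );
// lc -> 'sarah connor'
var uc = nlp.string.upperCase( 'Sarah Connor' );
// uc -> 'SARAH CONNOR'
var tr = nlp.string.trim( '   Sarah Connor   ' );
// tr -> 'Sarah Connor'
```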
### removeExtraSpaces( s )
#### removeExtraSpaces( s )

@@ -57,3 +64,3 @@ Removes leading & trailing white spaces along with any extra spaces appearing in between from the input

### retainAlphaNums( s )
#### retainAlphaNums( s )

@@ -63,3 +70,3 @@ Retains only alpha-numerals and spaces and removes all other characters, including leading/trailing/extra spaces from

### extractPersonsName( s )
#### extractPersonsName( s )

@@ -72,15 +79,15 @@ Attempts to extract person's name from input string `s` in formats like

### extractRunOfCapitalWords( s )
#### extractRunOfCapitalWords( s )
Returns an array of words appearing as Title Case or in ALL CAPS in the input string `s`.
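A hedged sketch of `extractRunOfCapitalWords()`; the example input is made up and the exact grouping in the returned array may differ.

```
var nlp = require( 'wink-nlp-utils' );

// Pull out runs of Title Case / ALL CAPS words from a sentence
var runs = nlp.string.extractRunOfCapitalWords( 'Sarah Connor works at Cyberdyne Systems on AI' );
// runs -> expected to include entries such as 'Sarah Connor' and 'Cyberdyne Systems'
```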
### removePunctuations( s )
#### removePunctuations( s )
Replaces each punctuation mark with a space. It looks for `.,;!?:"!'... - () [] {}` in the input string `s` and replaces each occurrence with a space. Use `removeExtraSpaces( s )` to remove the resulting extra spaces.
### removeSplChars( s )
#### removeSplChars( s )
Removes special characters like `~@#%^*+=` from the input string `s` by replacing each with a space. The resulting extra spaces can be removed using `removeExtraSpaces( s )`.
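These two functions are typically chained with `removeExtraSpaces()`; a hedged sketch follows, where the commented result is indicative rather than exact.

```
var nlp = require( 'wink-nlp-utils' );

// Punctuation and special characters are replaced by spaces...
var s = nlp.string.removePunctuations( 'AI Inc. is focusing on #AI (deep-learning)!' );
s = nlp.string.removeSplChars( s );
// ...so removeExtraSpaces() is used to squeeze the result back
s = nlp.string.removeExtraSpaces( s );
// s -> roughly 'AI Inc is focusing on AI deep learning'
```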
### removeHTMLTags( s )
#### removeHTMLTags( s )

@@ -90,3 +97,3 @@ Removes HTML tags, escape sequences from the input string `s` and replaces it by space character. These can be removed using `removeExtraSpaces( s )`.

### removeElisions( s )
#### removeElisions( s )

@@ -96,39 +103,39 @@ Removes basic elisions found in the input string `s`. An `I'll` becomes `I`, `Isn't` becomes `Is`. An apostrophe found in the string `s` remains as is.

### splitElisions( s )
#### splitElisions( s )
Splits elisions from the input string `s` by inserting a space. Elisions like `we're` or `I'm` are split as `we re` or `I m`.
### amplifyNotElision( s )
#### amplifyNotElision( s )
Amplifies the not elision by replacing it with the word `not` in the input string `s`; it must be used before calling `removeElisions()`. `Can't`, `Isn't`, `Haven't` are amplified as `Ca not`, `Is not`, `Have not`.
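A hedged sketch of the three elision helpers; the outputs in the comments follow the descriptions above and may differ slightly in spacing.

```
var nlp = require( 'wink-nlp-utils' );

// Amplify negative elisions first, then remove/split the remaining ones
var s1 = nlp.string.amplifyNotElision( "Isn't it?" );
// s1 -> roughly 'Is not it?'
var s2 = nlp.string.removeElisions( "I'll be back" );
// s2 -> roughly 'I be back'
var s3 = nlp.string.splitElisions( "we're" );
// s3 -> roughly 'we re'
```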
### marker( s )
#### marker( s )
Generates a `marker` for the input string `s`: its 1-grams, sorted and joined back into a string; useful as a quick and aggressive way to detect similarity in short strings. Its aggression leads to more false positives, such as `Meter` and `Metre` or `no melon` and `no lemon`.
### soc( s, ifn, idx )
#### soc( s, ifn, idx )
Creates a `set of characters (soc)` from the input string `s`. This is useful in even more aggressive string matching using **Jaccard** or **Tversky** Indexes as compared to `marker()`.
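A hedged sketch of `marker()` and `soc()`; the marker value shown is only indicative of "sorted unique characters joined back into a string", and `soc()` is assumed to return a set-like collection of characters for Jaccard/Tversky style matching.

```
var nlp = require( 'wink-nlp-utils' );

// Aggressive similarity signatures for short strings
var m = nlp.string.marker( 'metre' );
// m -> the sorted unique characters of 'metre', e.g. something like 'emrt'
// marker( 'meter' ) yields the same value -- that is the intended aggression

var chars = nlp.string.soc( 'metre' );
// chars -> a set of the characters in 'metre'
```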
### ngram( s, size )
#### ngram( s, size )
Generates ngrams of the given `size` from the input string `s`. The default value of `size` is 2. The function returns an array of ngrams. If `0` is given as the `size` parameter, ngrams of size 2 are returned.
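A short sketch of `ngram()`; the commented output follows directly from the definition of character bigrams.

```
var nlp = require( 'wink-nlp-utils' );

// Character bigrams (size defaults to 2)
var ng = nlp.string.ngram( 'mary', 2 );
// ng -> [ 'ma', 'ar', 'ry' ]
```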
### bong( s, size, ifn, idx )
#### bong( s, size, ifn, idx )
Generates a **b**ag **o**f **ng**rams of the given `size` from the input string `s`. The default value of `size` is 2. This function returns an object containing each ngram (key) and its frequency of occurrence (value). While `ngram()` preserves the sequence but has no frequency information, `bong()` captures the frequency of each ngram but has no sequence information. Input arguments `ifn` and `idx` are optional. For special cases where an index is required, please refer to the helper function `index()`.
### song( s, size, ifn, idx )
#### song( s, size, ifn, idx )
Generates a **s**et **o**f **ng**rams of the given `size` from the input string `s`. The default value of `size` is 2. This function returns an object containing each ngram (key) and its frequency of occurrence (value). While `ngram()` preserves the sequence but has no frequency information, `song()` captures the frequency of each ngram but has no sequence information. Input arguments `ifn` and `idx` are optional. For special cases where an index is required, please refer to the helper function `index()`.
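A hedged sketch contrasting `bong()` and `song()`; key order and the exact shape of the returned values may differ from the comments.

```
var nlp = require( 'wink-nlp-utils' );

// Bag of ngrams: frequency per ngram, no sequence information
var bag = nlp.string.bong( 'mama', 2 );
// bag -> roughly { ma: 2, am: 1 }

// Set of ngrams of the same string
var sng = nlp.string.song( 'mama', 2 );
// sng -> the distinct bigrams of 'mama' (see the description above for the exact shape)
```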
### stem( s )
#### stem( s )
The input string `s` is stemmed using the [Porter2 English Stemming Algorithm](http://snowballstem.org/algorithms/english/stemmer.html).
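A sketch of string stemming; `run` is the expected Porter2 stem of `running`, though stems of other words can be less intuitive.

```
var nlp = require( 'wink-nlp-utils' );

// Porter2 stemming of a single string
var st = nlp.string.stem( 'running' );
// st -> 'run'
```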
### sentences( s, splChar )
#### sentences( s, splChar )
Splits the text contained in the input string `s` into sentences, returned in the form of an array. Punctuation marks found at the end of a sentence are retained. The function can handle sentences beginning with numbers as well, though this is not good English practice. It uses `~` as the `splChar` for splitting, so `~` must not be present in the input string; otherwise pass another `splChar` as the second argument.
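A hedged sketch of sentence splitting, assuming the default `~` split character and straightforward end-of-sentence punctuation.

```
var nlp = require( 'wink-nlp-utils' );

// Split a paragraph into sentences (terminal punctuation is retained)
var sents = nlp.string.sentences( 'AI is the new electricity. It is transforming every industry!' );
// sents -> roughly [ 'AI is the new electricity.', 'It is transforming every industry!' ]
```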
### tokenize0( s )
#### tokenize0( s )

@@ -138,3 +145,3 @@ Tokenizes by splitting the input string `s` on non-words. This means tokens would consists of only alphas, numerals and underscores; all other characters will be stripped as they are treated as separators. However negations are retained and amplified but all other elisions are removed. `Tokenize0` is useful when the text strings are clean and do not require pre-processing like removing punctuations,extra spaces, handling elisions etc.

### tokenize( s )
#### tokenize( s )

@@ -152,3 +159,3 @@ The function follows set of rules given below to remove and preserve punctuation/special characters in the input string `s`. The Extra/leading/trailing spaces are removed and finally split on space to tokenize.
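A hedged sketch contrasting `tokenize0()` and `tokenize()`; the exact token lists depend on the rules summarised above, so the comments only indicate the general shape of the output.

```
var nlp = require( 'wink-nlp-utils' );

// tokenize0: split on non-words -- best for already clean text
var t0 = nlp.string.tokenize0( 'She ate 3 apples and 2 oranges' );
// t0 -> an array of tokens consisting of alphas, numerals and underscores only

// tokenize: applies the punctuation/elision rules described above before splitting
var t1 = nlp.string.tokenize( "She'll visit AI Inc. on the 24th, right?" );
// t1 -> an array of tokens with punctuation and elisions handled per the rules above
```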

### phonetize( s )
#### phonetize( s )

@@ -160,3 +167,3 @@ Phonetizes the input string `s` using an algorithmic adaptation of [Metaphone](https://en.wikipedia.org/wiki/Metaphone).

## tokens
### tokens

@@ -166,3 +173,3 @@ Tokens are created by splitting a string into words, keywords, symbols. These tokens are used as an input to various activities during text analysis.

### stem( t )
#### stem( t )

@@ -172,16 +179,16 @@ Each element of input array of tokens `t` is stemmed using [Porter2 English Stemming Algorithm](http://snowballstem.org/algorithms/english/stemmer.html). Not to be confused with the stem() under string as it performs stemming on the input string `s`, whereas this function requires an token array `t` as an input.

### bow( t, logcounts )
#### bow( t, logcounts )
Creates Bag of Words from the input array of tokens `t`. Specifying the `logCounts` parameter flags the use of `log2`( word counts ) instead of counts directly. The idea behind using `log2` is to ensure that a word’s importance does not increase linearly with its count. It is required as an input for computing similarity using `bow.cosine()`.
### sow( t, ifn, idx )
#### sow( t, ifn, idx )
Creates a Set of tokens from the input array `t`. It is required as an input for computing similarity using `Jaccard` or `Tversky` Indexes. Input arguments `ifn` and `idx` are optional, please refer to the function `index()`.
Creates a Set of tokens from the input array `t`. It is required as an input for computing similarity using **Jaccard** or **Tversky** Indexes. Input arguments `ifn` and `idx` are optional, please refer to the function `index()`.
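A hedged sketch of `bow()` and `sow()` on the same token array; the bag shown follows from simple counting, and `sow()` is assumed to return a set-like collection of unique tokens.

```
var nlp = require( 'wink-nlp-utils' );

// Bag of words: token -> count
var bag = nlp.tokens.bow( [ 'rain', 'rain', 'go', 'away' ] );
// bag -> roughly { rain: 2, go: 1, away: 1 }

// Set of words: unique tokens, for Jaccard/Tversky similarity
var sw = nlp.tokens.sow( [ 'rain', 'rain', 'go', 'away' ] );
// sw -> a set containing 'rain', 'go', 'away'
```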
### phonetize( t )
#### phonetize( t )
An array of tokens `t` is phonetized using an algorithmic adaptation of [Metaphone](https://en.wikipedia.org/wiki/Metaphone). This is not to be confused with `phonetize( s )`, which phonetizes a string.
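A hedged sketch of phonetization; the exact codes produced by the Metaphone adaptation are not shown since they are implementation specific, but similar-sounding words are expected to map to the same code.

```
var nlp = require( 'wink-nlp-utils' );

// Phonetize a whole token array...
var p = nlp.tokens.phonetize( [ 'son', 'sun' ] );
// p -> both entries are expected to share the same phonetic code

// ...or a single string
var ps = nlp.string.phonetize( 'philosophy' );
// ps -> a short phonetic code for 'philosophy'
```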
### set( t )
#### set( t )

@@ -191,3 +198,3 @@ Creates a `Set of tokens` from the input array `t`. It is required as an input for computing similarity using `Jaccard` or `Tversky` Indexes. This is not to be confused with set( s ) of string sets for computing similarity.

### removeWords( t, givenStopWords )
#### removeWords( t, givenStopWords )

@@ -201,3 +208,3 @@ Removes the `givenStopWords` from the input array of tokens `t`.

> #### words( w, givenMappers )
> ##### words( w, givenMappers )

@@ -210,3 +217,3 @@ > Creates stop words for `removeWords()` from an array of words (i.e. strings) and

### propagateNegation( t, upto )
#### propagateNegation( t, upto )

@@ -218,6 +225,6 @@ It looks for negation tokens in the input array of tokens `t` and propagates negation to subsequent `upto`
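A hedged sketch of `propagateNegation()`; the marking of negated tokens (shown here with a `!` prefix) is an assumption based on common wink examples and may differ in this version.

```
var nlp = require( 'wink-nlp-utils' );

// Propagate the negation to the 2 tokens following 'not'
var t = nlp.tokens.propagateNegation( [ 'mary', 'is', 'not', 'feeling', 'good' ], 2 );
// t -> roughly [ 'mary', 'is', 'not', '!feeling', '!good' ]
```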

## helper
### helper
The `helper` namespace contains functions that return function(s). They can be used to generate input arguments for the calling function.
### words( w, givenMappers )
#### words( w, givenMappers )

@@ -230,3 +237,3 @@ Returns an object contains the following functions
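A hedged sketch of building custom stop words with `helper.words()` and passing them to `tokens.removeWords()`; the assumption that `removeWords()` accepts the object returned by `helper.words()` as its `givenStopWords` argument follows from the descriptions above.

```
var nlp = require( 'wink-nlp-utils' );

// Build a custom stop-word set...
var myStopWords = nlp.helper.words( [ 'had', 'a', 'little' ] );

// ...and use it instead of the default stop words
var t = nlp.tokens.removeWords( [ 'mary', 'had', 'a', 'little', 'lamb' ], myStopWords );
// t -> roughly [ 'mary', 'lamb' ]
```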

### index()
#### index()
Builds an index and returns 2 functions as follows:

@@ -237,3 +244,3 @@

Probing the result() returns `ifn` and `idx` values for the calling function as in n `soc()`,s ong()`, `bong()`, `bow()`, and `sow(). Note: usage of `ifn` are limited by the developer’s imagination!
Probing the result() returns `ifn` and `idx` values for the calling function as in `soc()`, `song()`, `bong()`, `bow()`, and `sow()`. Note: usage of `ifn` is limited only by the developer's imagination!
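A heavily hedged sketch of `helper.index()`: the description above names `result()` explicitly, while the name of the second returned function (assumed here to be `build()`) and the exact way `ifn`/`idx` are passed are assumptions, not documented behaviour.

```
var nlp = require( 'wink-nlp-utils' );

// Create an index helper; pass its build() as ifn and a chosen idx to an index-aware API
var myIndex = nlp.helper.index();        // assumed to expose build() and result()
var bag = nlp.string.bong( 'mary had a little lamb', 2, myIndex.build, 1 );

// Probe the index built so far
var idx = myIndex.result();
```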

@@ -240,0 +247,0 @@
