wink-tokenizer
Comparing version 2.0.1 to 2.1.0
  {
  "name": "wink-tokenizer",
- "version": "2.0.1",
- "description": "Versatile tokenizer that automatically tags each token with its type",
+ "version": "2.1.0",
+ "description": "Multilingual tokenizer that automatically tags each token with its type",
  "keywords": [
@@ -16,2 +16,7 @@ "Tokenizer",
  "Emoticon",
+ "Multilingual",
+ "French",
+ "German",
+ "Spanish",
+ "Icelandic",
  "wink"
@@ -18,0 +23,0 @@ ],
  # wink-tokenizer
- Versatile tokenizer that automatically tags each token with its type
+ Multilingual tokenizer that automatically tags each token with its type
- ### [![Build Status](https://api.travis-ci.org/winkjs/wink-tokenizer.svg?branch=master)](https://travis-ci.org/winkjs/wink-tokenizer) [![Coverage Status](https://coveralls.io/repos/github/winkjs/wink-tokenizer/badge.svg?branch=master)](https://coveralls.io/github/winkjs/wink-tokenizer?branch=master) [![devDependencies Status](https://david-dm.org/winkjs/wink-tokenizer/dev-status.svg)](https://david-dm.org/winkjs/wink-tokenizer?type=dev)
+ ### [![Build Status](https://api.travis-ci.org/winkjs/wink-tokenizer.svg?branch=master)](https://travis-ci.org/winkjs/wink-tokenizer) [![Coverage Status](https://coveralls.io/repos/github/winkjs/wink-tokenizer/badge.svg?branch=master)](https://coveralls.io/github/winkjs/wink-tokenizer?branch=master) [![Inline docs](http://inch-ci.org/github/winkjs/wink-tokenizer.svg?branch=master)](http://inch-ci.org/github/winkjs/wink-tokenizer) [![devDependencies Status](https://david-dm.org/winkjs/wink-tokenizer/dev-status.svg)](https://david-dm.org/winkjs/wink-tokenizer?type=dev)
  [<img align="right" src="https://decisively.github.io/wink-logos/logo-title.png" width="100px" >](http://winkjs.org/)
- Tokenize sentences and also automatically tag each token as either word, email, twitter handle, or more using **`wink-tokenizer`**. It is a part of [wink](http://winkjs.org/) — a growing family of high quality packages for Statistical Analysis, Natural Language Processing and Machine Learning in NodeJS.
+ Tokenize sentences in English, French, German, Spanish, and Icelandic using **`wink-tokenizer`**. It is a part of [wink](http://winkjs.org/) — a growing family of high quality packages for Statistical Analysis, Natural Language Processing and Machine Learning in NodeJS.
+ It automatically tags each token as either word, email, twitter handle, or more.
  ### Installation
@@ -17,3 +19,3 @@
- ### Example
+ ### Getting Started
  ```javascript
@@ -24,3 +26,4 @@ // Load tokenizer.
  var myTokenizer = tokenizer();
- // Just tokenize the sentence...
+ // Tokenize a tweet.
  var s = '@superman: hit me up on my email r2d2@gmail.com, 2 of us plan party🎉 tom at 3pm:) #fun';
@@ -49,2 +52,12 @@ myTokenizer.tokenize( s );
  // { value: '#fun', tag: 'hashtag' } ]
+ // Tokenize a French sentence.
+ s = 'Mieux vaut prévenir que guérir:-)';
+ myTokenizer.tokenize( s );
+ // -> [ { value: 'Mieux', tag: 'word' },
+ // { value: 'vaut', tag: 'word' },
+ // { value: 'prévenir', tag: 'word' },
+ // { value: 'que', tag: 'word' },
+ // { value: 'guérir', tag: 'word' },
+ // { value: ':-)', tag: 'emoticon' } ]
  ```
@@ -61,5 +74,5 @@
- **wink-tokenizer** is copyright 2017 [GRAYPE Systems Private Limited](http://graype.in/).
+ **wink-tokenizer** is copyright 2017-18 [GRAYPE Systems Private Limited](http://graype.in/).
  It is licensed under the terms of the GNU Affero General Public License as published by the Free
  Software Foundation, version 3 of the License.
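The README example in this diff shows `tokenize()` returning a plain array of `{ value, tag }` objects, so downstream code can select tokens by tag with ordinary array methods. The sketch below is illustrative only: the `tokens` array is copied from the French example rather than produced by the library, and `valuesByTag` is a hypothetical helper, not part of wink-tokenizer.

```javascript
// Sample output copied from the README's French example; in real use this
// array would come from myTokenizer.tokenize( s ).
var tokens = [
  { value: 'Mieux', tag: 'word' },
  { value: 'vaut', tag: 'word' },
  { value: 'prévenir', tag: 'word' },
  { value: 'que', tag: 'word' },
  { value: 'guérir', tag: 'word' },
  { value: ':-)', tag: 'emoticon' }
];

// Hypothetical helper (not part of wink-tokenizer): collect the values of
// all tokens that carry the given tag, preserving their order.
function valuesByTag( tokenList, tag ) {
  return tokenList
    .filter( function ( t ) { return t.tag === tag; } )
    .map( function ( t ) { return t.value; } );
}

var words = valuesByTag( tokens, 'word' );
var emoticons = valuesByTag( tokens, 'emoticon' );
```

Here `words` holds the five word values in sentence order and `emoticons` holds the single `':-)'` entry.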
  // wink-tokenizer
- // Versatile tokenizer that automatically tags each token with its type.
+ // Multilingual tokenizer that automatically tags each token with its type.
  //
@@ -36,3 +36,4 @@ // Copyright (C) 2017 GRAYPE Systems Private Limited
  var rgxTime = /(?:\d|[01]\d|2[0-3]):?(?:[0-5][0-9])?\s?(?:[ap]m|hours|hrs)\b/gi;
- var rgxWord = /[a-z]+\'[a-z]{1,2}|[a-z]+s\'|[a-z]+/gi;
+ // Include [Latin-1 Supplement Unicode Block](https://en.wikipedia.org/wiki/Latin-1_Supplement_(Unicode_block))
+ var rgxWord = /[a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]+\'[a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]{1,2}|[a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]+s\'|[a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]+/gi;
  // Special regex to handle not elisions at sentence level itself.
@@ -39,0 +40,0 @@ var rgxNotElision = /([a-z])(n\'t)\b/gi;
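To see what the widened character class in `rgxWord` changes, the following sketch runs the old and new patterns from this hunk against the French sentence from the README. It uses only `String.prototype.match` on the two regexes in isolation; the names `rgxWordOld`, `rgxWordNew`, and `sentence` are introduced here for illustration, and the tokenizer's full pipeline (which applies other regexes too) is not reproduced.

```javascript
// Old and new word patterns, copied from the hunk above.
var rgxWordOld = /[a-z]+\'[a-z]{1,2}|[a-z]+s\'|[a-z]+/gi;
var rgxWordNew = /[a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]+\'[a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]{1,2}|[a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]+s\'|[a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]+/gi;

// Normalize to NFC so each accented letter is a single precomposed code
// unit (for example U+00E9 for "é"), which is what the new ranges match.
var sentence = 'Mieux vaut prévenir que guérir'.normalize( 'NFC' );

// The ASCII-only pattern breaks words at every accented letter.
var oldWords = sentence.match( rgxWordOld );
// -> [ 'Mieux', 'vaut', 'pr', 'venir', 'que', 'gu', 'rir' ]

// The Latin-1 Supplement ranges keep accented words intact.
var newWords = sentence.match( rgxWordNew );
// -> [ 'Mieux', 'vaut', 'prévenir', 'que', 'guérir' ]

// The ranges skip U+00D7 (multiplication sign) and U+00F7 (division sign),
// so those symbols still act as word boundaries.
var aroundTimes = 'a\u00D7b'.match( rgxWordNew );
// -> [ 'a', 'b' ]
```

Note that `rgxNotElision` in the same hunk still uses `[a-z]` only, so the change shown here applies to word matching and not to the "n't" handling.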