wink-tokenizer
Multilingual tokenizer that automatically tags each token with its type
Tokenize sentences in Latin and Devanagari scripts using wink-tokenizer. It is a part of wink, a growing family of high-quality packages for Statistical Analysis, Natural Language Processing and Machine Learning in NodeJS.
Some of its top features are:
- Support for English, French, German, Hindi, Sanskrit, Marathi and many more.
- Intelligent tokenization of sentences containing words in more than one language.
- Automatic detection and tagging of each token's type, including word, punctuation, email, mention, hashtag, emoticon, and emoji.
Installation
Use npm to install:
npm install wink-tokenizer --save
Getting Started
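// Load wink-tokenizer and create a tokenizer instance.
var tokenizer = require( 'wink-tokenizer' );
var myTokenizer = tokenizer();
// English text mixing a mention, an email, an emoji, an emoticon and a hashtag.
var s = '@superman: hit me up on my email r2d2@gmail.com, 2 of us plan party🎉 tom at 3pm:) #fun';
myTokenizer.tokenize( s );
// French text ending in an emoticon.
s = 'Mieux vaut prévenir que guérir:-)';
myTokenizer.tokenize( s );
// Hindi text with Devanagari and Latin numerals plus an English word.
s = 'द्रविड़ ने टेस्ट में ३६ शतक जमाए, उनमें 21 विदेशी playground पर हैं।';
myTokenizer.tokenize( s );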
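Because every token is tagged with its type, a common next step is to collect tokens by tag. The sketch below is a minimal illustration and not part of wink-tokenizer: `groupByTag` is a hypothetical helper, and it assumes each token is an object with `value` and `tag` properties. The `sampleTokens` array is hand-written to stand in for a `tokenize()` result, so the tags shown are illustrative rather than captured output.

```javascript
// Hypothetical helper (not part of wink-tokenizer): groups token values by tag.
// Assumes each token looks like { value: '...', tag: '...' }.
function groupByTag( tokens ) {
  // A prototype-less object, so tag names never collide with built-in keys.
  var groups = Object.create( null );
  tokens.forEach( function ( token ) {
    if ( !groups[ token.tag ] ) groups[ token.tag ] = [];
    groups[ token.tag ].push( token.value );
  } );
  return groups;
}

// Hand-written sample standing in for the result of myTokenizer.tokenize( s );
// the values and tags are illustrative assumptions.
var sampleTokens = [
  { value: '@superman', tag: 'mention' },
  { value: 'r2d2@gmail.com', tag: 'email' },
  { value: 'party', tag: 'word' },
  { value: '#fun', tag: 'hashtag' }
];

var grouped = groupByTag( sampleTokens );
console.log( grouped.mention );
console.log( grouped.hashtag );
```

In real use, the array returned by `myTokenizer.tokenize( s )` could be passed to `groupByTag` in place of `sampleTokens`, provided its tokens have the assumed shape.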
Documentation
Check out the tokenizer API documentation to learn more.
Need Help?
If you spot a bug that has not yet been reported, raise a new issue or consider fixing it and sending a pull request.
Copyright & License
wink-tokenizer is copyright 2017-18 GRAYPE Systems Private Limited.
It is licensed under the terms of the GNU Affero General Public License as published by the Free
Software Foundation, version 3 of the License.