happynodetokenizer
A basic Twitter aware tokenizer for Javascript environments.
A Typescript port of HappyFunTokenizer.py by Christopher Potts and HappierFunTokenizing.py by H. Andrew Schwartz.
npm install --save happynodetokenizer
HappyNodeTokenizer exports a function called tokenizer(), which takes an optional configuration object (see "The Options Object" below).
import { tokenizer } from 'happynodetokenizer';
const text = 'RT @ #happyfuncoding: this is a typical Twitter tweet :-)';
// these are the default options
const opts = {
'mode': 'stanford',
'normalize': undefined,
'preserveCase': true,
};
// create a tokenizer instance with our options
const myTokenizer = tokenizer(opts);
// calling myTokenizer returns a generator function
const tokenGenerator = myTokenizer(text);
// you can turn the generator into an array of token objects like this:
const tokens = [...tokenGenerator()];
// you can also convert token objects to an array of strings like this:
const values = Array.from(tokens, (token) => token.value);
The tokens variable in the above example will look like this:
[
{ end: 1, start: 0, tag: 'word', value: 'rt' },
{ end: 3, start: 3, tag: 'punct', value: '@' },
{ end: 19, start: 5, tag: 'hashtag', value: '#happyfuncoding' },
{ end: 20, start: 20, tag: 'punct', value: ':' },
{ end: 25, start: 22, tag: 'word', value: 'this' },
{ end: 28, start: 27, tag: 'word', value: 'is' },
{ end: 30, start: 30, tag: 'word', value: 'a' },
{ end: 38, start: 32, tag: 'word', value: 'typical' },
{ end: 46, start: 40, tag: 'word', value: 'twitter' },
{ end: 52, start: 48, tag: 'word', value: 'tweet' },
{ end: 56, start: 54, tag: 'emoticon', value: ':-)' }
]
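In the sample output, start and end appear to be inclusive character offsets into the input string. The sketch below uses them to recover the exact source text of a token. The OffsetToken interface and the sourceText helper are hypothetical names written for this illustration, not part of the library's API, and the inclusive-end reading is an assumption drawn from the sample output only.

```typescript
// Hypothetical shape mirroring the sample output; not the library's exported type.
interface OffsetToken {
  start: number;
  end: number;
  tag: string;
  value: string;
}

// Recover the source text of a token, assuming `end` is an inclusive offset.
function sourceText(text: string, token: OffsetToken): string {
  return text.slice(token.start, token.end + 1);
}

const sampleText = 'RT @ #happyfuncoding: this is a typical Twitter tweet :-)';

// Token objects copied from the sample output.
const first: OffsetToken = { end: 1, start: 0, tag: 'word', value: 'rt' };
const hashtag: OffsetToken = { end: 19, start: 5, tag: 'hashtag', value: '#happyfuncoding' };
const emoticon: OffsetToken = { end: 56, start: 54, tag: 'emoticon', value: ':-)' };

console.log(sourceText(sampleText, first));    // 'RT'
console.log(sourceText(sampleText, hashtag));  // '#happyfuncoding'
console.log(sourceText(sampleText, emoticon)); // ':-)'
```

Slicing the input this way returns the text as originally written, which can differ in case from the token's value.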
Where preserveCase in the Options Object is false, each result object may also contain a variation property, which presents the token as originally matched if it differs from the value property. E.g.:
[
{ end: 1, start: 0, tag: 'word', value: 'rt', variation: 'RT' },
{ end: 3, start: 3, tag: 'punct', value: '@' },
{ end: 19, start: 5, tag: 'hashtag', value: '#happyfuncoding' },
...
{ end: 46, start: 40, tag: 'word', value: 'twitter', variation: 'Twitter' },
...
]
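Because variation is only present when the matched text differs from value, code that wants the original form needs a fallback. A minimal sketch, assuming token objects shaped like the ones above; CasedToken and originalForm are hypothetical names used only for this illustration:

```typescript
// Hypothetical shape mirroring the sample output; `variation` is optional.
interface CasedToken {
  start: number;
  end: number;
  tag: string;
  value: string;
  variation?: string;
}

// Return the token as originally matched, falling back to `value`
// when no `variation` was recorded.
function originalForm(token: CasedToken): string {
  return token.variation ?? token.value;
}

// Token objects copied from the sample output.
const lowered: CasedToken[] = [
  { end: 1, start: 0, tag: 'word', value: 'rt', variation: 'RT' },
  { end: 3, start: 3, tag: 'punct', value: '@' },
  { end: 46, start: 40, tag: 'word', value: 'twitter', variation: 'Twitter' },
];

const originals = lowered.map(originalForm);
console.log(originals); // [ 'RT', '@', 'Twitter' ]
```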
The options object and its properties are optional. The defaults are:
{
'mode': 'stanford',
'normalize': undefined,
'preserveCase': true,
};
mode
string - valid options: stanford (default), or dlatk
stanford mode uses the original HappyFunTokenizer pattern. See Github.
dlatk mode uses the modified HappierFunTokenizing pattern. See Github.
normalize
string - valid options: "NFC" | "NFD" | "NFKC" | "NFKD" (default = undefined)
Normalize strings (e.g., when set, mañana becomes manana).
Normalization is disabled when set to null or undefined (default).
preserveCase
boolean - valid options: true (default), or false
Preserves the case of the input string if true; otherwise all tokens are converted to lowercase. Does not affect emoticons.
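To make the mañana example concrete, here is one way such a result can come about with the standard String.prototype.normalize method: a decomposing form (NFD or NFKD) splits ñ into n plus a combining tilde, and the combining mark can then be stripped. This is a sketch of the general technique, not a description of the library's internals, which are not shown here. stripDiacritics and COMBINING_MARKS are hypothetical names.

```typescript
// Combining Diacritical Marks block (U+0300 to U+036F).
const COMBINING_MARKS = /[\u0300-\u036f]/g;

// Decompose the string, then drop the combining marks left behind.
function stripDiacritics(input: string, form: 'NFD' | 'NFKD' = 'NFD'): string {
  return input.normalize(form).replace(COMBINING_MARKS, '');
}

const word = 'ma\u00F1ana'; // 'mañana' with a precomposed ñ (6 code units)
const decomposed = word.normalize('NFD');

console.log(decomposed.length);     // 7: ñ became 'n' + U+0303
console.log(stripDiacritics(word)); // 'manana'

// A composing form keeps ñ as a single code point, so stripping
// combining marks after NFC leaves the word unchanged.
console.log(word.normalize('NFC').replace(COMBINING_MARKS, '') === word); // true
```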
HappyNodeTokenizer yields token objects. As the examples above show, each token object has the properties start, end, value, and tag, plus variation where applicable. The value is the token itself, start and end are the positions of the token's first and last characters in the input string, and the tag is a descriptor based on one of the following, depending on which mode you are using:
Tag | Stanford | DLATK | Example |
---|---|---|---|
phone | :heavy_check_mark: | :heavy_check_mark: | +1 (800) 123-4567 |
url | :x: | :heavy_check_mark: | http://www.youtube.com |
url_scheme | :x: | :heavy_check_mark: | http:// |
url_authority | :x: | :heavy_check_mark: | [0-3] |
url_path_query | :x: | :heavy_check_mark: | /index.html?s=search |
htmltag | :x: | :heavy_check_mark: | <em class='grumpy'> |
emoticon | :heavy_check_mark: | :heavy_check_mark: | >:( |
username | :heavy_check_mark: | :heavy_check_mark: | @somefaketwitterhandle |
hashtag | :heavy_check_mark: | :heavy_check_mark: | #tokenizing |
punct | :heavy_check_mark: | :heavy_check_mark: | , |
word | :heavy_check_mark: | :heavy_check_mark: | hello |
<UNK> | :heavy_check_mark: | :heavy_check_mark: | (anything left unmatched) |
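Since every token carries a tag, a common follow-up step is to group token values by tag, for example to pull out all hashtags or emoticons. A minimal sketch, assuming token objects with tag and value properties as in the sample output; TaggedToken and valuesByTag are hypothetical names for this illustration:

```typescript
// Hypothetical minimal shape; only the two properties used here.
interface TaggedToken {
  tag: string;
  value: string;
}

// Group token values by their tag, preserving input order within each group.
function valuesByTag(tokens: TaggedToken[]): Record<string, string[]> {
  const groups: Record<string, string[]> = {};
  for (const token of tokens) {
    if (!groups[token.tag]) {
      groups[token.tag] = [];
    }
    groups[token.tag].push(token.value);
  }
  return groups;
}

// A subset of the sample output shown earlier.
const sampleTokens: TaggedToken[] = [
  { tag: 'word', value: 'rt' },
  { tag: 'punct', value: '@' },
  { tag: 'hashtag', value: '#happyfuncoding' },
  { tag: 'punct', value: ':' },
  { tag: 'word', value: 'tweet' },
  { tag: 'emoticon', value: ':-)' },
];

const grouped = valuesByTag(sampleTokens);
console.log(grouped.hashtag); // [ '#happyfuncoding' ]
console.log(grouped.punct);   // [ '@', ':' ]
```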
To compare the results of HappyNodeTokenizer against HappyFunTokenizer and HappierFunTokenizing, run:
npm run test
The goal of this project is to provide an accurate port of HappyFunTokenizer and HappierFunTokenizing. Therefore, any pull requests with test failures will not be accepted.
Based on HappyFunTokenizer.py by Christopher Potts and HappierFunTokenizing.py by H. Andrew Schwartz.
Uses the "he" library by Mathias Bynens under the MIT license.
(C) 2017-24 P. Hughes. All rights reserved.
Shared under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported license.