Words'n'numbers
Extracts arrays of words, and optionally numbers and emojis / emoticons, from strings. For Node.js and the browser. When you need more than just [a-z]. Part of document processing for search-index and nowsearch.xyz.
Inspired by extractwords
Initiating
Node.js
const wnn = require('words-n-numbers')
Browser
<script src="wnn.js"></script>
<script>
</script>
Use
The default regex should catch every Unicode word character, for every language.
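To make the idea of a language-independent word regex concrete, here is a minimal sketch using Unicode property escapes. The pattern `unicodeWordPattern` and the helper `matchUnicodeWords` are hypothetical names for illustration; the regex actually shipped with the library may be built differently.

```javascript
// Hypothetical stand-in pattern: one or more Unicode letters.
// This is NOT the regex shipped with words-n-numbers.
const unicodeWordPattern = /\p{L}+/gu

// Return every run of letters, or an empty array when nothing matches.
function matchUnicodeWords (text) {
  return text.match(unicodeWordPattern) || []
}

console.log(matchUnicodeWords('A ticket to 大阪 costs ¥2000'))
// [ 'A', 'ticket', 'to', '大阪', 'costs' ]
```

Digits and the currency sign are skipped because they are not in the letter category, which is the same split between words and numbers that the examples in this section rely on.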
Only words
let stringOfWords = 'A 1000000 dollars baby!'
wnn.extract(stringOfWords)
Only words, converted to lowercase
let stringOfWords = 'A 1000000 dollars baby!'
wnn.extract(stringOfWords, { toLowercase: true })
Predefined regex for words and numbers, converted to lowercase
let stringOfWords = 'A 1000000 dollars baby!'
wnn.extract(stringOfWords, { regex: wnn.wordsNumbers, toLowercase: true })
Predefined regex for words and emoticons, converted to lowercase
let stringOfWords = 'A ticket to 大阪 costs ¥2000 👌😄 😢'
wnn.extract(stringOfWords, { regex: wnn.wordsEmojis, toLowercase: true })
Predefined regex for numbers and emoticons
let stringOfWords = 'A ticket to 大阪 costs ¥2000 👌😄 😢'
wnn.extract(stringOfWords, { regex: wnn.numbersEmojis, toLowercase: true })
Predefined regex for words, numbers and emoticons, converted to lowercase
let stringOfWords = 'A ticket to 大阪 costs ¥2000 👌😄 😢'
wnn.extract(stringOfWords, { regex: wnn.wordsNumbersEmojis, toLowercase: true })
Custom regex
let stringOfWords = 'This happens at 5 o\'clock !!!'
wnn.extract(stringOfWords, { regex: '[a-z\'0-9]+' })
API
Returns an array of words and, depending on the regex used, numbers and emojis.
wnn.extract(stringOfText, <options-object>)
Options object
{
regex: '[custom or predefined regex]',
toLowercase: [true / false]
}
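The two options can be illustrated with a small stand-alone sketch. It is an illustrative re-implementation under assumptions, not the library's source: the names `extractSketch`, `toGlobalPattern` and `defaultWordPattern`, the default pattern, and the `giu` flags applied to string patterns are all assumed here.

```javascript
// Illustrative re-implementation, not the library's source.
const defaultWordPattern = /\p{L}+/gu // assumed default: runs of Unicode letters

// Turn a string or RegExp into a global RegExp so match() returns all hits.
function toGlobalPattern (regex) {
  if (typeof regex === 'string') return new RegExp(regex, 'giu') // assumed flags
  return regex.global ? regex : new RegExp(regex.source, regex.flags + 'g')
}

// Match the text against the chosen pattern, then optionally lowercase.
function extractSketch (text, options = {}) {
  const { regex = defaultWordPattern, toLowercase = false } = options
  const matches = text.match(toGlobalPattern(regex)) || []
  return toLowercase ? matches.map(token => token.toLowerCase()) : matches
}

console.log(extractSketch('A 1000000 dollars baby!', { toLowercase: true }))
// [ 'a', 'dollars', 'baby' ]
console.log(extractSketch("This happens at 5 o'clock !!!", { regex: "[a-z'0-9]+" }))
// [ 'This', 'happens', 'at', '5', "o'clock" ]
```

Lowercasing is applied after matching in this sketch, so a case-sensitive custom pattern still decides what gets matched in the first place.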
Predefined regexes
wnn.words
wnn.numbers
wnn.emojis
wnn.wordsNumbers
wnn.wordsEmojis
wnn.numbersEmojis
wnn.wordsNumbersEmojis
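One plausible way to think about the combined regexes is as alternations of simpler building blocks. The sketch below assumes three hypothetical sources based on Unicode property escapes (`patternSources`) and a hypothetical helper `combinePatterns`; the library's own definitions may differ.

```javascript
// Assumed building blocks, written with Unicode property escapes.
// These are stand-ins, not the patterns exported by words-n-numbers.
const patternSources = {
  words: '\\p{L}+',
  numbers: '\\p{N}+',
  emojis: '\\p{Emoji_Presentation}'
}

// Join the named sources into one global, Unicode-aware alternation.
function combinePatterns (...names) {
  return new RegExp(names.map(name => patternSources[name]).join('|'), 'gu')
}

const sample = 'A ticket to 大阪 costs ¥2000 👌😄 😢'
console.log(sample.match(combinePatterns('numbers', 'emojis')))
// [ '2000', '👌', '😄', '😢' ]
console.log(sample.match(combinePatterns('words', 'numbers', 'emojis')))
// [ 'A', 'ticket', 'to', '大阪', 'costs', '2000', '👌', '😄', '😢' ]
```

The emoji source matches one code point at a time, so adjacent emojis come back as separate items; multi-code-point sequences such as skin-tone variants would need a more careful pattern than this sketch uses.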
Languages supported
Supports most languages supported by stopword, and others too. Some languages, like Japanese and Simplified Chinese, need to be tokenized first because they do not separate words with spaces. Tokenizers may be added at a later stage.
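The tokenization issue can be seen with a short sketch: a letter-based pattern returns a whole Japanese sentence as a single token. As one possible workaround outside this library, the built-in `Intl.Segmenter` (available in recent Node.js versions and browsers) can split on word boundaries. The helper `segmentWords` is a hypothetical name, and the exact segments depend on the runtime's ICU data, so no segment output is shown.

```javascript
const sentence = '大阪に行きます'

// A letter-run pattern (an assumed stand-in) cannot see word boundaries here.
const asOneToken = sentence.match(/\p{L}+/gu) || []
console.log(asOneToken)
// [ '大阪に行きます' ]

// Hypothetical helper: split text into word-like segments for a locale.
// Falls back to the unsplit text when Intl.Segmenter is unavailable.
function segmentWords (text, locale) {
  if (typeof Intl === 'undefined' || typeof Intl.Segmenter !== 'function') {
    return [text]
  }
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' })
  return Array.from(segmenter.segment(text))
    .filter(part => part.isWordLike)
    .map(part => part.segment)
}

console.log(segmentWords(sentence, 'ja'))
```

The segmented output could then be joined with spaces and passed on for extraction, though dictionary-based segmentation is only an approximation of a dedicated tokenizer.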
PRs welcome
PRs and issues are more than welcome =)