synonym-optimizer
Gives a score to a string depending on the variety of the synonyms used.
For instance, let's compare "The coffee is good. I love that coffee." with "The coffee is good. I love that beverage." The second alternative is better because a synonym is used for coffee, so this module gives it a better score.
The lower the score, the better.
Fully supported languages are French, German, English and Italian.
What it does / How it works:
- single words are extracted with a tokenizer, wink-tokenizer
- words are lowercased
- stopwords are removed
  - for fully supported languages, a default stopword list is included, which you can customize
  - for all other languages, no default list is included, but you can provide a custom stopword list
- for fully supported languages, words are stemmed using snowball-stemmer (for all other languages: no stemming)
- when the same word appears multiple times, it raises the score depending on the distance between the two occurrences (the closer the occurrences, the more the score is raised)
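The steps above can be sketched in a few lines of plain JavaScript. This is only an illustration under stated assumptions, not the module's implementation: the regex tokenizer and the tiny `STOPWORDS` set are stand-ins for wink-tokenizer and the per-language stopword lists, stemming is skipped, and the `1 / distance` penalty is a hypothetical formula chosen only to show that closer repetitions cost more.

```javascript
// Hypothetical stand-ins: the real module uses wink-tokenizer, per-language
// stopword lists and snowball-stemmer; none of them are used in this sketch.
const STOPWORDS = new Set(['the', 'is', 'i', 'that', 'a', 'an']);

// Lowercase the text, split it into words and drop the stopwords.
function extractWords(text) {
  return text
    .toLowerCase()
    .split(/[^a-z]+/) // naive tokenizer, unaccented English letters only
    .filter((word) => word.length > 0 && !STOPWORDS.has(word));
}

// Assumed penalty: 1 / distance between two consecutive occurrences of a word,
// so that close repetitions raise the score more than distant ones.
function sketchScore(text) {
  const words = extractWords(text);
  const lastSeen = new Map(); // word -> position of its previous occurrence
  let score = 0;
  words.forEach((word, position) => {
    if (lastSeen.has(word)) {
      score += 1 / (position - lastSeen.get(word));
    }
    lastSeen.set(word, position);
  });
  return score;
}

const repeated = sketchScore('The coffee is good. I love that coffee.');
const varied = sketchScore('The coffee is good. I love that beverage.');
// the alternative that repeats "coffee" gets the higher (worse) score
console.log(repeated, varied);
```

In this sketch, two occurrences right next to each other (distance 1) add a full point, while occurrences further apart add less.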
Designed primarily to test the output of an NLG (Natural Language Generation) system.
The stemmer is not perfect. For instance in Italian, cameriere and cameriera have the same stem (camerier), while camerieri and cameriera have different stems (camer and camerier).
Installation
npm install synonym-optimizer
Usage
const synOptimizer = require('synonym-optimizer');
// two alternatives that differ only in their last word
const alts = [
  'The coffee is good. I love that coffee.',
  'The coffee is good. I love that beverage.',
];
alts.forEach((alt) => {
  // the four trailing nulls leave the optional arguments unset
  const score = synOptimizer.scoreAlternative('en_US', alt, null, null, null, null);
  console.log(`${alt}: ${score}`);
});
The main function is scoreAlternative. It takes a string and returns its score. Arguments are:
- lang (string, mandatory): the language.
  - fully supported languages are fr_FR, en_US, de_DE and it_IT
  - with any other language (for instance Dutch nl_NL), stemming is disabled and no default stopwords are removed
- alternative (string, mandatory): the string to score
- stopWordsToAdd (string[], optional): list of stopwords to add to the standard stopword list
- stopWordsToRemove (string[], optional): list of stopwords to remove from the standard stopword list
- stopWordsOverride (string[], optional): replaces the standard stopword list
- identicals (string[][], optional): list of words that should be considered as being identical, for instance [ ['phone', 'cellphone', 'smartphone'] ].
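To make the optional arguments more concrete, here is a minimal sketch of how such lists could be combined. Both helpers, `resolveStopwords` and `buildIdenticalsMap`, are hypothetical names that are not part of the module, and the order in which the override, the additions and the removals are applied is an assumption.

```javascript
// Hypothetical helper: combines a default stopword list with the three
// optional lists. Assumed precedence: the override replaces the defaults,
// then additions are applied, then removals.
function resolveStopwords(defaults, toAdd, toRemove, override) {
  const base = override != null ? override : defaults;
  const result = new Set(base.map((word) => word.toLowerCase()));
  (toAdd || []).forEach((word) => result.add(word.toLowerCase()));
  (toRemove || []).forEach((word) => result.delete(word.toLowerCase()));
  return result;
}

// Hypothetical helper: maps every word of a group to the first word of that
// group, so that all members of the group count as the same word.
function buildIdenticalsMap(identicals) {
  const map = new Map();
  (identicals || []).forEach((group) => {
    group.forEach((word) => map.set(word.toLowerCase(), group[0].toLowerCase()));
  });
  return map;
}

const stopwords = resolveStopwords(['the', 'is', 'that'], ['very'], ['that'], null);
const identicalsMap = buildIdenticalsMap([['phone', 'cellphone', 'smartphone']]);
console.log([...stopwords]); // [ 'the', 'is', 'very' ]
console.log(identicalsMap.get('smartphone')); // phone
```

Mapping each group to a single representative means that "phone" followed by "smartphone" would be treated as a repetition by whatever scoring step comes next.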
You can also use the getBest function. Most arguments are exactly the same, but instead of alternative, use alternatives (string[]). The returned number is not the score, but simply the index of the best alternative.
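Since the lowest score is the best, picking the best alternative amounts to finding the index of the minimum score. The sketch below shows that idea with a hypothetical `indexOfBest` helper and a toy scoring function, `countRepeats`. Neither is the module's code, and keeping the first alternative on ties is an assumption.

```javascript
// Hypothetical helper, not the module's code: returns the index of the
// alternative with the lowest score (first one wins on ties, -1 if empty).
function indexOfBest(alternatives, scoreFn) {
  let bestIndex = -1;
  let bestScore = Infinity;
  alternatives.forEach((alt, index) => {
    const score = scoreFn(alt);
    if (score < bestScore) {
      bestScore = score;
      bestIndex = index;
    }
  });
  return bestIndex;
}

// Toy scoring function for the demo: number of repeated words, ignoring case.
function countRepeats(text) {
  const words = text.toLowerCase().match(/[a-z]+/g) || [];
  return words.length - new Set(words).size;
}

const best = indexOfBest(
  [
    'The coffee is good. I love that coffee.',
    'The coffee is good. I love that beverage.',
  ],
  countRepeats
);
console.log(best); // 1
```

With the module installed, the toy function could be swapped for a wrapper around scoreAlternative, called as in the Usage section.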
The tokenizer is wink-tokenizer. It works with many languages (English, French, German, Hindi, Sanskrit, Marathi etc.) but not with East Asian languages. Therefore the module will not work properly with Japanese, Chinese etc.
Adding new languages (for developers / maintainers)
- check for existence of a stopwords module: stopwords-*
- check for a stemmer in the snowball-stemmer collection (or plug in another stemmer)
- plug everything together and add tests
- find a proper tokenizer if wink-tokenizer does not work
Dependencies and licences
- wink-tokenizer to tokenize sentences in multiple languages (MIT).
- stopwords-en/de/fr/it for standard stopword lists per language (MIT).
- snowball-stemmer to stem words per language (MIT).