Security News
JavaScript Leaders Demand Oracle Release the JavaScript Trademark
In an open letter, JavaScript community leaders urge Oracle to give up the JavaScript trademark, arguing that it has been effectively abandoned through nonuse.
fuzzyset.js
Advanced tools
fuzzyset is a data structure that performs something akin to fulltext search against data to determine likely mispellings and approximate string matching. Note that this is a javascript port of a python library.
The usage is simple. Just add a string to the set, and ask for it later
by using .get
:
a = FuzzySet();
a.add("michael axiak");
a.get("micael asiak");
// will be [[0.8461538461538461, 'michael axiak']];
The result will be an array of [score, matched_value]
arrays.
The score is between 0 and 1, with 1 being a perfect match.
array
: An array of strings to initialize the data structure withuseLevenshtein
: Whether or not to use the levenshtein distance to determine the match scoring. Default: TruegramSizeLower
: The lower bound of gram sizes to use, inclusive (see Theory of operation). Default: 2gramSizeUpper
: The upper bound of gram sizes to use, inclusive (see Theory of operation). Default: 3get(value, [default], [minScore=.33])
: try to match a string to entries with a score of at least minScore (defaulted to .33), otherwise return null
or default
if it is given.add(value)
: add a value to the set returning false
if it is already in the set.length()
: return the number of items in the set.isEmpty()
: returns true if the set is empty.values()
: returns an array of the values in the set.To play with the library or see how it works, check out the interactive documentation:
First let's look at adding a string, 'michaelich' to an empty set. We first break apart the string into n-grams (strings of length n). So trigrams of 'michaelich' would look like::
'-mi'
'mic'
'ich'
'cha'
'hae'
'ael'
'eli'
'lic'
'ich'
'ch-'
Note that fuzzyset will first normalize the string by removing non word characters except for spaces and commas and force everything to be lowercase.
Next the fuzzyset essentially creates a reverse index on those grams. Maintaining a dictionary that says::
'mic' -> (1, 0)
'ich' -> (2, 0)
...
where the first number is the number of grams and the second number is the index of the item in a list that looks like::
[[3.31, 'michaelich']]
Note that we maintain this reverse index for all grams from gram_size_lower
to gram_size_upper
in the constructor.
This becomes important in a second.
To search the data structure, we take the n-grams of the query string and perform a reverse index look up. To illustrate,
let's consider looking up 'michael'
in our fictitious set containing 'michaelich'
where the gram_size_upper
and gram_size_lower
parameters are default (3 and 2 respectively).
We begin by considering first all trigrams (the value of gram_size_upper
). Those grams are::
'-mi'
'mic'
'ich'
'cha'
'el-'
Then we create a list of any element in the set that has at least one occurrence of a trigram listed above. Note that this is just a dictionary lookup 5 times. For each of these matched elements, we compute the cosine similarity between each element and the query string. We then sort to get the most similar matched elements.
If use_levenshtein
is false, then we return all top matched elements with the same cosine similarity.
If use_levenshtein
is true, then we truncate the possible search space to 50, compute a score based on the levenshtein
distance (so that we handle transpositions), and return based on that.
In the event that none of the trigrams matched, we try the whole thing again with bigrams (note though that if there are no matches, the failure to match will be quick). Bigram searching will always be slower because there will be a much larger set to order.
this:
<script type="text/javascript" src="/path/to/fuzzyset.js"></script>
or:
npm install fuzzyset.js
In a CommonJS environment, the fuzzyset.js
module exports the FuzzySet
function.
BSD
Mike Axiak mike@axiak.net
Glen Chiacchieri (http://glench.com)
FAQs
A fast fuzzy string set for JavaScript
The npm package fuzzyset.js receives a total of 13,440 weekly downloads. As such, fuzzyset.js popularity was classified as popular.
We found that fuzzyset.js demonstrated a not healthy version release cadence and project activity because the last version was released a year ago. It has 1 open source maintainer collaborating on the project.
Did you know?
Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.
Security News
In an open letter, JavaScript community leaders urge Oracle to give up the JavaScript trademark, arguing that it has been effectively abandoned through nonuse.
Security News
The initial version of the Socket Python SDK is now on PyPI, enabling developers to more easily interact with the Socket REST API in Python projects.
Security News
Floating dependency ranges in npm can introduce instability and security risks into your project by allowing unverified or incompatible versions to be installed automatically, leading to unpredictable behavior and potential conflicts.