@leeoniya/ufuzzy
A tiny, efficient fuzzy search that doesn't suck. This is my fuzzy 🐈. There are many like it, but this one is mine.¹
uFuzzy is a fuzzy search library designed to match a relatively short search phrase (needle) against a large list of short-to-medium phrases (haystack). It might be best described as a more forgiving String.indexOf(). Common use cases are list filtering, auto-complete/suggest, and title/name/description/filename/function searches.
In uFuzzy's default MultiInsert mode, each match must contain all alpha-numeric characters from the needle in the same sequence;
in SingleError mode, single typos are tolerated in each term (like Damerau–Levenshtein distance = 1, but much faster).
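To make "distance = 1" concrete, here is a minimal sketch of a whole-string check for a single insertion, deletion, substitution, or adjacent transposition. The function name `withinOneError` is hypothetical, and this is a generic illustration of the idea, not how uFuzzy implements SingleError mode.

```javascript
// Hypothetical helper: true when a and b differ by at most one
// insertion, deletion, substitution, or adjacent transposition.
function withinOneError(a, b) {
  if (a === b) return true;
  const la = a.length;
  const lb = b.length;
  if (Math.abs(la - lb) > 1) return false;

  // Skip the common prefix.
  let i = 0;
  while (i < la && i < lb && a[i] === b[i]) i++;

  if (la === lb) {
    // Substitution: everything after the differing char must agree.
    if (a.slice(i + 1) === b.slice(i + 1)) return true;
    // Transposition: two adjacent chars swapped, rest must agree.
    return a[i] === b[i + 1] && a[i + 1] === b[i] && a.slice(i + 2) === b.slice(i + 2);
  }

  // Insertion/deletion: dropping one char from the longer string must align the rest.
  const [shorter, longer] = la < lb ? [a, b] : [b, a];
  return shorter.slice(i) === longer.slice(i + 1);
}

console.log(withinOneError('exmaple', 'example')); // true (transposition)
console.log(withinOneError('xamp', 'example'));    // false (too many edits)
```

A real fuzzy matcher also has to find where in a longer haystack string the term occurs, which this sketch does not attempt.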
Its .search() API can efficiently match out-of-order terms and supports multiple substring exclusions, e.g. fruit -green -melon.
When held just right, it can efficiently match against multiple object properties, too.
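The exclusion syntax can be pictured with a small standalone sketch. The helpers `splitNeedle` and `dropExcluded` are hypothetical names, and the sketch assumes exclusions are plain case-insensitive substrings; it only illustrates the semantics of a needle like fruit -green -melon, not uFuzzy's internals.

```javascript
// Hypothetical helper: separate positive terms from "-term" exclusions.
function splitNeedle(needle) {
  const terms = [];
  const excluded = [];
  for (const tok of needle.trim().split(/\s+/)) {
    if (tok.length > 1 && tok.startsWith('-')) {
      excluded.push(tok.slice(1).toLowerCase());
    } else if (tok.length > 0) {
      terms.push(tok);
    }
  }
  return { terms, excluded };
}

// Hypothetical helper: drop matched indices whose haystack entry
// contains any excluded substring (assumed case-insensitive).
function dropExcluded(haystack, idxs, excluded) {
  return idxs.filter((i) => {
    const item = haystack[i].toLowerCase();
    return !excluded.some((x) => item.includes(x));
  });
}

const fruits = ['green apple fruit', 'red apple fruit', 'watermelon fruit', 'banana fruit'];
const parsed = splitNeedle('fruit -green -melon');
console.log(parsed.terms, parsed.excluded);
console.log(dropExcluded(fruits, [0, 1, 2, 3], parsed.excluded)); // [1, 3]
```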
Sorting is done with a simple Array.sort() which gets access to each match's stats/counters. There's no composite, black box "score" to understand.
uFuzzy is optimized for the Latin/Roman alphabet and relies internally on non-unicode regular expressions.
The uFuzzy.latinize() util function may be used to strip common accents/diacritics from the haystack and needle prior to searching.
It should be possible to support other scripts (Cyrillic, Chinese, Arabic, Greek, etc) by setting {unicode: true} and replacing various uFuzzy opts, e.g. [A-Z] with \p{Alpha} or \p{sc=Cyrillic}.
Latin + Cyrillic can also be supported without the unicode flag by adding a charset range to an ASCII regexp, e.g. [\wа-яё].
There are likely performance implications for using unicode regexps that should be considered.
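The two approaches can be compared with plain JavaScript regular expressions. The sketch below does not use uFuzzy at all: `stripAccents` is a hypothetical stand-in for the idea behind accent stripping (not the implementation of uFuzzy.latinize()), and the regexps simply show an extended ASCII character class next to unicode property escapes, which require the `u` flag.

```javascript
// Hypothetical stand-in: decompose characters, then drop combining marks.
const stripAccents = (s) => s.normalize('NFD').replace(/[\u0300-\u036f]/g, '');

// ASCII word chars plus a Cyrillic range; works without the `u` flag.
const asciiPlusCyrillic = /^[\wа-яё]+$/i;

// Unicode property escapes; only valid with the `u` flag.
const anyLetter = /^\p{Alpha}+$/u;
const cyrillicOnly = /^\p{sc=Cyrillic}+$/u;

console.log(stripAccents('Crème Brûlée'));     // "Creme Brulee"
console.log(asciiPlusCyrillic.test('привет')); // true
console.log(cyrillicOnly.test('hello'));       // false
console.log(anyLetter.test('hello'));          // true
```

Which form is faster will depend on the engine and the data, so it is worth measuring both on a real haystack before committing to one.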
If you're interested in assisting with creating and testing a collection of opts recipes for non-latin scripts, please open an issue to discuss.
All searches are currently case-insensitive; it is not possible to do a case-sensitive search.
NOTE: The testdata.json file is a diverse 162,000 string/phrase dataset 4MB in size, so first load may be slow due to network transfer. Try refreshing once it's been cached by your browser.
First, uFuzzy in isolation to demonstrate its performance.
https://leeoniya.github.io/uFuzzy/demos/compare.html?libs=uFuzzy&search=super%20ma
Now the same comparison page, booted with fuzzysort, QuickScore, and Fuse.js:
Here is the full library list but with a reduced dataset (just hearthstone_750, urls_and_titles_600) to avoid crashing your browser:
Answers:
Else: https://github.com/leeoniya/uFuzzy/issues
npm i @leeoniya/ufuzzy
const uFuzzy = require('@leeoniya/ufuzzy');
<script src="./dist/uFuzzy.iife.min.js"></script>
let haystack = [
    'puzzle',
    'Super Awesome Thing (now with stuff!)',
    'FileName.js',
    '/feeding/the/catPic.jpg',
];
let needle = 'feed cat';
let opts = {};
let uf = new uFuzzy(opts);
// pre-filter
let idxs = uf.filter(haystack, needle);
// sort/rank only when <= 1,000 items
let infoThresh = 1e3;
if (idxs.length <= infoThresh) {
    let info = uf.info(idxs, haystack, needle);
    // order is a double-indirection array (a re-order of the passed-in idxs)
    // this allows corresponding info to be grabbed directly by idx, if needed
    let order = uf.sort(info, haystack, needle);
    // render post-filtered & ordered matches
    for (let i = 0; i < order.length; i++) {
        // using info.idx here instead of idxs because uf.info() may have
        // further reduced the initial idxs based on prefix/suffix rules
        console.log(haystack[info.idx[order[i]]]);
    }
}
else {
    // render pre-filtered but unordered matches
    for (let i = 0; i < idxs.length; i++) {
        console.log(haystack[idxs[i]]);
    }
}
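The double indirection mentioned in the comments above is easy to get wrong, so here is a library-free sketch of it. The `idxs`, `info`, and `order` values are hand-written, hypothetical stand-ins for what the filter, info, and sort steps might return; only the lookup pattern haystack[info.idx[order[i]]] comes from the example above.

```javascript
const haystackDemo = [
  'puzzle',
  'Super Awesome Thing (now with stuff!)',
  'FileName.js',
  '/feeding/the/catPic.jpg',
];

// Hypothetical filter output: haystack indices that passed the pre-filter.
const idxsDemo = [1, 2, 3];
// Hypothetical info output: it dropped haystack index 2, so info.idx is shorter than idxs.
const infoDemo = { idx: [1, 3] };
// Hypothetical sort output: positions within info.idx, best match first.
const orderDemo = [1, 0];

// Correct: order -> info.idx -> haystack.
const ranked = orderDemo.map((i) => haystackDemo[infoDemo.idx[i]]);

// Incorrect: indexing the original idxs can point at a dropped item.
const wrong = orderDemo.map((i) => haystackDemo[idxsDemo[i]]);

console.log(ranked); // ['/feeding/the/catPic.jpg', 'Super Awesome Thing (now with stuff!)']
console.log(wrong);  // ['FileName.js', 'Super Awesome Thing (now with stuff!)']
```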
uFuzzy provides a uf.search(haystack, needle, outOfOrder = false, infoThresh = 1e3) => [idxs, info, order] wrapper which combines the filter, info, sort steps above.
This method also implements efficient logic for matching search terms out of order.
Get your ordered matches first:
let haystack = [
    'foo',
    'bar',
    'cowbaz',
];
let needle = 'ba';
let u = new uFuzzy();
let idxs = u.filter(haystack, needle);
let info = u.info(idxs, haystack, needle);
let order = u.sort(info, haystack, needle);
Basic innerHTML highlighter (<mark>-wrapped ranges):
let innerHTML = '';
for (let i = 0; i < order.length; i++) {
    let infoIdx = order[i];
    innerHTML += uFuzzy.highlight(
        haystack[info.idx[infoIdx]],
        info.ranges[infoIdx],
    ) + '<br>';
}
console.log(innerHTML);
innerHTML highlighter with custom marking function (<b>-wrapped ranges):
let innerHTML = '';
const mark = (part, matched) => matched ? '<b>' + part + '</b>' : part;
for (let i = 0; i < order.length; i++) {
    let infoIdx = order[i];
    innerHTML += uFuzzy.highlight(
        haystack[info.idx[infoIdx]],
        info.ranges[infoIdx],
        mark,
    ) + '<br>';
}
console.log(innerHTML);
DOM/JSX element highlighter with custom marking and append functions:
let domElems = [];
const mark = (part, matched) => {
    let el = matched ? document.createElement('mark') : document.createElement('span');
    el.textContent = part;
    return el;
};
const append = (accum, part) => { accum.push(part); };
for (let i = 0; i < order.length; i++) {
    let infoIdx = order[i];
    let matchEl = document.createElement('div');
    let parts = uFuzzy.highlight(
        haystack[info.idx[infoIdx]],
        info.ranges[infoIdx],
        mark,
        [],
        append,
    );
    matchEl.append(...parts);
    domElems.push(matchEl);
}
document.getElementById('matches').append(...domElems);
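For readers who want to see what range-based highlighting does under the hood, here is a minimal standalone sketch. It assumes each ranges entry is a flat array of [start, end, start, end, ...] offsets with exclusive ends; that layout, and the function name `markRanges`, are assumptions for illustration and should be checked against uFuzzy.d.ts before relying on them.

```javascript
// Hypothetical re-implementation of the idea behind a range highlighter.
// Assumes `ranges` is a flat list of [from, to) pairs in ascending order.
function markRanges(
  str,
  ranges,
  mark = (part, matched) => (matched ? `<mark>${part}</mark>` : part),
) {
  let out = '';
  let pos = 0;
  for (let i = 0; i < ranges.length; i += 2) {
    const from = ranges[i];
    const to = ranges[i + 1];
    // Unmatched text before this range.
    if (from > pos) out += mark(str.slice(pos, from), false);
    // Matched text.
    out += mark(str.slice(from, to), true);
    pos = to;
  }
  // Unmatched tail.
  if (pos < str.length) out += mark(str.slice(pos), false);
  return out;
}

console.log(markRanges('cowbaz', [3, 5])); // cow<mark>ba</mark>z
console.log(markRanges('bar', [0, 2]));    // <mark>ba</mark>r
```

Note that neither this sketch nor a plain string concatenation escapes HTML in the haystack text, so untrusted strings should be escaped (or rendered via DOM nodes, as in the example above) before being assigned to innerHTML.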
uFuzzy has two operational modes which differ in matching strategy:
For example, in SingleError mode the following needles can still match "example":
- example - exact
- examplle - single insertion (addition)
- exemple - single substitution (replacement)
- exmaple - single transposition (swap)
- exmple - single deletion (omission)
- xamp - partial
- xmap - partial with transposition

There are 3 phases to a search:
- Filter tests the haystack with a fast RegExp compiled from your needle without doing any extra ops. It returns an array of matched indices in original order.
- Info compiles the needle into two more-expensive RegExps that can partition each match. Therefore, it should be run on a reduced subset of the haystack, usually returned by the Filter phase. The uFuzzy demo is gated at <= 1,000 filtered items, before moving ahead with this phase.
- Sort uses Array.sort() to determine final result order, utilizing the info object returned from the previous phase. A custom sort function can be provided via a uFuzzy option: {sort: (info, haystack, needle) => idxsOrder}.

The API is documented in a liberally-commented 200 LoC uFuzzy.d.ts file.
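As a rough idea of what a custom sort with the (info, haystack, needle) => idxsOrder shape could look like, the sketch below ranks by earliest match start and then by shortest haystack entry. The `info` object here is a hand-written mock, and its `idx` and `start` fields are assumptions for this sketch; consult uFuzzy.d.ts for the real set of counters.

```javascript
const haystackSort = ['mania', 'TrackMania', 'romanian', 'maniac'];

// Mock info object: one entry per match, in parallel arrays.
const infoSort = {
  idx:   [0, 1, 2, 3], // haystack index of each match
  start: [0, 5, 2, 0], // offset where each match begins
};

// Hypothetical typeahead-style sort: earliest start first, then shortest entry.
const typeaheadSort = (info, haystack, needle) => {
  const { idx, start } = info;
  return idx
    .map((_, i) => i)
    .sort((ia, ib) =>
      start[ia] - start[ib] ||
      haystack[idx[ia]].length - haystack[idx[ib]].length
    );
};

const orderSort = typeaheadSort(infoSort, haystackSort, 'mania');
console.log(orderSort.map((i) => haystackSort[infoSort.idx[i]]));
// ['mania', 'maniac', 'romanian', 'TrackMania']
```

Because the comparator only reads plain arrays, it stays easy to reason about: each `||` clause is one tie-breaker, applied in order.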
Options with an inter prefix apply to allowances in between search terms, while those with an intra prefix apply to allowances within each search term.
Option | Description | Default | Examples |
---|---|---|---|
intraMode | How term matching should be performed | 0 | 0 MultiInsert<br>1 SingleError<br>See How It Works |
intraIns | Max number of extra chars allowed between each char within a term | 0 | Searching "cat"...<br>0 can match: cat, scat, catch, vacate<br>1 also matches: cart, chapter, outcast |
interIns | Max number of extra chars allowed between terms | Infinity | Searching "where is"...<br>Infinity can match: where is, where have blah wisdom<br>5 cannot match: where have blah wisdom |
intraSub, intraTrn, intraDel | For intraMode: 1 only, error types to tolerate within terms | 0 | 0 No<br>1 Yes |
intraChars | Partial regexp for allowed insert chars between each char within a term | [a-z\d'] | [a-z\d] matches only alpha-numeric (case-insensitive)<br>[\w-] would match alpha-numeric, underscore, and hyphen |
intraFilt | Callback for excluding results based on term & match | (term, match, index) => true | Do your own thing, maybe...<br>- Length diff threshold<br>- Levenshtein distance<br>- Term offset or content |
interChars | Partial regexp for allowed chars between terms | . | . matches all chars<br>[^a-z\d] would only match whitespace and punctuation |
interLft | Determines allowable term left boundary | 0 | Searching "mania"...<br>0 any - anywhere: romanian<br>1 loose - whitespace, punctuation, alpha-num, case-change transitions: TrackMania, maniac<br>2 strict - whitespace, punctuation: maniacally |
interRgt | Determines allowable term right boundary | 0 | Searching "mania"...<br>0 any - anywhere: romanian<br>1 loose - whitespace, punctuation, alpha-num, case-change transitions: ManiaStar<br>2 strict - whitespace, punctuation: mania_foo |
sort | Custom result sorting function | (info, haystack, needle) => idxsOrder | Default: Search sort, prioritizes full term matches and char density<br>Demo: Typeahead sort, prioritizes start offset and match length |
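The intraIns, interIns, intraChars, and interChars rows can be pictured as pieces of one regular expression. The sketch below builds such a pattern by hand; `needleToRegExp` is a hypothetical helper and a deliberately simplified approximation, not the RegExp uFuzzy actually compiles (it ignores boundaries, modes, and everything else in the table).

```javascript
// Escape regexp metacharacters in a single needle character.
const escapeRe = (ch) => ch.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// Hypothetical helper: chars of each term must appear in order, with at most
// `intraIns` extra chars between them and at most `interIns` chars between terms.
function needleToRegExp(needle, {
  intraIns = 0,
  interIns = Infinity,
  intraChars = "[a-z\\d']",
  interChars = '.',
} = {}) {
  const gapInside = intraIns > 0 ? `${intraChars}{0,${intraIns}}` : '';
  const gapBetween = Number.isFinite(interIns)
    ? `${interChars}{0,${interIns}}`
    : `${interChars}*?`;
  const terms = needle
    .trim()
    .split(/\s+/)
    .map((term) => [...term].map(escapeRe).join(gapInside));
  return new RegExp(terms.join(gapBetween), 'i');
}

console.log(needleToRegExp('cat').test('vacate'));                // true
console.log(needleToRegExp('cat').test('cart'));                  // false
console.log(needleToRegExp('cat', { intraIns: 1 }).test('cart')); // true
console.log(needleToRegExp('where is', { interIns: 5 }).test('where have blah wisdom')); // false
```

The results line up with the "cat" and "where is" examples in the table, which is the only claim this sketch makes.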
This assessment is extremely narrow and, of course, biased towards my use cases, text corpus, and my complete expertise in operating my own library. It is highly probable that I'm not taking full advantage of some feature in other libraries that may significantly improve outcomes along some axis; I welcome improvement PRs from anyone with deeper library knowledge than afforded by my hasty 10min skim over any "Basic usage" example and README doc.
Can-of-worms #1.
Before we discuss performance let's talk about search quality, because speed is irrelevant when your results are a strange medley of "Oh yeah!" and "WTF?".
Search quality is very subjective. What constitutes a good top match in a "typeahead / auto-suggest" case can be a poor match in a "search / find-all" scenario. Some solutions optimize for the latter, some for the former. It's common to find knobs that skew the results in either direction, but these are often by-feel and imperfect, being little more than a proxy to producing a single, composite match "score".
Let's take a look at some matches produced by the most popular fuzzy search library, Fuse.js, and some others for which match highlighting is implemented in the demo.
Searching for the partial term "twili", we see these results appearing above numerous obvious "twilight" results:
Not only are these poor matches in isolation, but they actually rank higher than literal substrings.
Completing the search term to "twilight" still scores bizarre results higher:
Some engines do better with partial prefix matches, at the expense of higher startup/indexing cost:
Here, match-sorter returns 1,384 results, but only the first 40 are relevant. How do we know where the cut-off is?
Can-of-worms #2.
All benchmarks suck, but this one might suck more than others.
Still, something is better than a hand-wavy YMMV/do-it-yourself dismissal and certainly better than nothing.
- Each benchmark can be run by setting the libs parameter to the desired library name: https://leeoniya.github.io/uFuzzy/demos/compare.html?bench&libs=uFuzzy
- Benchmarks are run in bench mode to avoid benchmarking the DOM.
- The search phrases used are: test, chest, super ma, mania, puzz, prom rem stor, twil.

To evaluate the results for each library, or to compare several, simply visit the same page with more libs and without bench: https://leeoniya.github.io/uFuzzy/demos/compare.html?libs=uFuzzy,fuzzysort,QuickScore,Fuse&search=super%20ma.
There are several metrics evaluated:
Lib | Stars | Size (min) | Init | Search (x 86) | Heap (peak) | Retained | GC |
---|---|---|---|---|---|---|---|
uFuzzy (try) | ★ 2k | 7KB | 0.3ms | 1030ms | 26.6MB | 8MB | 30ms |
uFuzzy (try) (external prefix caching) | | | | 460ms | 27.5MB | 8MB | 30ms |
uFuzzy (try) (outOfOrder, fuzzier) | | | | 1275ms | 26.6MB | 8MB | 30ms |
uFuzzy (try) (outOfOrder, fuzzier, SingleError) | | | | 1200ms | 27.5MB | 8MB | 30ms |
------- | | | | | | | |
Fuse.js (try) | ★ 14.8k | 23.5KB | 40ms | 35600ms | 226MB | 14.5MB | 30ms |
FlexSearch (Light) (try) | ★ 8.9k | 5.9KB | 3600ms | 145ms | 673MB | 316MB | 450ms |
Lunr.js (try) | ★ 8.2k | 29.4KB | 2500ms | 1430ms | 379MB | 121MB | 200ms |
Lyra (try) | ★ 3.4k | 30KB | 4000ms | 790ms | 199MB | 89MB | 200ms |
match-sorter (try) | ★ 3.1k | 7.3KB | 0.03ms | 10000ms | 39MB | 8MB | 30ms |
fuzzysort (try) | ★ 3k | 5.5KB | 60ms | 1850ms | 174MB | 84MB | 70ms |
Wade (try) | ★ 3k | 4KB | 1000ms | 460ms | 438MB | 42MB | 100ms |
fuzzysearch (try) | ★ 2.6k | 0.2KB | 0.1ms | 1000ms | 28MB | 8MB | 20ms |
js-search (try) | ★ 2k | 17.1KB | 9400ms | 1580ms | 1760MB | 734MB | 1400ms |
Elasticlunr.js (try) | ★ 1.9k | 18.1KB | 1600ms | 1800ms | 227MB | 70MB | 130ms |
MiniSearch (try) | ★ 1.5k | 22.4KB | 650ms | 2300ms | 428MB | 64MB | 150ms |
Fuzzyset (try) | ★ 1.3k | 2.8KB | 4000ms | 1140ms | 628MB | 238MB | 600ms |
search-index (try) | ★ 1.3k | | | | | | |
sifter.js (try) | ★ 1.1k | 7.5KB | 3ms | 1140ms | 40MB | 11.3MB | 30ms |
fuzzy (try) | ★ 801 | 1.4KB | 0.05ms | 6000ms | 41MB | 8MB | 30ms |
fzf-for-js (try) | ★ 538 | 15.4KB | 75ms | 6700ms | 353MB | 190MB | 160ms |
LiquidMetal (try) | ★ 285 | 4.2KB | (crash) | | | | |
fast-fuzzy (try) | ★ 270 | 13.8KB | 850ms | 10300ms | 555MB | 165MB | 150ms |
ItemJS (try) | ★ 260 | | | | | | |
FuzzySearch (try) | ★ 184 | 3.5KB | 17ms | 10000ms | 51MB | 11.2MB | 30ms |
FuzzySearch2 (try) | ★ 173 | 19.4KB | 120ms | 6000ms | 113MB | 41MB | 30ms |
QuickScore (try) | ★ 131 | 9.1KB | 40ms | 7900ms | 172MB | 12.8MB | 30ms |
fzy (try) | ★ 115 | | | | | | |
fuzzy-tools (try) | ★ 13 | 2.8KB | 0.1ms | 6000ms | 92MB | 7.7MB | 30ms |
fuzzyMatch (try) | ★ 0 | 1KB | 0.05ms | 2500ms | 90MB | 8MB | 30ms |
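The "external prefix caching" row is not explained here, so the following is only a guess at the general idea: when a user types "su" and then "sup", every match for the longer needle must also match the shorter one, so the previous result can serve as the candidate pool. Everything in this sketch is hypothetical, including `makePrefixCachedFilter` and the stand-in `subseqFilter`; it is not the caching used in the benchmark.

```javascript
let scanned = 0; // counts how many haystack entries were tested

// Stand-in filter: needle chars (spaces ignored) must appear in order.
function subseqFilter(haystack, needle, candidates) {
  const chars = [...needle.toLowerCase()].filter((c) => c !== ' ');
  const matches = (s) => {
    let p = 0;
    for (const c of s.toLowerCase()) {
      if (p < chars.length && c === chars[p]) p++;
    }
    return p === chars.length;
  };
  const pool = candidates ?? haystack.map((_, i) => i);
  scanned += pool.length;
  return pool.filter((i) => matches(haystack[i]));
}

// Hypothetical wrapper: reuse the last result when the new needle extends the old one.
// The cache is only valid while the same haystack is being searched.
function makePrefixCachedFilter(filterFn) {
  let lastNeedle = null;
  let lastIdxs = null;
  return (haystack, needle) => {
    const reuse = lastNeedle !== null && needle.startsWith(lastNeedle);
    const idxs = filterFn(haystack, needle, reuse ? lastIdxs : null);
    lastNeedle = needle;
    lastIdxs = idxs;
    return idxs;
  };
}

const corpus = ['puzzle', 'Super Awesome Thing (now with stuff!)', 'FileName.js', '/feeding/the/catPic.jpg'];
const cachedFilter = makePrefixCachedFilter(subseqFilter);
console.log(cachedFilter(corpus, 'su'), scanned);  // [1] 4
console.log(cachedFilter(corpus, 'sup'), scanned); // [1] 5
```

This shortcut is only sound when a longer needle can never match something the shorter one rejected, which is worth verifying for whatever matching mode is in use.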