Huge News!Announcing our $40M Series B led by Abstract Ventures.Learn More
Socket
Sign inDemoInstall
Socket

@leeoniya/ufuzzy

Package Overview
Dependencies
Maintainers
1
Versions
21
Alerts
File Explorer

Advanced tools

Socket logo

Install Socket

Detect and block malicious and high-risk dependencies

Install

@leeoniya/ufuzzy

A fuzzy matcher that doesn't suck

  • 0.5.1
  • Source
  • npm
  • Socket score

Version published
Weekly downloads
57K
decreased by-19.92%
Maintainers
1
Weekly downloads
 
Created
Source

▒ μFuzzy

A tiny, efficient, fuzzy search that doesn't suck. This is my fuzzy 🐈. There are many like it, but this one is mine.


Overview

uFuzzy is a fuzzy search library designed to match a relatively short search phrase (needle) against a large list of short-to-medium phrases (haystack). It might be best described as a more forgiving String.prototype.indexOf(). Common use cases are list filtering, auto-complete/suggest, and title/name/description/filename/function searches.

Each uFuzzy match must contain all alpha-numeric characters from the needle in the same sequence, so is likely a poor fit for applications like spellcheck or fulltext/document search. However, its speed leaves ample headroom to match out-of-order terms by combining results from all permutations of the needle. When held just right, it can efficiently match against multiple object properties, too.


Features

  • Junk-free, high quality results that are dataset-independent. No need to fine-tune indexing options or boosting params to attain some arbitrary quality score cut-off.
  • Straightforward fuzziness control that can be explained to your grandma in 5min.
  • Sorting you can reason about and customize using a simple Array.sort() which gets access to each match's stats/counters. There's no composite, black box "score" to understand.
  • Concise set of options that don't interact in mysterious ways to drastically alter combined behavior.
  • Fast with low resource usage - there's no index to build, so startup is below 1ms with near-zero memory overhead. Searching a three-term phrase in a 162,000 phrase dataset takes 12ms with in-order terms or 50ms with out-of-order terms.
  • Micro, with zero dependencies - currently < 3KB min

uFuzzy demo


Demos

NOTE: The testdata.json file is a diverse 162,000 string/phrase dataset 4MB in size, so first load may be slow due to network transfer. Try refreshing once it's been cached by your browser.

First, uFuzzy in isolation to demonstrate its performance.

https://leeoniya.github.io/uFuzzy/demos/compare.html?libs=uFuzzy&search=super%20ma

Now the same comparison page, booted with fuzzysort, QuickScore, and Fuse.js:

https://leeoniya.github.io/uFuzzy/demos/compare.html?libs=uFuzzy,fuzzysort,QuickScore,Fuse&search=super%20ma

Here is the full library list but with a reduced dataset (just hearthstone_750, urls_and_titles_600) to avoid crashing your browser:

https://leeoniya.github.io/uFuzzy/demos/compare.html?lists=hearthstone_750,urls_and_titles_600&search=moo


Installation

Node

npm i @leeoniya/ufuzzy
const uFuzzy = require('@leeoniya/ufuzzy');

Browser

<script src="./dist/uFuzzy.iife.min.js"></script>

Usage

uFuzzy works in 3 phases:

  1. Filter - This filters the full haystack with a fast RegExp compiled from your needle without doing any extra ops. It returns an array of matched indices in original order.
  2. Info - This collects more detailed stats about the filtered matches, such as start offsets, fuzz level, prefix/suffix counters, etc. It also gathers substring match positions for range highlighting. Finally, it filters out any matches that don't conform to the desired prefix/suffix rules. To do all this it re-compiles the needle into two more-expensive RegExps that can partition each match. Therefore, it should be run on a reduced subset of the haystack, usually returned by the Filter phase. The uFuzzy demo is gated at <= 1,000 filtered items, before moving ahead with this phase.
  3. Sort - This does an Array.sort() to determine final result order, utilizing the info object returned from the previous phase. A custom sort function can be provided via a uFuzzy option: {sort: (info, haystack, needle) => idxsOrder}.
let haystack = [
    'puzzle',
    'Super Awesome Thing (now with stuff!)',
    'FileName.js',
    '/feeding/the/catPic.jpg',
];

let needle = 'feed cat';

let opts = {};

let uf = new uFuzzy(opts);

// pre-filter
let idxs = uf.filter(haystack, needle);

// sort/rank only when <= 1,000 items
if (idxs.length <= 1e3) {
  let info = uf.info(idxs, haystack, needle);

  // order is a double-indirection array (a re-order of the passed-in idxs)
  // this allows corresponding info to be grabbed directly by idx, if needed
  let order = uf.sort(info, haystack, needle);

  // render post-filtered & ordered matches
  for (let i = 0; i < order.length; i++) {
    // using info.idx here instead of idxs because uf.info() may have
    // further reduced the initial idxs based on prefix/suffix rules
    console.log(haystack[info.idx[order[i]]]);
  }
}
else {
  // render pre-filtered but unordered matches
  for (let i = 0; i < idxs.length; i++) {
    console.log(haystack[i]);
  }
}

Options

Options with an inter prefix apply to allowances in between search terms, while those with an intra prefix apply to allowances within each search term.

OptionDescriptionDefaultExamples
intraMaxMax number of extra chars allowed
between each char within a term
0 Searching "cat"...
0 can match: cat, scat, catch, vacate
1 also matches: cart, chapter, outcast
interMaxMax number of extra chars allowed between termsInfinity Searching "where is"...
Infinity can match: where is, where have blah wisdom
5 cannot match: where have blah wisdom
intraCharsPartial regexp for allowed extra
chars between each char within a term
[a-z\d] [a-z\d] matches only alpha-numeric (case-insensitive)
[\w-] would match alpha-numeric, undercore, and hyphen
intraFiltCallback for excluding results based on term & match(term, match, index) => true Do your own thing, maybe... - Length diff threshold
- Levenshtein distance
- Term offset or content
interCharsPartial regexp for allowed chars between terms. . matches all chars
[^a-z\d] would only match whitespace and punctuation
interLftDetermines allowable term left boundary0 Searching "mania"...
0 any - anywhere: romanian
1 loose - whitespace, punctuation, alpha-num, case-change transitions: TrackMania, maniac
2 strict - whitespace, punctuation: maniacally
interRgtDetermines allowable term right boundary0 Searching "mania"...
0 any - anywhere: romanian
1 loose - whitespace, punctuation, alpha-num, case-change transitions: ManiaStar
2 strict - whitespace, punctuation: mania_foo
sortCustom result sorting function(info, haystack, needle) => idxsOrder Default: Search sort, prioritizes full term matches and char density
Demo: Typeahead sort, prioritizes start offset and match length

A biased appraisal of similar work

Forget "apples and oranges"; me comparing text search engines is closer to "planes, trains, and cars". However, that doesnt mean we cannot gain some insight into a slice of operational behavior. This assessment is extremely narrow and, of course, biased towards my use cases, text corpus, and my complete expertise in operating my own library. It is highly probable that I'm not taking full advantage of some feature in other libraries that may significantly improve outcomes along some axis; I welcome improvement PRs from anyone with deeper library knowledge than afforded by my hasty 10min skim over any "Basic usage" example and README doc.

Search quality

Can-of-worms #1.

Before we discuss performance let's talk about search quality, because speed is irrelevant when your results are a strange medly of "Oh yeah!" and "WTF?".

Search quality is very subjective. What constitutes a good top match in a "typeahead / auto-suggest" case can be a poor match in a "search / find-all" scenario. Some solutions optimize for the latter, some for the former. It's common to find knobs that skew the results in either direction, but these are often by-feel and imperfect, being little more than a proxy to producing a single, composite match "score".

Let's take a look at some matches produced by the most popular fuzzy search library, Fuse.js and a some others for which match highlighting is implemented in the demo.

Searching for the partial term "twili", we see these results appearing above numerous obvious "twilight" results:

https://leeoniya.github.io/uFuzzy/demos/compare.html?libs=uFuzzy,fuzzysort,QuickScore,Fuse&search=twili

  • twirling
  • **The total number of received alerts that were invalid.
  • Tom Clancy's Ghost Recon Wildlands - ASIA Pre-order Standard Uplay Activation
  • theHunter™: Call of the Wild - Bearclaw Lite CB-60

Not only are these terrible matches in isolation, but they actually rank higher than literal substrings!

To continue the absurdity, finishing the search term to "twilight", still scores bizzare results higher:

https://leeoniya.github.io/uFuzzy/demos/compare.html?libs=uFuzzy,fuzzysort,QuickScore,Fuse&search=twilight

  • Magic: The Gathering - Duels of the Planeswalkers Wings of Light Unlock
  • The Wild Eight

Some engines to better with partial prefix matches, at the expense of higher startup/indexing cost:

https://leeoniya.github.io/uFuzzy/demos/compare.html?libs=uFuzzy,FlexSearch,match-sorter,MiniSearch&search=twili

Here, match-sorter returns 1,384 results, but only the first 40 are relevant. How do we know where the cut-off is?

https://leeoniya.github.io/uFuzzy/demos/compare.html?libs=uFuzzy,FlexSearch,match-sorter,MiniSearch&search=super

Performance

Can-of-worms #2.

I've tried to follow any "best performance" advice when I could find it in each library's docs, but it's a certainty that some stones were left unturned when implementing ~20 different search engines.

The task:

  1. Given a diverse list of 162,000 words and phrases, assume a Latin/Western European charset (can skip any diacritics/accents normalization)
  2. Do a case-insensitive, partial/fuzzy match of the search string "super ma"
  3. Sort the results in the most sensible way, following the Principle of least astonishment
  4. Optionally highlight the matched substrings in each result
  5. Bonus points for matches with out-of-order terms
  6. Do it with the fewest resources (CPU and RAM)
LibStarsSize (min)InitSearchHeap (peak)Retained
uFuzzy (try) ★ 02.5KB0.3ms620ms25.5MB7.5MB
Fuse.js (try) ★ 14.8k23.5KB40ms35432ms306MB14MB
FlexSearch (Light) (try) ★ 8.9k5.9KB3600ms130ms673MB316MB
Lunr.js (try) ★ 8.2k29.4KB1700ms8.7ms295MB120MB
LyraSearch (try) ★ 3.3k
match-sorter (try) ★ 3.1k7.3KB0.03ms125ms145MB7.6MB
fuzzysort (try) ★ 3k5.5KB50ms12ms132MB81MB
Wade (try) ★ 3k4KB920ms3.2ms464MB42MB
fuzzysearch (try) ★ 2.6k
Elasticlunr.js (try) ★ 1.9k18.1KB1000ms1.5ms330MB67MB
MiniSearch (try) ★ 1.5k22.4KB575ms36ms209MB64MB
Fuzzyset (try) ★ 1.3k2.8KB2900ms31ms800MB238MB
search-index (try) ★ 1.3k
LiquidMetal (try) ★ 285
ItemJS (try) ★ 260
FuzzySearch (try) ★ 184
FuzzySearch2 (try) ★ 173
QuickScore (try) ★ 1319.1KB26ms489ms260MB12.5MB
fzy (try) ★ 115
fuzzyMatch (try) ★ 0

Keywords

FAQs

Package last updated on 26 Sep 2022

Did you know?

Socket

Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.

Install

Related posts

SocketSocket SOC 2 Logo

Product

  • Package Alerts
  • Integrations
  • Docs
  • Pricing
  • FAQ
  • Roadmap
  • Changelog

Packages

npm

Stay in touch

Get open source security insights delivered straight into your inbox.


  • Terms
  • Privacy
  • Security

Made with ⚡️ by Socket Inc