Security News
Fluent Assertions Faces Backlash After Abandoning Open Source Licensing
Fluent Assertions is facing backlash after dropping the Apache license for a commercial model, leaving users blindsided and questioning contributor rights.
Approximate String SearcheRS is a Python library written in Rust for querying an index of strings for the closest match. It implements a Levenshtein Automaton for quickly searching a trie.
pip install assrs
from assrs import BKTree, Trie, levenshtein
trie = Trie(["foo", "bar"])
trie.find_one("baz")
# ("bar", 1)
trie.find_one("abc", max_edits=1)
# None
tree = BKTree(["foo", "bar"])
tree.find_one("baz")
# ("bar", 1)
tree.find_one("abc", max_edits=1)
# None
levenshtein("kitten", "sitting")
# 3
The main problem can be formulated as finding the best match between a query string and a reasonably large index (e.g. the entire English dictionary, as below). The similarity between a pair of strings is described by the Levenshtein distance.
The naive solution is calculating the distance between the query and all the
choices and taking the minimum. This works comparatively well for a small
number of choices and a fast Levenshtein distance implementation like
rapidfuzz
. However, if the index is large and relatively static (many queries
and few or no changes to the choices) then the necessary computation can be
significantly reduced.
One solution is constructing a BK-tree
for the set of choices. The triangle inequality helps reduce the necessary
distance calculations by limiting the search. In practice, the performance
seems to heavily depend on the insertion order, and is significantly helped by
having a low max_edits
value.
A better option is using a trie as an
index; the best match can be found by traversing it with a Levenshtein
Automaton that allows us to exclude subtries which cannot contain a
sufficiently good match. This should allow us to remove unnecessary distance
calculations much more effectively, and can also take advantage of setting
max_edits
.
The above is combined with a reasonably performant implementation of Levenshtein distance, using the bitvector algorithm by Myers for strings up to 64 characters long.
Using rapidfuzz as a reference. Results of running test.py
with Python
3.11 on a Mac mini 2020 (M1, 16GB RAM).
Taking a dictionary (235,976 words) as index and every 1000th as a query:
assrs.Trie.find_one
: 0.98msassrs.Trie.find_one(..., max_edits=3)
: 0.43msassrs.BKTree.find_one
: 5.54msassrs.BKTree.find_one(..., max_edits=3)
: 2.92msassrs.levenshtein_extract
: 9.39msassrs.levenshtein
in a Python loop: 32.02msrapidfuzz.process.extractOne(..., scorer=rapidfuzz.distance.Levenshtein.distance)
: 4.08msrapidfuzz.distance.Levenshtein.distance
in a Python loop: 44.20msHowever, with 100,000 random strings of length 10 as index and querying random strings:
assrs.Trie.find_one
: 17.60msassrs.Trie.find_one(..., max_edits=3)
: 5.39msassrs.BKTree.find_one
: 10.14msassrs.BKTree.find_one(..., max_edits=3)
: 10.14msassrs.levenshtein_extract
: 6.81msassrs.levenshtein
in a Python loop: 13.94msrapidfuzz.process.extractOne(..., scorer=rapidfuzz.distance.Levenshtein.distance)
: 4.21msrapidfuzz.distance.Levenshtein.distance
in a Python loop: 18.07msThe tree based structures have a significant advantage if the index is
relatively low entropy, like a dictionary of words from a natural language.
However, a random set of strings causes especially poor performance for tries
due to the excessive branching (e.g. considering that 62^3 is 238,328, it is
highly likely that the number of explored nodes is roughly the same order of
magnitude as the size of the index), and limits the benefits from the structure
of the index. Instead, the overhead from traversing the tree, and extra
distance calculations, can mean they are slower than straightforwardly
iterating through the list of choices. Regardless, using a Trie
with a
max_edits
limit remains competitive even in the worst-case scenario and
offers a significant speedup in case of a nicer index.
The difference between assrs.levenshtein_extract
and
rapidfuzz.process.extractOne
(that notably disappears when the corresponding
distance functions are called in a Python loop) seems likely attributable to
this library not using SIMD operations.
Currently missing features and known issues:
FAQs
Approximate string search
We found that assrs demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.
Did you know?
Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.
Security News
Fluent Assertions is facing backlash after dropping the Apache license for a commercial model, leaving users blindsided and questioning contributor rights.
Research
Security News
Socket researchers uncover the risks of a malicious Python package targeting Discord developers.
Security News
The UK is proposing a bold ban on ransomware payments by public entities to disrupt cybercrime, protect critical services, and lead global cybersecurity efforts.