
`tfidf_matcher` is a package for fuzzy matching large datasets together. Most fuzzy matching libraries like `fuzzywuzzy` get great results, but don't scale well due to their O(n^2) complexity.
This package provides two functions:
- `ngrams()`: Simple ngram generator.
- `matcher()`: Matches a list of strings against a reference corpus. Does this by:
  - [...] `cosine_similarity` function in sklearn is very memory-inefficient for our use case).
  - [...] `k_matches` closest matches.

Define two lists; your original list (list you want matches for) and
your lookup list (list you want to match against). Typically your lookup list will be much longer than your original list. Pass them into the `matcher` function along with the number of matches you want to display from the lookup list using the `k_matches` argument. The result will be a pandas DataFrame containing 1 row per item in your original list, along with `k_matches` columns containing the closest matches from the lookup list, and a match score for the closest match (which is 1 - the cosine distance between the matches, normalised to [0,1]).
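To make the match score concrete, here is a minimal pure-Python sketch of the idea: split strings into character n-grams, weight them with TF-IDF fitted on the lookup list, and rank lookup entries by cosine similarity (that is, 1 - cosine distance). This is an illustration under assumptions, not the package's implementation: the names `char_ngrams`, `vectorise` and `top_k_matches` are hypothetical, the smoothed IDF formula is one common choice, and the brute-force comparison stands in for the nearest-neighbour search a real implementation would use to scale.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Overlapping character n-grams of a lowercased string (hypothetical helper)."""
    text = text.lower()
    return [text[i:i + n] for i in range(max(len(text) - n + 1, 1))]


def vectorise(docs, idf, n=3):
    """Turn each string into an L2-normalised sparse TF-IDF vector (a dict)."""
    vectors = []
    for doc in docs:
        counts = Counter(char_ngrams(doc, n))
        # n-grams never seen in the lookup list have no IDF and are dropped.
        vec = {g: c * idf[g] for g, c in counts.items() if g in idf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({g: w / norm for g, w in vec.items()} if norm else {})
    return vectors


def top_k_matches(original, lookup, k_matches=2, n=3):
    """Return, per original string, the k closest lookup strings with scores."""
    n_docs = len(lookup)
    doc_freq = Counter(g for doc in lookup for g in set(char_ngrams(doc, n)))
    # Smoothed IDF (an assumption): rarer n-grams get a higher weight.
    idf = {g: math.log((1 + n_docs) / (1 + df)) + 1 for g, df in doc_freq.items()}
    lookup_vecs = vectorise(lookup, idf, n)

    results = []
    for vec in vectorise(original, idf, n):
        # Vectors are unit length, so the dot product is the cosine similarity.
        scored = [
            (sum(w * lv.get(g, 0.0) for g, w in vec.items()), name)
            for lv, name in zip(lookup_vecs, lookup)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        results.append([(name, round(score, 3)) for score, name in scored[:k_matches]])
    return results


lookup = ["Microsoft Corporation", "Apple Inc", "Alphabet Inc"]
original = ["Microsoft Corp", "Apple Incorporated"]
matches = top_k_matches(original, lookup, k_matches=2)
for query, found in zip(original, matches):
    print(query, "->", found)
```

Each printed row lists the closest lookup entries first, with a score between 0 and 1. Because TF-IDF weights here are non-negative, cosine similarity cannot go below 0.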
Simply import with `import tfidf_matcher as tm`, and call the matcher function with `tm.matcher()`. It takes the following arguments:
- `original`: List of strings you want to match.
- `lookup`: List of strings you want to match against.
- `k_matches`: Number of the closest results from `lookup` to return (1 per column).
- `ngram_length`: Length of `ngrams` used in the algorithm. Anecdotal testing shows 2 or 3 to be optimal, but feel free to tinker.

For the method, thank Josh Taylor and Chris van den Berg. I wanted to adapt the methods to work nicely on a company matching problem I was having, and decided to build out my resultant code into a package for two reasons:
I understand the algorithms behind k-Nearest Neighbours & TF-IDF Vectorisation, but it was through implementing the ideas in the blogs linked that I was able to build this project out.
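Since TF-IDF vectorisation here operates on n-grams, it may help to see what the `ngram_length` argument changes. The sketch below is a plain character n-gram generator; `char_ngrams` is a hypothetical name, and the package's own `ngrams()` function may clean or pad strings differently.

```python
def char_ngrams(text, ngram_length=3):
    """Return the overlapping character n-grams of text, in order."""
    if ngram_length < 1:
        raise ValueError("ngram_length must be at least 1")
    # A string shorter than ngram_length yields an empty list.
    return [text[i:i + ngram_length] for i in range(len(text) - ngram_length + 1)]


print(char_ngrams("apple", 2))  # ['ap', 'pp', 'pl', 'le']
print(char_ngrams("apple", 3))  # ['app', 'ppl', 'ple']
```

Shorter n-grams produce more overlap between loosely similar strings, while longer ones are stricter, which is one way to read the anecdotal preference for 2 or 3.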
FAQs
A small package that enables super-fast TF-IDF based string matching.
We found that tfidf-matcher demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.