A library for fast approximate string matching using Jaro and Jaro-Winkler similarity.
>>> from jarowinkler import *
>>> jaro_similarity("Johnathan", "Jonathan")
0.8796296296296297
>>> jarowinkler_similarity("Johnathan", "Jonathan")
0.9037037037037037
The implementation is based on a novel approach that calculates the Jaro-Winkler similarity using bit-parallelism, which is significantly faster than the original approach used in other libraries. The project's benchmark compares its performance with jellyfish and python-Levenshtein.
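For readers who want to see what is being computed, here is a minimal pure-Python sketch of the classic Jaro and Jaro-Winkler definitions. It is not the library's bit-parallel implementation. The names `jaro_reference` and `jarowinkler_reference`, the prefix weight of 0.1, the prefix limit of 4, and the 0.7 boost threshold are assumptions taken from the textbook formulation, not from this project's source.

```python
def jaro_reference(s1, s2):
    """Classic Jaro similarity with plain loops (hypothetical reference, not bit-parallel)."""
    len1, len2 = len(s1), len(s2)
    if len1 == 0 and len2 == 0:
        return 1.0  # assumption: two empty sequences count as identical
    if len1 == 0 or len2 == 0:
        return 0.0
    # Items match only if they are equal and at most `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i in range(len1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s1[i] == s2[j]:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched items that line up in a different order count as half a transposition each.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jarowinkler_reference(s1, s2, prefix_weight=0.1):
    """Jaro-Winkler: boost the Jaro score for a shared prefix of up to 4 items."""
    sim = jaro_reference(s1, s2)
    if sim <= 0.7:  # assumption: classic boost threshold
        return sim
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return sim + prefix * prefix_weight * (1.0 - sim)
```

For "Johnathan" and "Jonathan" this sketch finds 8 matches and 2 transpositions, which gives roughly 0.8796 for Jaro and roughly 0.9037 after the prefix boost for the shared "Jo".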
You can install this library from PyPI with pip:
pip install jarowinkler
JaroWinkler provides binary wheels for all common platforms.
For a source build (for example from an sdist package) you only need a C++14-compatible compiler. You can also install directly from GitHub:
pip install git+https://github.com/maxbachmann/JaroWinkler.git@main
All algorithms in JaroWinkler work not only with strings but with arbitrary sequences of hashable objects:
from jarowinkler import jarowinkler_similarity
jarowinkler_similarity("this is an example".split(), ["this", "is", "a", "example"])
# 0.8666666666666667
As long as two objects have the same hash, they are treated as the same element. You can provide a __hash__ method for your own object instances:
class MyObject:
    def __init__(self, hash):
        self.hash = hash

    def __hash__(self):
        return self.hash

jarowinkler_similarity([MyObject(1), MyObject(2)], [MyObject(1), MyObject(2), MyObject(3)])
# 0.9111111111111111
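The following sketch illustrates why the hash matters here, assuming (as the text above states) that elements are matched by hash. The class `Token` and the helper `hash_view` are hypothetical names used only for illustration. Two distinct instances that return the same hash would line up, even though Python's default `==` considers them different.

```python
class Token:
    """Hypothetical element type that defines only __hash__, like MyObject above."""

    def __init__(self, key):
        self.key = key

    def __hash__(self):
        return self.key


a, b = Token(1), Token(1)
same_hash = hash(a) == hash(b)   # True: both instances hash to 1
equal_by_default = a == b        # False: default == compares identity


def hash_view(seq):
    """Return the sequence of hashes, i.e. what a hash-based comparison would see."""
    return [hash(item) for item in seq]


print(hash_view([Token(1), Token(2)]), hash_view([Token(1), Token(2), Token(3)]))
```

Seen this way, the example above compares the hash sequences `[1, 2]` and `[1, 2, 3]`.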
All algorithms provide a score_cutoff parameter, which can be used to filter out bad matches: a score below the cutoff is returned as 0.0. Internally, this allows JaroWinkler to select faster implementations in some places:
jaro_similarity("Johnathan", "Jonathan", score_cutoff=0.9)
# 0.0
jaro_similarity("Johnathan", "Jonathan", score_cutoff=0.85)
# 0.8796296296296297
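The cutoff behaviour shown above can be sketched in plain Python. The helper names below are hypothetical, and the length-based shortcut is only one example of how a cutoff could save work. It is an assumption for illustration, not a description of the library's internals.

```python
def apply_score_cutoff(score, score_cutoff=None):
    """Return the score unchanged, or 0.0 if it falls below the cutoff."""
    if score_cutoff is not None and score < score_cutoff:
        return 0.0
    return score


def jaro_upper_bound(len1, len2):
    """Best possible Jaro score for two sequences of the given lengths.

    The maximum is reached when every item of the shorter sequence matches
    and there are no transpositions.
    """
    if len1 == 0 or len2 == 0:
        return 1.0 if len1 == len2 else 0.0
    shorter = min(len1, len2)
    return (shorter / len1 + shorter / len2 + 1.0) / 3


def can_skip(len1, len2, score_cutoff):
    """True if the lengths alone prove the score cannot reach the cutoff."""
    return jaro_upper_bound(len1, len2) < score_cutoff


print(apply_score_cutoff(0.8796296296296297, 0.9))   # 0.0
print(apply_score_cutoff(0.8796296296296297, 0.85))  # 0.8796296296296297
print(can_skip(20, 5, 0.9))                          # True: bound is 0.75
```

With a check like `can_skip`, a pair whose lengths differ too much could be rejected without running the full comparison.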
JaroWinkler can be used with RapidFuzz, which provides multiple methods to compute string metrics on collections of inputs. JaroWinkler implements the RapidFuzz C-API, which allows RapidFuzz to call its functions without the usual overhead of Python, making this even faster.
from jarowinkler import jarowinkler_similarity
from rapidfuzz import process
process.cdist(["Johnathan", "Jonathan"], ["Johnathan", "Jonathan"], scorer=jarowinkler_similarity)
# array([[1.       , 0.9037037],
#        [0.9037037, 1.       ]], dtype=float32)
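To make the shape of that result concrete, here is a minimal pure-Python sketch of a pairwise score matrix: entry `[i][j]` is the scorer applied to the i-th query and the j-th choice. The names `pairwise_scores` and `exact_match` are hypothetical, and the trivial scorer is only a stand-in so the block runs without any third-party package.

```python
def pairwise_scores(queries, choices, scorer):
    """Build a matrix where row i, column j holds scorer(queries[i], choices[j])."""
    return [[scorer(q, c) for c in choices] for q in queries]


def exact_match(a, b):
    """Stand-in scorer (assumption): 1.0 for identical inputs, 0.0 otherwise."""
    return 1.0 if a == b else 0.0


names = ["Johnathan", "Jonathan"]
matrix = pairwise_scores(names, names, exact_match)
print(matrix)  # [[1.0, 0.0], [0.0, 1.0]]
```

With a real similarity scorer in place of `exact_match`, the off-diagonal entries would hold the similarity between the two different names, as in the output above.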
PRs are welcome!
Thank you ❤️
Copyright 2021 - present maxbachmann. JaroWinkler is free and open-source software licensed under the MIT License.