A library for fast approximate string matching using Jaro and Jaro-Winkler similarity.
>>> from jarowinkler import *
>>> jaro_similarity("Johnathan", "Jonathan")
0.8796296296296297
>>> jarowinkler_similarity("Johnathan", "Jonathan")
0.9037037037037037
The implementation is based on a novel approach to calculating the Jaro-Winkler similarity using bit parallelism, which is significantly faster than the original approach used in other libraries. The project's benchmark compares its performance against jellyfish and python-Levenshtein.
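For reference, the classic definition of both metrics can be sketched in pure Python. This is not the library's bit-parallel implementation, only the straightforward algorithm it is compared against. The function names, the 0.7 boost threshold, the 4-element prefix limit, and the choice to score two empty inputs as 1.0 are assumptions of this sketch (they follow common convention), and elements are compared with == rather than by hash.

```python
def jaro_similarity_reference(s1, s2):
    """Classic Jaro similarity for two sequences (hypothetical reference helper)."""
    len1, len2 = len(s1), len(s2)
    if len1 == 0 and len2 == 0:
        return 1.0  # convention chosen for this sketch
    if len1 == 0 or len2 == 0:
        return 0.0

    # Elements only count as matching if they are at most `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i in range(len1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s1[i] == s2[j]:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Matched elements that appear in a different order count as half a transposition each.
    seq1 = [s1[i] for i in range(len1) if matched1[i]]
    seq2 = [s2[j] for j in range(len2) if matched2[j]]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3.0


def jarowinkler_similarity_reference(s1, s2, prefix_weight=0.1):
    """Jaro similarity plus the Winkler bonus for a shared prefix (up to 4 elements)."""
    sim = jaro_similarity_reference(s1, s2)
    if sim > 0.7:  # assumed boost threshold
        prefix = 0
        for a, b in zip(s1[:4], s2[:4]):
            if a != b:
                break
            prefix += 1
        sim += prefix * prefix_weight * (1.0 - sim)
    return sim
```

For "Johnathan" and "Jonathan" this sketch finds 8 matches and 2 transpositions, giving (8/9 + 8/8 + 6/8) / 3, which agrees with the 0.8796... shown above. The shared prefix "Jo" then lifts the Jaro-Winkler score to about 0.9037.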
You can install this library from PyPI with pip:
pip install jarowinkler
JaroWinkler provides binary wheels for all common platforms.
For a source build (for example, from an sdist package), you only need a C++14-compatible compiler. You can also install directly from GitHub:
pip install git+https://github.com/maxbachmann/JaroWinkler.git@main
All algorithms in JaroWinkler can be used not only with strings but with arbitrary sequences of hashable objects:
from jarowinkler import jarowinkler_similarity
jarowinkler_similarity("this is an example".split(), ["this", "is", "a", "example"])
# 0.8666666666666667
As long as two objects have the same hash, they are treated as the same element. You can provide a __hash__ method for your own object instances:
class MyObject:
    def __init__(self, hash):
        self.hash = hash

    def __hash__(self):
        return self.hash

jarowinkler_similarity([MyObject(1), MyObject(2)], [MyObject(1), MyObject(2), MyObject(3)])
# 0.9111111111111111
All algorithms provide a score_cutoff parameter, which can be used to filter out bad matches: a similarity below score_cutoff is returned as 0.0. Internally, this allows JaroWinkler to select faster implementations in some places:
jaro_similarity("Johnathan", "Jonathan", score_cutoff=0.9)
# 0.0
jaro_similarity("Johnathan", "Jonathan", score_cutoff=0.85)
# 0.8796296296296297
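One way to use score_cutoff when comparing a query against several candidates is sketched below. The helper name filter_matches and its parameters are hypothetical and not part of JaroWinkler. It only assumes a scorer that accepts a score_cutoff keyword and returns 0.0 for pairs below it, as jaro_similarity does in the example above.

```python
def filter_matches(query, choices, scorer, score_cutoff=0.85):
    """Return (choice, score) pairs that reach score_cutoff, best score first.

    `scorer` is any callable taking (a, b, score_cutoff=...) and returning
    0.0 for pairs that fall below the cutoff.
    """
    results = []
    for choice in choices:
        score = scorer(query, choice, score_cutoff=score_cutoff)
        # Pairs below the cutoff come back as 0.0 and are dropped here.
        if score >= score_cutoff:
            results.append((choice, score))
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results

# Intended usage, assuming jarowinkler is installed:
#   from jarowinkler import jaro_similarity
#   filter_matches("Johnathan", ["Jonathan", "Nathan"], jaro_similarity)
```

Passing the cutoff through to the scorer, rather than filtering afterwards on an uncut score, is what lets the library take its faster paths for poor matches.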
JaroWinkler can be used with RapidFuzz, which provides multiple methods for computing string metrics on collections of inputs. JaroWinkler implements the RapidFuzz C-API, which lets RapidFuzz call its functions without the usual Python call overhead, making this even faster.
from jarowinkler import jarowinkler_similarity
from rapidfuzz import process
process.cdist(["Johnathan", "Jonathan"], ["Johnathan", "Jonathan"], scorer=jarowinkler_similarity)
# array([[1.       , 0.9037037],
#        [0.9037037, 1.       ]], dtype=float32)
PRs are welcome!
Thank you :heart:
Copyright 2021 - present maxbachmann. JaroWinkler is free and open-source software licensed under the MIT License.