
Welcome to Leven-Search, a library designed for efficient and fast searching of words within a specified Levenshtein distance.
It is aimed at Kaggle developers, researchers, and anyone else working in natural language processing, text analysis, or similar domains where the closeness of strings matters.
Levenshtein distance measures the difference between two sequences. In the context of strings, it is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into another.
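The definition above can be computed with the classic dynamic-programming recurrence. The following is a minimal sketch for illustration only; the function name `levenshtein` is hypothetical and is not part of the leven-search API, and nothing here is meant to describe how the library works internally.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca from a
                curr[j - 1] + 1,           # insert cb into a
                prev[j - 1] + (ca != cb),  # substitute, or free match
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
```

Only two rows of the table are kept at a time, so memory use grows with the length of `b` rather than with the product of both lengths.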
For example, the Levenshtein distance between "table" and "marble" is 2:
table → mable (substitution of t with m)
mable → marble (insertion of r)
The library is designed with the following goals in mind:
Example performance of the library on the Brown corpus (only words longer than 2 characters), measured on a modern laptop:
Distance | Time per 1000 searches (in seconds) |
---|---|
0 | 0.0146 |
1 | 0.3933 |
1 (*) | 0.4154 |
2 | 7.9556 |
(*) with the per-letter cost granularity
To install the library, simply run:
pip install leven-search
First, import the library:
import leven_search as lev
Then, create a LevenSearch object:
searcher = lev.LevenSearch()
Next, add words to the searcher:
searcher.insert("hello")
searcher.insert("world")
Finally, search for words within a specified Levenshtein distance:
searcher.find_dist("mello", 1)
Result:
hello: ResultItem(word='hello', dist=1, updates=[m -> h])
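To give a feel for what a bounded-distance lookup like the one above involves, here is a sketch of one common technique: store the words in a trie and carry one row of the edit-distance table down each branch, abandoning a branch once every entry in the row exceeds the limit. This is an assumption-laden illustration, not leven-search's actual implementation; `TrieNode` and `TrieSearch` are hypothetical names.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None  # set only on nodes that end an inserted word


class TrieSearch:
    """Hypothetical helper for illustration; not part of leven-search."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.word = word

    def find(self, query, max_dist):
        # First row: distance from the empty prefix to each prefix of query.
        first_row = list(range(len(query) + 1))
        results = {}
        if self.root.word is not None and first_row[-1] <= max_dist:
            results[self.root.word] = first_row[-1]
        for ch, child in self.root.children.items():
            self._walk(child, ch, query, first_row, max_dist, results)
        return results

    def _walk(self, node, ch, query, prev_row, max_dist, results):
        # Extend the table by one row for the trie character ch.
        row = [prev_row[0] + 1]
        for j in range(1, len(query) + 1):
            row.append(min(
                row[j - 1] + 1,                          # skip a query character
                prev_row[j] + 1,                         # skip the trie character
                prev_row[j - 1] + (query[j - 1] != ch),  # substitute or match
            ))
        if node.word is not None and row[-1] <= max_dist:
            results[node.word] = row[-1]
        # Prune: no descendant can get below the smallest entry in this row.
        if min(row) <= max_dist:
            for next_ch, child in node.children.items():
                self._walk(child, next_ch, query, row, max_dist, results)


if __name__ == "__main__":
    index = TrieSearch()
    for w in ("hello", "world"):
        index.insert(w)
    print(index.find("mello", 1))
```

Words that share a prefix share the table rows computed for that prefix, which is why this approach usually beats comparing the query against every word separately.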
The following example shows how to use the library to search for words in the Brown corpus:
import nltk
import leven_search as lev
# Download the Brown corpus
nltk.download('brown')
# Create a LevenSearch object
searcher = lev.LevenSearch()
for w in nltk.corpus.brown.words():
    if len(w) > 2:
        searcher.insert(w)
# Search for words within a Levenshtein distance
searcher.find_dist('komputer', 1)
Result:
computer: ResultItem(word='computer', dist=1, updates=[k -> c])
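Edit costs can also be set per letter pair. In the configuration below, the default cost is 2 and substituting k with c costs 0.1:
cost = lev.GranularEditCostConfig(default_cost=2, edit_costs=[lev.EditCost('k', 'c', 0.1)])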
searcher.find_dist('komputer', 2, cost)
Result:
computer: ResultItem(word='computer', dist=0.1, updates=[k -> c])
searcher.find_dist('yomputer', 2, cost)
Result:
computer: ResultItem(word='computer', dist=2, updates=[y -> c])
searcher.find_dist('yomputer', 1, cost)
Result:
None
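The three results above follow from how per-letter costs interact with the distance limit. The sketch below reproduces the arithmetic with a plain weighted edit distance. It rests on assumptions the README does not confirm: that insertions and deletions are charged the default cost, and that a pair cost applies only in the listed direction. `weighted_distance` is a hypothetical function, not part of the leven-search API.

```python
def weighted_distance(a, b, sub_costs, default_cost=1.0):
    """Edit distance from a to b with per-pair substitution costs.

    sub_costs maps (from_char, to_char) to a cost; any other substitution,
    and (by assumption) every insertion or deletion, costs default_cost.
    """
    prev = [j * default_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * default_cost]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                sub = 0.0
            else:
                sub = sub_costs.get((ca, cb), default_cost)
            curr.append(min(
                prev[j] + default_cost,      # delete ca
                curr[j - 1] + default_cost,  # insert cb
                prev[j - 1] + sub,           # substitute or match
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    costs = {("k", "c"): 0.1}
    for query, limit in (("komputer", 2), ("yomputer", 2), ("yomputer", 1)):
        dist = weighted_distance(query, "computer", costs, default_cost=2)
        print(query, limit, dist, dist <= limit)
```

Under these assumptions, 'komputer' is within the limit at a cost of 0.1, 'yomputer' just meets a limit of 2, and 'yomputer' exceeds a limit of 1, which matches the results shown.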