
String grouper contains functions to do string matching using TF-IDF and cosine similarity.
The image displayed above is a visualization of the graph structure of one of the groups of strings found by string_grouper. Each circle (node) represents a string, and each connecting arc (edge) represents a match between a pair of strings with a similarity score above a given threshold score (here 0.8).
The centroid of the group, as determined by string_grouper (see tutorials/group_representatives.md for an explanation), is the largest node, which also has the most edges originating from it. A thick line in the image denotes a strong similarity between the nodes at its ends, while a faint thin line denotes weak similarity.
The power of string_grouper is discernible from this image: in large datasets, string_grouper is often able to resolve indirect associations between strings even when direct matches between those strings cannot be computed using conventional methods with a lower threshold similarity score, for example because of memory limitations.
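The idea of indirect association can be illustrated with a small sketch: two strings end up in the same group if a chain of matches above the threshold connects them, even when their direct score is below it. The function `group_matches`, the sample pairs, and the rule of picking the node with the most edges as representative are assumptions made for illustration; they are not string_grouper's actual implementation.

```python
from collections import defaultdict


def group_matches(pairs, threshold=0.8):
    """Group strings connected by any chain of matches scoring >= threshold."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    degree = defaultdict(int)  # number of qualifying edges per string
    for left, right, score in pairs:
        root_left, root_right = find(left), find(right)
        if score >= threshold and left != right:
            parent[root_left] = root_right
            degree[left] += 1
            degree[right] += 1

    members_by_root = defaultdict(list)
    for s in list(parent):
        members_by_root[find(s)].append(s)

    groups = {}
    for members in members_by_root.values():
        members.sort()
        # Assumed rule: the representative is the member with the most edges
        # (ties resolved alphabetically).
        representative = min(members, key=lambda m: -degree[m])
        groups[representative] = members
    return groups


# Hypothetical match scores, made up for this example.
pairs = [
    ("ACME INC", "ACME INC.", 0.95),
    ("ACME INC.", "ACME INCORPORATED", 0.85),
    ("ACME INC", "ACME INCORPORATED", 0.60),  # below threshold, linked indirectly
    ("ZETA LLC", "ZETA LLC", 1.0),
]
groups = group_matches(pairs)
```

Here "ACME INC" and "ACME INCORPORATED" score only 0.60 against each other, yet they land in the same group because both match "ACME INC." above the threshold.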
This image was designed using the graph-visualization software Gephi 0.9.2 with data generated by string_grouper operating on the sec__edgar_company_info.csv sample data file.
string_grouper is a library that makes finding groups of similar strings within a single list, or across multiple lists, of strings easy and fast. string_grouper uses TF-IDF to calculate cosine similarities within a single list or between two lists of strings. The full process is described in the blog Super Fast String Matching in Python.
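To make the TF-IDF and cosine-similarity idea concrete, here is a minimal pure-Python sketch. It assumes character 3-grams as the tokens and a smoothed IDF formula; both choices, and the helper names `ngrams`, `tfidf_vectors` and `cosine`, are illustrative assumptions and not a description of string_grouper's internals, which rely on sparse matrix operations for speed.

```python
import math
from collections import Counter


def ngrams(text, n=3):
    """Split a string into overlapping character n-grams (assumed tokenization)."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(strings, n=3):
    """Return one L2-normalized TF-IDF vector (as a dict) per input string."""
    docs = [Counter(ngrams(s, n)) for s in strings]
    # Document frequency: in how many strings each n-gram occurs.
    df = Counter(g for d in docs for g in d)
    total = len(strings)
    vectors = []
    for d in docs:
        # Smoothed IDF (an assumed convention): ln((1 + N) / (1 + df)) + 1.
        vec = {g: tf * (math.log((1 + total) / (1 + df[g])) + 1)
               for g, tf in d.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({g: w / norm for g, w in vec.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two normalized sparse vectors (their dot product)."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(g, 0.0) for g, w in u.items())


# Hypothetical names, made up for this example.
names = ["ACME INC", "ACME INC.", "ZETA LLC"]
vecs = tfidf_vectors(names)
sim_close = cosine(vecs[0], vecs[1])  # near-duplicates share most 3-grams
sim_far = cosine(vecs[0], vecs[2])    # no shared 3-grams
```

Because the vectors are normalized up front, the cosine similarity reduces to a dot product, which is what makes a sparse matrix multiplication a natural fit for computing all pairs at once.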
pip install string-grouper
string_grouper leverages the blazingly fast sparse_dot_topn library to calculate cosine similarities.
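import datetime
from string_grouper import match_strings

# names: a DataFrame with a 'Company Name' column (loaded beforehand)
start = datetime.datetime.now()
matches = match_strings(names['Company Name'], number_of_processes=4)
end = datetime.datetime.now()
str(end - start)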
Results in:
00:05:34.65
This was measured on an Intel i7-6500U CPU @ 2.50GHz, where len(names) = 663 000. In other words, the library is able to perform fuzzy matching of 663 000 names in five and a half minutes on a 2015 consumer CPU using 4 cores.
import pandas as pd
from string_grouper import match_strings
company_names = 'sec__edgar_company_info.csv'
companies = pd.read_csv(company_names)
# Create all matches:
matches = match_strings(companies['Company Name'])
# Look at only the non-exact matches:
matches[matches['left_Company Name'] != matches['right_Company Name']].head()
| | left_index | left_Company Name | similarity | right_Company Name | right_index |
|---|---|---|---|---|---|
| 15 | 14 | 0210, LLC | 0.870291 | 90210 LLC | 4211 |
| 167 | 165 | 1 800 MUTUALS ADVISOR SERIES | 0.931615 | 1 800 MUTUALS ADVISORS SERIES | 166 |
| 168 | 166 | 1 800 MUTUALS ADVISORS SERIES | 0.931615 | 1 800 MUTUALS ADVISOR SERIES | 165 |
| 172 | 168 | 1 800 RADIATOR FRANCHISE INC | 1 | 1-800-RADIATOR FRANCHISE INC. | 201 |
| 178 | 173 | 1 FINANCIAL MARKETPLACE SECURITIES LLC /BD | 0.949364 | 1 FINANCIAL MARKETPLACE SECURITIES, LLC | 174 |
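from string_grouper import group_similar_strings
# Assign each name to a group, then count the members of each group:
companies[["group-id", "name_deduped"]] = group_similar_strings(companies['Company Name'])
companies.groupby('name_deduped')['Line Number'].count().sort_values(ascending=False).head(10)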
name_deduped | Line Number |
---|---|
ADVISORS DISCIPLINED TRUST | 1747 |
NUVEEN TAX EXEMPT UNIT TRUST SERIES 1 | 916 |
GUGGENHEIM DEFINED PORTFOLIOS, SERIES 1200 | 652 |
U S TECHNOLOGIES INC | 632 |
CAPITAL MANAGEMENT LLC | 628 |
CLAYMORE SECURITIES DEFINED PORTFOLIOS, SERIES 200 | 611 |
E ACQUISITION CORP | 561 |
CAPITAL PARTNERS LP | 561 |
FIRST TRUST COMBINED SERIES 1 | 560 |
PRINCIPAL LIFE INCOME FUNDINGS TRUST 20 | 544 |
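The counting step behind a table like this can be sketched without pandas: once every name has been assigned a group representative, the group sizes are just a tally of the representatives. The `assignments` list below is hypothetical data invented for illustration, not output from string_grouper.

```python
from collections import Counter

# Hypothetical (original name, group representative) pairs.
assignments = [
    ("ACME INC", "ACME INC."),
    ("ACME INC.", "ACME INC."),
    ("ACME INCORPORATED", "ACME INC."),
    ("ZETA LLC", "ZETA LLC"),
]

# Tally how many original names map to each representative.
group_sizes = Counter(rep for _, rep in assignments)

# The ten largest groups, largest first.
top_groups = group_sizes.most_common(10)
```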
The documentation can be found here