string-grouper

String grouper contains functions to do string matching using TF-IDF and cosine similarity.

Version 0.7.1 on PyPI, 1 maintainer

String Grouper


[Image: graph visualization of one group of similar strings found by string_grouper]

The image displayed above is a visualization of the graph-structure of one of the groups of strings found by string_grouper. Each circle (node) represents a string, and each connecting arc (edge) represents a match between a pair of strings with a similarity score above a given threshold score (here 0.8).

The centroid of the group, as determined by string_grouper (see tutorials/group_representatives.md for an explanation), is the largest node, also with the most edges originating from it. A thick line in the image denotes a strong similarity between the nodes at its ends, while a faint thin line denotes weak similarity.

The power of string_grouper is discernible from this image: in large datasets, string_grouper can often resolve indirect associations between strings even when direct matches between those strings cannot be computed by conventional methods at a lower similarity threshold, for example because of memory constraints.
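To make the idea of indirect associations concrete, here is a minimal sketch that treats above-threshold matches as edges of a graph and groups strings by connected component. It is an illustration only, not string_grouper's actual implementation; the function name `group_by_matches` and the sample names and pairs are hypothetical.

```python
from collections import defaultdict


def group_by_matches(strings, matches):
    """Group strings into connected components of the match graph.

    `matches` is an iterable of (i, j) index pairs whose similarity
    already exceeds the chosen threshold.
    """
    parent = list(range(len(strings)))

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in matches:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    groups = defaultdict(list)
    for i, s in enumerate(strings):
        groups[find(i)].append(s)
    return sorted(groups.values())


names = ["ACME INC", "ACME INC.", "ACME INCORPORATED", "ZETA LLC"]
# Hypothetical matches: 0~1 and 1~2 pass the threshold, 0~2 does not.
pairs = [(0, 1), (1, 2)]
groups = group_by_matches(names, pairs)
```

Here "ACME INC" and "ACME INCORPORATED" end up in the same group even though they were never matched directly, because both match "ACME INC.".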

———

This image was designed using the graph-visualization software Gephi 0.9.2 with data generated by string_grouper operating on the sec__edgar_company_info.csv sample data file.

string_grouper is a library that makes finding groups of similar strings, within a single list or across multiple lists of strings, easy and fast. string_grouper uses tf-idf to calculate cosine similarities within a single list or between two lists of strings. The full process is described in the blog post Super Fast String Matching in Python.
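As a rough illustration of tf-idf combined with cosine similarity, the sketch below builds tf-idf vectors in plain Python and compares them. It assumes character trigrams as tokens and a smoothed idf; both are choices made for this example, and the helper names (`char_ngrams`, `tfidf_vectors`, `cosine`) are hypothetical rather than part of string_grouper's API. The library itself relies on sparse matrices for speed.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lowercased string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(strings, n=3):
    """Build L2-normalised tf-idf vectors (as dicts) over character n-grams."""
    counts = [Counter(char_ngrams(s, n)) for s in strings]
    # Document frequency: in how many strings each n-gram appears.
    doc_freq = Counter(g for c in counts for g in c)
    total = len(strings)
    vectors = []
    for c in counts:
        # Smoothed idf, so n-grams present in every string keep a small weight.
        vec = {
            g: tf * (math.log((1 + total) / (1 + doc_freq[g])) + 1)
            for g, tf in c.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({g: w / norm for g, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """For unit-length sparse vectors, cosine similarity is the dot product."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(g, 0.0) for g, w in u.items())


names = ["Mega Corp Inc", "Mega Corp Inc.", "Tiny Shop LLC"]
vecs = tfidf_vectors(names)
sim_close = cosine(vecs[0], vecs[1])  # near-duplicate names
sim_far = cosine(vecs[0], vecs[2])    # names with no trigram in common
```

Near-duplicate names share most of their trigrams and score close to 1, while names with no trigram in common score 0.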

Installing

pip install string-grouper

Speed

string_grouper leverages the blazingly fast sparse_dot_topn library to calculate cosine similarities.

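import datetime
from string_grouper import match_strings

# `names` is assumed to be a DataFrame with a 'Company Name' column
s = datetime.datetime.now()
matches = match_strings(names['Company Name'], number_of_processes=4)
e = datetime.datetime.now()

diff = e - s
str(diff)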

Results in:

00:05:34.65 On an Intel i7-6500U CPU @ 2.50GHz, where len(names) = 663 000

In other words, the library performs fuzzy matching of 663 000 names in about five and a half minutes on a 2015 consumer CPU using 4 processes.

Simple Match

import pandas as pd
from string_grouper import match_strings

company_names = 'sec__edgar_company_info.csv'
companies = pd.read_csv(company_names)
# Create all matches:
matches = match_strings(companies['Company Name'])
# Look at only the non-exact matches:
matches[matches['left_Company Name'] != matches['right_Company Name']].head()

| | left_index | left_Company Name | similarity | right_Company Name | right_index |
|---|---|---|---|---|---|
| 15 | 14 | 0210, LLC | 0.870291 | 90210 LLC | 4211 |
| 167 | 165 | 1 800 MUTUALS ADVISOR SERIES | 0.931615 | 1 800 MUTUALS ADVISORS SERIES | 166 |
| 168 | 166 | 1 800 MUTUALS ADVISORS SERIES | 0.931615 | 1 800 MUTUALS ADVISOR SERIES | 165 |
| 172 | 168 | 1 800 RADIATOR FRANCHISE INC | 1 | 1-800-RADIATOR FRANCHISE INC. | 201 |
| 178 | 173 | 1 FINANCIAL MARKETPLACE SECURITIES LLC /BD | 0.949364 | 1 FINANCIAL MARKETPLACE SECURITIES, LLC | 174 |

Group Similar Strings and Find most Common

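from string_grouper import group_similar_strings
# Attach a group ID and representative name to each row, then count the ten largest groups:
companies[["group-id", "name_deduped"]] = group_similar_strings(companies['Company Name'])
companies.groupby('name_deduped')['Line Number'].count().sort_values(ascending=False).head(10)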

| name_deduped | Line Number |
|---|---|
| ADVISORS DISCIPLINED TRUST | 1747 |
| NUVEEN TAX EXEMPT UNIT TRUST SERIES 1 | 916 |
| GUGGENHEIM DEFINED PORTFOLIOS, SERIES 1200 | 652 |
| U S TECHNOLOGIES INC | 632 |
| CAPITAL MANAGEMENT LLC | 628 |
| CLAYMORE SECURITIES DEFINED PORTFOLIOS, SERIES 200 | 611 |
| E ACQUISITION CORP | 561 |
| CAPITAL PARTNERS LP | 561 |
| FIRST TRUST COMBINED SERIES 1 | 560 |
| PRINCIPAL LIFE INCOME FUNDINGS TRUST 20 | 544 |
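
The counting step can also be sketched without pandas. The example below assumes each original name has already been mapped to a group representative; the rows are made-up data standing in for the 'Company Name' and 'name_deduped' columns, not values from the sample file.

```python
from collections import Counter

# Hypothetical (original name, group representative) pairs.
rows = [
    ("ACME INC", "ACME INC"),
    ("ACME INC.", "ACME INC"),
    ("ACME, INC", "ACME INC"),
    ("ZETA LLC", "ZETA LLC"),
    ("ZETA L.L.C.", "ZETA LLC"),
    ("SOLO CORP", "SOLO CORP"),
]

# Count how many original names fall under each representative,
# then keep the two largest groups.
group_sizes = Counter(rep for _, rep in rows)
top_groups = group_sizes.most_common(2)
```

This mirrors what the groupby, count and sort chain does on the deduplicated column.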

Documentation

The documentation can be found here
