Huge News!Announcing our $40M Series B led by Abstract Ventures.Learn More
Socket
Sign inDemoInstall
Socket

csvmatch

Package Overview
Dependencies
Maintainers
1
Alerts
File Explorer

Advanced tools

Socket logo

Install Socket

Detect and block malicious and high-risk dependencies

Install

csvmatch

Find fuzzy matches between CSV files.

  • 2.0.1
  • PyPI
  • Socket score

Maintainers
1

CSV Match

Find fuzzy matches between CSV files. Based on Textmatch.

Fuzzy matching is the art and science of connecting up bits of information that are written differently but represent the same thing – such as a person or company.

Installing

The best way to install is with pipx, which can be installed with Homebrew: brew install pipx. Then:

pipx install csvmatch

Getting started

The best way to approach fuzzy matching with CSV Match is to start with an exact match. From there, you can incrementally improve the results by telling CSV Match about relevant information that should be taken into account and irrelevant information that should be disregarded. Experiment with different approaches. It is helpful to know what the data looks like, and how it has been collected.

The input files can be in CSV or Parquet format. The output results will be a CSV file.

Example

data1.csv:

nameplacecodename
Sam CollinsVietnamnone
Roy BlandLondonSoldier
George SmileyLondonBeggerman
Bill HaydonLondonTailor
Perçy AllélíneLondonTinker
KretzschmarHamburgnone
Oliver LaconLondonnone
Jim PrideauxSlovakianone
Peter Guillam EsqBrixtonnone
Toby EsterhaseViennaPoorman
Connie SachsOxfordnone

data2.csv:

Person NameLocation
Maria Andreyevna OstrakovaRussia
Konny SaksOxford
Tony EsterhaseVienna
Peter GuillamBrixton
Mr Jim PrideauxCzech Republic
Lacon OliverCambridge
Claus KretzschmarHamburg
Richard BlandLondon
Roy RodgersRomania
Percy AllelineLondon
Bill-HaydonLondon
George SMILEYLondon
Roy BlandUK
Sam CollinsVietnam

To run an exact match on the name column from the first file against Person Name from the second:

$ csvmatch \
    data1.csv \
    data2.csv \
    --fields1 name \
    --fields2 'Person Name' \
    > matches.csv

The resulting matches include the two names which are written exactly the same:

nameplacecodenamePerson NameLocation
Roy BlandLondonSoldierRoy BlandUK
Sam CollinsVietnamnoneSam CollinsVietnam

Matches are many-to-many, ie. it is possible for one row in the first file to match several rows in the second, and vice-versa.

[!TIP] There is a tradeoff between false negatives and false positives – it is often better to have some incorrect matches in your results that can be manually checked afterwards than to have correct matches missing.

Usage

CSV Match only requires two arguments, the first file followed by the second. All others are optional, described below:

Fields

The fields1 and fields2 arguments accept one or more column names that should be used for the match. If the column name has a space it should be wrapped in quotes. These should be in the same order for both files – the first column specified for the first file will be compared against the first column specified for the second file, and so on. Defaults to comparing all columns. This flag can be specified multiple times for blocking.

Example

To match on the name and place columns from the first file against Person Name and Location from the second:

$ csvmatch \
  data1.csv \
  data2.csv \
  --fields1 name place \
  --fields2 'Person Name' Location \
  > matches.csv

The resulting matches include the single name-place pair which is the same in both files:

nameplacecodenamePerson NameLocation
Sam CollinsVietnamnoneSam CollinsVietnam

Ignorance

The ignores argument accepts one or more characteristics which should be disregarded for two records to be considered a match. This flag can be specified multiple times for blocking.

Combining different forms of ignorance can be quite powerful. The order in which you specify them is not significant.

case ignores how text is capitalised.

Example
$ csvmatch \
    data1.csv \
    data2.csv \
    --fields1 name \
    --fields2 'Person Name' \
    --ignore case \
    > matches.csv

The resulting matches include George Smiley, whose surname is in all-capitals in the second file:

nameplacecodenamePerson NameLocation
George SmileyLondonBeggermanGeorge SMILEYLondon
Roy BlandLondonSoldierRoy BlandUK
Sam CollinsVietnamnoneSam CollinsVietnam

nonalpha ignores anything that isn't a number or a letter.

Example
$ csvmatch \
    data1.csv \
    data2.csv \
    --fields1 name \
    --fields2 'Person Name' \
    --ignore nonalpha \
    > matches.csv

The resulting matches include Bill Haydon, whose name is written with a hypen in the second file:

nameplacecodenamePerson NameLocation
Bill HaydonLondonTailorBill-HaydonLondon
Roy BlandLondonSoldierRoy BlandUK
Sam CollinsVietnamnoneSam CollinsVietnam

nonlatin ignores nonlatin characters – so a character such as an é will be seen as if it were an e.

Example
$ csvmatch \
    data1.csv \
    data2.csv \
    --fields1 name \
    --fields2 'Person Name' \
    --ignore nonlatin \
    > matches.csv

The resulting matches include Percy Alleline, whose name is written with several diacritics in the first file:

nameplacecodenamePerson NameLocation
Perçy AllélíneLondonTinkerPercy AllelineLondon
Roy BlandLondonSoldierRoy BlandUK
Sam CollinsVietnamnoneSam CollinsVietnam

words-leading ignores all words except the last. This is useful for matching on surnames only.

Example
$ csvmatch \
    data1.csv \
    data2.csv \
    --fields1 name \
    --fields2 'Person Name' \
    --ignore words-leading \
    > matches.csv

The resulting matches include Toby and Tony Esterhase, Jim Prideaux and Mr Jim Prideaux, Kretzschmar and Claus Kretzschmar, as well as Roy and Richard Bland:

nameplacecodenamePerson NameLocation
Toby EsterhaseViennaPoormanTony EsterhaseVienna
Jim PrideauxSlovakianoneMr Jim PrideauxCzech Republic
KretzschmarHamburgnoneClaus KretzschmarHamburg
Roy BlandLondonSoldierRichard BlandLondon
Roy BlandLondonSoldierRoy BlandUK
Sam CollinsVietnamnoneSam CollinsVietnam

words-tailing ignore all words except the first.

Example
$ csvmatch \
    data1.csv \
    data2.csv \
    --fields1 name \
    --fields2 'Person Name' \
    --ignore words-tailing \
    > matches.csv

The resulting matches include Peter Guillam Esq and Peter Guillam, Roy Bland and Roy Rodgers, as well as the two capitalisations of George Smiley:

nameplacecodenamePerson NameLocation
Peter Guillam EsqBrixtonnonePeter GuillamBrixton
Roy BlandLondonSoldierRoy RodgersRomania
George SmileyLondonBeggermanGeorge SMILEYLondon
Roy BlandLondonSoldierRoy BlandUK
Sam CollinsVietnamnoneSam CollinsVietnam

words-order ignores the order in which the words are given. This is useful for matching names given surname-first with those given forename-first.

Example
$ csvmatch \
    data1.csv \
    data2.csv \
    --fields1 name \
    --fields2 'Person Name' \
    --ignore words-order \
    > matches.csv

The resulting matches include Oliver Lacon, whose name is written surname-first in the second file:

nameplacecodenamePerson NameLocation
Oliver LaconLondonnoneLacon OliverCambridge
Roy BlandLondonSoldierRoy BlandUK
Sam CollinsVietnamnoneSam CollinsVietnam

titles ignores common English name prefixes such as Mr and Ms. There is a full list of these titles here.

Example
$ csvmatch \
    data1.csv \
    data2.csv \
    --fields1 name \
    --fields2 'Person Name' \
    --ignore titles \
    > matches.csv

The resulting matches include Jim Prideaux, who has the title 'Mr' in the second file:

nameplacecodenamePerson NameLocation
Jim PrideauxSlovakianoneMr Jim PrideauxCzech Republic
Roy BlandLondonSoldierRoy BlandUK
Sam CollinsVietnamnoneSam CollinsVietnam

regex ignores terms specific to your data using a given regular expression. This is specified inline: regex=EXPRESSION.

Example

To use the regular expression Esq$ to ignore the word 'Esq' where it appear at the end of a value:

$ csvmatch \
    data1.csv \
    data2.csv \
    --fields1 name \
    --fields2 'Person Name' \
    --ignore 'regex= Esq$' \
    > matches.csv

The resulting matches include Peter Guillam, who has the name suffix 'Esq' in the first file:

nameplacecodenamePerson NameLocation
Peter Guillam EsqBrixtonnonePeter GuillamBrixton
Roy BlandLondonSoldierRoy BlandUK
Sam CollinsVietnamnoneSam CollinsVietnam

Methods & thresholds

The methods argument accepts one or more methods. This lets you specify the algorithm which is used to do the matching. Multiple methods are used for blocking.

There are three different categories of method:

  • Compared methods work by comparing every row from the first file with every row from the second, producing a number that represents the degree of the match. This means the amount of time required to run a match grows exponentially with the size of the input files. However, they are still useful for larger matches when using blocking.
  • Applied methods transform text into a different representation before they are matched up. These methods are quicker than compared ones, though no meaningful matching degree number is generated – either they match or they don't.
  • Custom methods have their own individual approach. CSV Match only has one, Bilenko. It generates a matching degree number.

For those matching methods that generate a matching degree number there is then a threshold filter for any two records to be considered to be a match – you can adjust this with the --threshold argument, which accepts one or more numbers between 0.0 and 1.0, defaulting to 0.6. Multiple thresholds are also used for blocking.

You can also include the matching degree number as a column by specifying it in the outputs.

[!WARNING] When working with names of people, exact matches, even when other pieces of information such as birthdays are included, are not a guarantee that the two names actually refer to the same human. Furthermore, the chance of a mismatch is unintuitively high – as illustrated by the birthday paradox.

literal is the default – it matches strings exactly, after ignored characteristics have been taken into account.

levenshtein Uses the Damerau-Levenshtein string distance metric that counts the number of changes that would have to be made to transform one string into another. Performs compared matching. Where two strings are of different lengths the longer string is used as the denominator for the threshold filter. Good at picking up typos and other small differences in spelling.

Example
$ csvmatch \
    data1.csv \
    data2.csv \
    --fields1 name \
    --fields2 'Person Name' \
    --method levenshtein \
    > matches.csv

The resulting matches include various names with small typographical differences, though the most emblematic of this matching method would be Toby and Tony Esterhase:

nameplacecodenamePerson NameLocation
Sam CollinsVietnamnoneSam CollinsVietnam
Roy BlandLondonSoldierRoy BlandUK
George SmileyLondonBeggermanGeorge SMILEYLondon
Bill HaydonLondonTailorBill-HaydonLondon
Perçy AllélíneLondonTinkerPercy AllelineLondon
KretzschmarHamburgnoneClaus KretzschmarHamburg
Jim PrideauxSlovakianoneMr Jim PrideauxCzech Republic
Peter Guillam EsqBrixtonnonePeter GuillamBrixton
Toby EsterhaseViennaPoormanTony EsterhaseVienna

jaro uses the Jaro-Winkler string distance metric that counts characters in common, though it considers differences near the start of the string to be more significant than differences near the end. Performs compared matching. It tends to work better than Levenshtein for shorter strings of text.

Example
$ csvmatch \
    data1.csv \
    data2.csv \
    --fields1 name \
    --fields2 'Person Name' \
    --method jaro \
    > matches.csv

The resulting matches includes many more matches than levenshtein, though also many more false positives:

nameplacecodenamePerson NameLocation
Sam CollinsVietnamnonePercy AllelineLondon
Sam CollinsVietnamnoneSam CollinsVietnam
Roy BlandLondonSoldierRichard BlandLondon
Roy BlandLondonSoldierRoy RodgersRomania
Roy BlandLondonSoldierBill-HaydonLondon
Roy BlandLondonSoldierRoy BlandUK
George SmileyLondonBeggermanGeorge SMILEYLondon
Bill HaydonLondonTailorBill-HaydonLondon
Bill HaydonLondonTailorRoy BlandUK
Perçy AllélíneLondonTinkerPeter GuillamBrixton
Perçy AllélíneLondonTinkerPercy AllelineLondon
KretzschmarHamburgnoneClaus KretzschmarHamburg
Jim PrideauxSlovakianoneMr Jim PrideauxCzech Republic
Peter Guillam EsqBrixtonnonePeter GuillamBrixton
Toby EsterhaseViennaPoormanTony EsterhaseVienna
Toby EsterhaseViennaPoormanRoy RodgersRomania
Connie SachsOxfordnoneKonny SaksOxford

metaphone uses the Double Metaphone phonetic encoding algorithm to convert words into a representation of how they are pronounced. Performs applied matching. Tends to work better for data which has been transcribed or transliterated.

Example
$ csvmatch \
    data1.csv \
    data2.csv \
    --fields1 name \
    --fields2 'Person Name' \
    --method metaphone \
    > matches.csv

The resulting matches includes those with nonalphanumeric differences, as well as Connie Sachs and Konny Saks, names written quite differently that would be pronounced the same:

nameplacecodenamePerson NameLocation
Connie SachsOxfordnoneKonny SaksOxford
Roy BlandLondonSoldierRoy BlandUK
George SmileyLondonBeggermanGeorge SMILEYLondon
Sam CollinsVietnamnoneSam CollinsVietnam
Bill HaydonLondonTailorBill-HaydonLondon

bilenko uses Dedupe, a library built by Forest Gregg and Derek Eder based on the work of Mikhail Bilenko that will ask you to train it by asking whether different pairs of records should match. The information you give it is then extrapolated to match up the rest of the file. The more examples you give it, the better the results will be. At minimum, try to provide 10 positive matches and 10 negative matches. Performs custom matching.

Blocking

Blocking is the approach of performing multiple matches, with subsequent matches only applying to the subset of matches resulting from the previous match. This can make matches both quicker and more precise. This is an advanced topic, and can be ignored if you are happy with the quality of matches and are dealing with smaller files.

In a 'regular' match, you are really just matching using a single block. Each block is defined by: a list of fields for each file, a list of ignores, a method, and a threshold. To perform a blocked match CSV Match needs to know each of these things for each block. You specify these through list arguments, or through outer lists for those arguments where the block requires a list itself. If you specify one of these things less than the total number of blocks – such as if you had two blocks, but specified the threshold once – that value will then be used for all subsequent blocks.

Example

To specify a first block that does a case-insensitive literal match on surnames, then a second block performing a Levenshtein match on forenames:

$ csvmatch \
    data1.csv \
    data2.csv \
    --fields1 name \
    --fields2 'Person Name' \
    --ignore case words-leading \
    --ignore words-tailing \
    --method literal levenshtein \
    > matches.csv
nameplacecodenamePerson NameLocation
KretzschmarHamburgnoneClaus KretzschmarHamburg
George SmileyLondonBeggermanGeorge SMILEYLondon
Roy BlandLondonSoldierRoy BlandUK
Sam CollinsVietnamnoneSam CollinsVietnam

Outputs

The output argument accepts a list of column names which should appear in the output, each prefixed with a number and a dot indicating which file that field is from. They are case-sensitive, and can be in any order you desire. It defaults to all columns in the first file, followed by all columns in the second.

There are some special column definitions: 1* and 2* expand into all columns from the first and second files respectively, and degree will add a column with the matching degree number.

Example

To include every column from the second file, followed by the codename column from the first, followed by the matching degree:

$ csvmatch \
    data1.csv \
    data2.csv \
    --fields1 name \
    --fields2 'Person Name' \
    --method levenshtein \
    --output '2*' 1.codename degree \
    > matches.csv
Person NameLocationcodenamedegree
Sam CollinsVietnamnone1.0
Roy BlandUKSoldier1.0
George SMILEYLondonBeggerman0.61538464
Bill-HaydonLondonTailor0.90909094
Percy AllelineLondonTinker0.78571427
Claus KretzschmarHamburgnone0.64705884
Mr Jim PrideauxCzech Republicnone0.8
Peter GuillamBrixtonnone0.7647059
Tony EsterhaseViennaPoorman0.9285714

Join types

The join argument takes a string that indicates what other nonmatching records should be included in the output. A left-outer join will return everything from the first file, whether there was a match or not, a right-outer to do the same but for the second file, and a full-outer to return everything from both files. Where two rows didn't match the values will be blank. Defaults to an inner join, where only successful matches are returned.

Example

To include all rows from the first file, but only those that match from the second:

$ csvmatch \
    data1.csv \
    data2.csv \
    --fields1 name \
    --fields2 'Person Name' \
    --join left-outer \
    > matches.csv
nameplacecodenamePerson NameLocation
Roy BlandLondonSoldierRoy BlandUK
Sam CollinsVietnamnoneSam CollinsVietnam
George SmileyLondonBeggerman
Bill HaydonLondonTailor
Perçy AllélíneLondonTinker
KretzschmarHamburgnone
Oliver LaconLondonnone
Jim PrideauxSlovakianone
Peter Guillam EsqBrixtonnone
Toby EsterhaseViennaPoorman
Connie SachsOxfordnone

FAQs


Did you know?

Socket

Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.

Install

Related posts

SocketSocket SOC 2 Logo

Product

  • Package Alerts
  • Integrations
  • Docs
  • Pricing
  • FAQ
  • Roadmap
  • Changelog

Packages

npm

Stay in touch

Get open source security insights delivered straight into your inbox.


  • Terms
  • Privacy
  • Security

Made with ⚡️ by Socket Inc