csv-string-optimization
As part of the Technologiestiftung Berlin's open data agenda we are working on tools to support citizens deal with open data.
The string optimization module is trying to identify similar names in CSV column and allows you to reduce those to the most common string or another of your choice. It can be used as a module in other node.js scripts or as a command line tool.
The overall concept is strongly inspired by Open Refine.
A documentation of the code is available here: http://technologiestiftung.github.io/csv-string-optimization
Concept
At the heart of the module are two methods to identify similar strings.
Fingerprinting
Fingerprinting is using either phonetic or normal fingerprinting to create an abstracted string from the original string in order to quickly compare strings with one another.
KNN
KNN is a lot more processing intensive. Therefore before the distance between to strings is calculated (Levenshtein), the strings are organized into groups based on common ngrams (default size : 6).
Templates
The module is build on the concept of templates that are easy to manipulate manually. After a CSV is analysed the module creates a Template (json). This template includes groups for similar strings. The user can then decide where the grouping was done correctly and which string should be the replacement string. Using the template the user can then clean the original CSV.
In many cases when dealing with open data, you might want to update your project when new data is released. For this case you can run the analysis function again and then merge the new template with the old one and simply see if new cases were correctly matched with your existing decisions. And then use the resulting merge-template to build a new clean CSV.
Install
npm install csv-string-optimization
Usage
Node.js Package
The most common case is loading a csv and then identifying duplicates and then transforming the original
const csvOpti = require('csv-string-optimization')
csvOpti.dsv(__dirname + '/data/data-2.csv', ',')
.then(data => {
let column_name = 'name',
column = csvOpti.extractColumn(data, column_name)
let fp_template = csvOpti.createTemplate(
csvOpti.fingerprint.readableCluster(
csvOpti.fingerprint.cluster(
csvOpti.fingerprint.analyse(
column
)
)
)
)
csvOpti.save(__dirname + '/output/fp-template-'+i+'.json', fp_template)
csvOpti.saveCsv(__dirname + '/output/fp-cleaned-'+i+'.csv', csvOpti.cleanFile(d, JSON.parse(fp_template), column_name))
let reduced_column = csvOpti.knn.reduce(column),
clusters = csvOpti.knn.prepare(reduced_column)
let knn_template = csvOpti.createTemplate(
csvOpti.knn.readableCluster(
csvOpti.knn.cluster(
csvOpti.knn.analyse(
clusters, reduced_column, 0.1
)
),
reduced_column, column
)
)
csvOpti.save(__dirname + '/output/knn-template-'+i+'.json', knn_template)
csvOpti.saveCsv(__dirname + '/output/knn-cleaned-'+i+'.csv', csvOpti.cleanFile(d, JSON.parse(knn_template), column_name))
}).catch(err => {
console.log('err', err)
})
You cannot only analyse and transform a whole file, you can also use the underlying methods, e.g. the fingerprinting function:
const csvOpti = require('csv-string-optimization')
let str = 'Ich denk, dass ist eine feine Sache! äöüÄÖÜß.:-)'
console.log(str, csvOpti.fingerprint.key(str))
console.log(str, csvOpti.fingerprint.key(str, 'phonetic'))
console.log('Bezirksamt Neukölln', csvOpti.fingerprint.key('Bezirksamt Neukölln'))
Command line
If this does not work after installing the package, you might need to run:
npm link
Analyse File
csvStrOpti-analyse -c name -f /PATH-TO.csv -t /OUTPUT-PATH-TO-TEMPLATE.json -d ";"
Clean File
csvStrOpti-clean -c name -f /PATH-TO.csv -t /PATH-TO-TEMPLATE.json -d ";" -o /OUTPUT-PATH-CLEANED.csv
Merge File
csvStrOpti-merge -t /PATH-TO-OLD-TEMPLATE.json -n /PATH-TO-NEW-TEMPLATE.json -o /OUTPUT-PATH-TO-MERGED-TEMPLATE.json
Analyse & Clean File
csvStrOpti-analyse -c name -f /PATH-TO.csv -d ";" | csvStrOpti-clean -o /OUTPUT-PATH-TO-CLEANED.csv
For more information on the various parameters for each command simply call
csvStrOpti-analyse --help
License
The library is provided under the MIT license, the test data in tests/data is not provided under MIT. The data is taken from https://www.berlin.de/sen/finanzen/service/zuwendungsdatenbank and is part of Berlin's open data.