CSV-Diff
CSV-Diff is a small library for performing diffs of tabular data, typically
data loaded from CSV files.
Unlike a standard diff that compares line by line, and is sensitive to the
ordering of records, CSV-Diff identifies common lines by key field(s), and
then compares the contents of the fields in each line.
Data may be supplied in the form of CSV files, or as an array of arrays.
More complex usage also allows you to specify XPath expressions to extract
tabular data from XML documents for diffing.
The diff process provides a fine level of control over what to diff, and can
optionally ignore certain types of changes (e.g. adds, deletes, changes in
position etc).
CSV-Diff is particularly well suited to data in parent-child format. Parent-
child data does not lend itself well to standard text diffs, as small changes
in the organisation of the tree at an upper level can lead to big movements
in the position of descendant records. By instead matching records by key,
CSV-Diff avoids this issue, while still being able to detect changes in
sibling order.
Usage
CSV-Diff is supplied as a gem, and has no dependencies. To use it, simply:
```
gem install csv-diff
```
To compare two CSV files where the field names are in the first row of the file,
and the first field contains the unique key for each record, simply use:
```ruby
require 'csv-diff'
diff = CSVDiff.new(file1, file2)
```
The returned diff object can be queried for the differences that exist between
the two files, e.g.:
puts diff.summary.inspect
puts diff.adds.inspect
puts diff.deletes.inspect
puts diff.updates.inspect
puts diff.moves.inspect
puts diff.diffs.inspect
puts diff.warnings.inspect
Unique Row Identifiers
CSVDiff is preferable over a standard line-by-line diff when row order is
significantly impacted by small changes. The classic example is a parent-child
file generated by a hierarchy traversal. A simple change in position of a parent
member near the root of the hierarchy will have a large impact on the positions
of all descendant rows. Consider the following example:
Root
|- A
| |- A1
| |- A2
|
|- B
|- B1
|- B2
A hierarchy traversal of this tree into a parent-child format would generate a CSV
as follows:
Root,A
A,A1
A,A2
Root,B
B,B1
B,B2
If the positions of A and B were swapped, a hierarchy traversal would now produce a CSV
as follows:
Root,B
B,B1
B,B2
Root,A
A,A1
A,A2
A simple diff using a diff utility would highlight this as 3 additions and 3 deletions.
CSVDiff, however, would classify this as 2 moves (a change in sibling position for A and B).
In order to do this, CSVDiff needs to know what field(s) confer uniqueness on each row.
In this example, we could use the child field alone (since each member name only appears
once); however, this would imply a flat structure, where all rows are children of a single
parent. This in turn would cause CSVDiff to classify the above change as a Move (i.e. a
change in order) of all 6 rows.
The more correct specification of this file is that column 0 contains a unique parent
identifier, and column 1 contains a unique child identifier. CSVDiff can then correctly
deduce that there are in fact only two changes in order - the swap in positions of A and
B below Root.
Note: If you aren't interested in changes in the order of siblings, then you could use
CSVDiff with a :key_field option of column 1, and specify the :ignore_moves option.
Warnings
When processing and diffing files, CSVDiff may encounter problems with the data or
the specifications it has been given. It will continue even in the face of problems,
but will log details of the problems in a #warnings Array. The number of warnings
will also be included in the Hash returned by the #summary method.
Warnings may be raised for any of the following:
- Missing fields: If the right/to file contains fields that are not present in the
left/from file, a warning is raised and the field is ignored for diff purposes.
- Duplicate keys: If two rows are found that have the same values for the key field(s),
a warning is raised, and the duplicate values are ignored.
Examples
The simplest use case is as shown above, where the data to be diffed is in CSV files
with the column names as the first record, and where the unique key is the first
column in the data. In this case, a diff can be created simply via:
```ruby
diff = CSVDiff.new(file1, file2)
```
Specifying Unique Row Identifiers
Often however, rows are not uniquely identifiable via the first column in the file.
In a parent-child hierarchy, for example, combinations of parent and child may be
necessary to uniquely identify a row, while in other cases a combination of fields
may be needed to derive a natural unique key or identifier for each row.
In these cases, it is necessary to indicate to CSVDiff which fields are needed to
uniquely identify common rows across the two files. This can be done in several
different ways.
:key_field(s)
The first method is using the key_fields option (or key_field if you have only a
single key field). Use this option when your data represents a flat structure rather
than a parent-child hierarchy or flattened tree. You can specify key_fields using
field numbers/column indices (0-based):
```ruby
diff = CSVDiff.new(file1, file2, key_fields: [0, 1])
```
Alternatively, you can use the :key_fields options with column names (provided CSVDiff
knows the names of your fields, either via the field_names option or from headers
in the file):
```ruby
diff = CSVDiff.new(file1, file2, key_fields: ['First Name', 'Last Name'])
```
:parent_field(s)/:child_field(s)
The second method for identifying the unique identifiers in your file is to use the
:parent_fields and :child_fields options. Use this option when your data represents
a tree structure flattened to a table in parent-child form.
Using the :parent_fields and :child_fields with field numbers:
```ruby
diff = CSVDiff.new(file1, file2, parent_field: 1, child_fields: [2, 3])
```
Using the :parent_fields and :child_fields with column names:
```ruby
diff = CSVDiff.new(file1, file2, parent_field: 'Date', child_fields: ['HomeTeam', 'AwayTeam'])
```
Using Non-CSV Sources
Data from non-CSV sources can be diffed, as long as it can be supplied as an Array
of Arrays:
DATA1 = [
['Parent', 'Child', 'Description'],
['A', 'A1', 'Account 1'],
['A', 'A2', 'Account 2']
]
DATA2 = [
['Parent', 'Child', 'Description'],
['A', 'A1', 'Account1'],
['A', 'A2', 'Account2']
]
diff = CSVDiff.new(DATA1, DATA2, parent_field: 1, child_field: 0)
Data can also be diffed if it is an XML source, although this requires a little
more effort to tell CSVDiff how to transform/extract content from the XML document
into an array-of-arrays form. It also introduces a dependency on Nokogiri - you
will need to install this gem to use CSVDiff with XML sources.
The first step is to use the CSVDiff::XMLSource class to define how to convert
your XML content to an array-of-arrays. The XMLSource class is quite flexible,
and can be used to convert single or multiple XML sources into a single data set
for diffing, and different documents may even have different layouts.
The first step is to create an XMLSource object, which requires a label to
identify the type of data it will generate:
xml_source_1 = CSVDiff::XMLSource.new('My Label')
Next, we pass XML documents to this source, and specify XPath expressions for each
row and column of data to produce via the process(rec_xpath, field_maps, options)
method:
- An XPath expression is provided to select each node value in the document that
will represent a row. Taking an HTML table as an example of something we wanted
to parse, your rec_xpath value might be something like the following:
'//table/tbody/tr'
. This would locate all tables in the document, and create
a new row of data in the XMLSource every time a <tr>
tag was encountered. - A hash of field_maps is then provided to describe how to generate column values
for each row of data. The keys to field_maps are the names of the fields to be
output, while the values are the epression for how to generate values. Most
commonly, this will be another XPath expression that is evaluated in the context
of the node returned by the row XPath expression. So continuing our HTML example,
we might use
'./td[0]/text()'
as an expression to select the content of the
first <td>
element within the <tr>
representing the current row.
xml_source1.process('//table/tbody/tr',
col_A: './td[0]/text()',
col_B: './td[1]/text()',
col_C: './td[2]/text()')
Finally, to diff two XML sources, we create a CSVDiff object with two XMLSource
objects as the source:
diff = CSVDiff.new(xml_source1, xml_source2, key_field: 'col_A')
Specifying Column Names
If your data file does not include column headers, you can specify the names of
each column when creating the diff. The names supplied are the keys used in the
diff results:
DATA1 = [
['A', 'A1', 'Account 1'],
['A', 'A2', 'Account 2']
]
DATA2 = [
['A', 'A1', 'Account1'],
['A', 'A2', 'Account2']
]
diff = CSVDiff.new(DATA1, DATA2, key_fields: [1, 0], field_names: ['Parent', 'Child', 'Description'])
If your data file does contain a header row, but you wish to use your own column
names, you can specify the :field_names option and the :ignore_header option to
ignore the first row.
Ignoring Fields
If your data contains fields that you aren't interested in, these can be excluded
from the diff process using the :ignore_fields option:
diff = CSVDiff.new(file1, file2, parent_field: 'Date', child_fields: ['HomeTeam', 'AwayTeam'],
ignore_fields: ['CreatedAt', 'UpdatedAt'])
Filtering Rows
If you need to filter source data before running the diff process, you can use the :include
and :exclude options to do so. Both options take a Hash as their value; the hash should have
keys that are the field names or indexes (0-based) on which to filter, and whose values are
regular expressions or lambdas to be applied to values of the corresponding field. Rows will
only be diffed if they satisfy :include conditions, and do not satisfy :exclude conditions.
diff = CSVDiff.new(file1, file2, parent_field: 'Date', child_fields: ['HomeTeam', 'AwayTeam'],
include: {HomeTeam: 'Arsenal'}, exclude: {Referee: /Clattenburg/})
diff = CSVDiff.new(file1, file2, parent_field: 'Date', child_fields: ['HomeTeam', 'AwayTeam'],
include: {Date: lambda{ |d| holiday_period.include?(Date.strptime(d, '%y/%m/%d')) } })
Ignoring Certain Changes
CSVDiff identifies Adds, Updates, Moves and Deletes; any of these changes can be selectively
ignored, e.g. if you are not interested in Deletes, you can pass the :ignore_deletes option:
diff = CSVDiff.new(file1, file2, parent_field: 'Date', child_fields: ['HomeTeam', 'AwayTeam'],
ignore_fields: ['CreatedAt', 'UpdatedAt'],
ignore_deletes: true, ignore_moves: true)