Security News
Introducing the Socket Python SDK
The initial version of the Socket Python SDK is now on PyPI, enabling developers to more easily interact with the Socket REST API in Python projects.
A package for comparing CSV-like files through union and difference operations.
CSVUnionDiff is an open-source library for comparing CSV-like files through union and difference operations.
To install through command-line, use
python -m pip install csvuniondiff
To view available options for the command-line tool, use
csvuniondiff -h
To use the package in python, do
from csvuniondiff import ...
where ...
can be replaced with whatever is available from the package.
Currently supported command line options are:
options:
-h, --help show this help message and exit
--version print the version of this package
--diff DIFF DIFF use the diff command, takes 2 files as arguments
--union UNION UNION use the union command, takes 2 files as arguments
-a, --align-columns aligns common columns on the left sorted
-c [USE_COLUMNS ...], --use-columns [USE_COLUMNS ...]
only use these columns for comparison
--ignore-columns [IGNORE_COLUMNS ...]
do not use these columns for comparison
-f [FILL_NULL], --fill-null [FILL_NULL]
fills null option value so that they can be compared, default is 'NULL'
-d, --drop-null drop rows with nulls
-D, --drop-duplicates
drop duplicate rows
-i INPUT_DIR, --input-dir INPUT_DIR
use this directory path as the base for the path to the files
-o OUTPUT_DIR, --output-dir OUTPUT_DIR
save outputs from the script to this directory
-m, --match-rows use the match rows algorithm for comparison
-k [KEEP_COLUMNS ...], --keep-columns [KEEP_COLUMNS ...]
only keep these columns in the final result
-C, --use-common-columns
use the maximal set of common columns for comparison
--dont-add-timestamp don't add a timestamp directory when outputting files
--disable-printing disable printing to stdout
--print-prepared print the prepared df before comparison
--save-file-extension SAVE_FILE_EXTENSION
the extension for output files (csv, xlsx, json, xml, or html)
-r, --row-counts use the counts of each unique row in the final result instead
test2.csv:
column7 | column1 | column2 | column3 | column4 | column5 | column6 | |
---|---|---|---|---|---|---|---|
0 | value7 | value1 | value2 | value3 | value4 | value5 | value6 |
1 | value14 | value8 | value9 | value10 | value11 | value12 | value13 |
2 | value21 | value15 | value16 | value17 | value18 | value19 | value20 |
test4.csv:
column4 | column3 | column2 | column1 | |
---|---|---|---|---|
0 | value3 | value2 | value1 | |
1 | value9 | value8 | value6 | |
2 | value14 | value11 |
Input
csvuniondiff
--input-dir csvuniondiff/tests/test-data/random/
--union test2.csv test4.csv
--match-rows # use the match rows algorithm
--fill-null value4 # fills nulls with 'value4'
--align-columns # align common columns sorted on the left
--use-common-columns # compare using all common columns
Output
Timestamp: 2024-07-10 14:27:45.402980
Input directory: csvuniondiff/tests/test-data/random/
union(
args
----
left_input: ['test2.csv']
right_input: ['test4.csv']
data_save_file_extensions: ['csv']
options
-------
align_columns: True
fill_null: value4
match_rows: True
enable_printing: True
add_save_timestamp: True
use_common_columns: True
)
Intersecting rows from test2.csv (1, 7):
column1 column2 column3 column4 column7 column5 column6
0 value1 value2 value3 value4 value7 value5 value6
Intersecting rows from test4.csv (1, 4):
column1 column2 column3 column4
0 value1 value2 value3 value4
Look here for input files.
Input
csvuniondiff
--input-dir csvuniondiff/tests/test-data/diff/testset-1/
--diff csv1.csv csv2.csv
Output
Timestamp: 2024-07-10 12:00:06.554911
Input directory: csvuniondiff/tests/test-data/diff/testset-1/
diff(
args
----
left_input: ['csv1.csv']
right_input: ['csv2.csv']
data_save_file_extensions: ['csv']
options
-------
enable_printing: True
add_save_timestamp: True
)
Only in csv1.csv (5, 3):
Name Age Email
3 Michael Wilson 32 michaelwilson@example.com
4 Michael Wilson 32 michaelwilson@example.com
5 Bob Thompson 35 bobthompson@example.com
6 Emily Davis 27 emilydavis@example.com
7 Michael Wilson 32 michaelwilson@example.com
Only in csv2.csv (3, 3):
Name Age Email
6 John Smith__1 35 johnsmith@example.com
7 Michael Johnson__1 32 michaeljohnson@example.com
8 Emily Davis__1 27 emilydavis@example.com
test1.csv
Name | Age | ||
---|---|---|---|
0 | John Doe | 25 | johndoe@example.com |
1 | Jane Smith | 30 | janesmith@example.com |
2 | Mark Johnson | 40 | markjohnson@example.com |
3 | Emily Davis | 35 | emilydavis@example.com |
4 | Michael Brown | 28 | michaelbrown@example.com |
5 | Sarah Wilson | 32 | sarahwilson@example.com |
6 | David Thompson | 45 | davidthompson@example.com |
7 | Jessica Martinez | 27 | jessicamartinez@example.com |
8 | Christopher Lee | 33 | christopherlee@example.com |
9 | Laura Taylor | 29 | laurataylor@example.com |
test2.csv
Name | Age | ||
---|---|---|---|
0 | John Doe | johndoe25@example.com | 25 |
1 | Jane Smith | janesmith30@example.com | 30 |
2 | Mark Johnson | markjohnson40@example.com | 40 |
3 | Emily Davis | emilydavis35@example.com | 35 |
4 | Jessica Martinez | jessicamartinez27@example.com | 27 |
5 | Christopher Lee | christopherlee33@example.com | 33 |
6 | Laura Taylor | laurataylor29@example.com | 29 |
7 | Brian Harris | brianharris33@example.com | 33 |
Input
import pandas as pd
from csvuniondiff.csvuniondiff import (
CsvUnionDiff,
ParallelInputArgs,
CommandOptions,
)
obj = CsvUnionDiff(
"./csvuniondiff/tests/test-data/diff/testset-2/",
None,
)
def left_df_trans(df: pd.DataFrame) -> pd.DataFrame:
def email_trans(row):
arr = row["Email"].split("@")
return arr[0] + str(row["Age"]) + "@" + arr[1]
df["Email"] = df.apply(email_trans, axis=1)
df = df[["Name", "Email", "Age"]]
return df
left_dfs, right_dfs = obj.diff(
args=ParallelInputArgs(
["test1.csv"],
["test2.csv"],
left_trans_funcs=[left_df_trans],
right_trans_funcs=[lambda x: x],
return_transformed_rows=False, # selects the rows from original table
),
options=CommandOptions(
match_rows=True,
enable_printing=True
),
)
left_df = left_dfs[0]
right_df = right_dfs[0] # use the results somewhere
Output
Timestamp: 2024-07-11 11:37:38.748144
Input directory: ./csvuniondiff/tests/test-data/diff/testset-2/
diff(
args
----
left_input: ['test1.csv']
right_input: ['test2.csv']
left_trans_funcs: [<function left_df_trans at 0x000001E40ACBA340>]
right_trans_funcs: [<function <lambda> at 0x000001E4259B3E20>]
options
-------
match_rows: True
enable_printing: True
)
Only in test1.csv (3, 3):
Name Age Email
4 Michael Brown 28 michaelbrown@example.com
5 Sarah Wilson 32 sarahwilson@example.com
6 David Thompson 45 davidthompson@example.com
Only in test2.csv (1, 3):
Name Email Age
7 Brian Harris brianharris33@example.com 33
Input
from csvuniondiff.csvuniondiff import (
CsvUnionDiff,
ParallelInputArgs,
CommandOptions,
)
obj = CsvUnionDiff(
input_dir="./csvuniondiff/tests/test-data/diff/testset-1/",
output_dir=None,
)
left_dfs, right_dfs = obj.diff(
args=ParallelInputArgs(
left_input=["test1.csv"],
right_input=["test2.csv"],
),
options=CommandOptions(
enable_printing=True,
add_save_timestamp=True,
)
)
only_in_test1, only_in_test2 = left_dfs[0], right_dfs[0] # use dataframe results somewhere
Output
Timestamp: 2024-07-10 13:23:35.239955
Input directory: ./csvuniondiff/tests/test-data/diff/testset-1/
diff(
args
----
left_input: ['test1.csv']
right_input: ['test2.csv']
options
-------
match_rows: True
enable_printing: True
add_save_timestamp: True
)
Only in test1.csv (5, 3):
Name Age Email
3 Michael Wilson 32 michaelwilson@example.com
4 Michael Wilson 32 michaelwilson@example.com
5 Bob Thompson 35 bobthompson@example.com
6 Emily Davis 27 emilydavis@example.com
7 Michael Wilson 32 michaelwilson@example.com
Only in test2.csv (5, 3):
Name Age Email
1 John Doe 25 johndoe@example.com
3 Jane Smith 30 janesmith@example.com
6 John Smith__1 35 johnsmith@example.com
7 Michael Johnson__1 32 michaeljohnson@example.com
8 Emily Davis__1 27 emilydavis@example.com
csvuniondiff --diff my.csv expected.csv --match-rows
to see the differences and the magnitude of the differences. I could also call csvuniondiff --union my.csv expected.csv --match-rows
to see what rows my SQL query is getting right.I had a case where I needed to check for the existence of rows with certain values in specific columns across many Excel files. I can make a dataframe with the columns and values that I am looking for:
Name | Age | ||
---|---|---|---|
0 | John Doe | 25 | johndoe@example.com |
I can put all of the Excel files in a directory and then run the union command with the above CSV against the target CSV's in the directory.
The files are slightly different but could be transformed to be compared.
You don't want to personally code out difference and union operations with match rows and stdout output.
To explain the match rows option, let's consider the following CSV tables:
csv1 | csv2 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
When matching rows in the diff operation
Therefore, with the match rows option, the results of the diff operation will be:
only in csv1 | only in csv2 | ||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
Using the union operation with match rows instead with csv1 and csv2, we get:
intersecting from csv1 | intersecting from csv2 | ||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
An example of where this might be useful is when you are cross-checking using the diff operation
and you want to know the magnitude of the rows that you've missed and the extra rows that you have
in your CSV when compared against an expected CSV. In this case, you can use
csvuniondiff --diff my.csv expected.csv --match-rows
to get the rows with duplicates or
csvuniondiff --diff my.csv expected.csv --match-rows --row-counts
to
get the count of each unique row in the result.
Feel free to open an issue if something isn't working properly or you think that another feature would be worth it to add.
FAQs
A package for comparing CSV-like files through union and difference operations.
We found that csvuniondiff demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.
Did you know?
Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.
Security News
The initial version of the Socket Python SDK is now on PyPI, enabling developers to more easily interact with the Socket REST API in Python projects.
Security News
Floating dependency ranges in npm can introduce instability and security risks into your project by allowing unverified or incompatible versions to be installed automatically, leading to unpredictable behavior and potential conflicts.
Security News
A new Rust RFC proposes "Trusted Publishing" for Crates.io, introducing short-lived access tokens via OIDC to improve security and reduce risks associated with long-lived API tokens.