Table-Differ: Multi-Database Supported Table Comparison Utility
Overview
Table Differ is a table comparison utility that aims to assist in validating changes made to large tables within several supported databases.
Table Differ provides a table containing the changed and modified rows between two large tables called diff_table. This diff_table is then
used to perform simple reporting on the type of changes that were made between the two tables. The primary goal of the Table Differ is to be as maleable
as possible in order to be usable in as many situations as possible.
Table Differ is not necessarily intented to be used on a history of two tables, but rather to be used on any two tables with an emphasis on being able to accept
any number of key columns to track rows by, along with any number of comparison columns to look for differences within. The columns to compare can also be supplied
into the Table Differ by way of 'ignored columns' where any column besides the ones specified will be looked at for when dealing with wide tables.
As of right now the following databases supported by Table Differ are:
- SQLite
- PostgreSQL
- DuckDB
- DataBricks -> only usable through notebooks and not the command line
The following databases are in the process of being supported:
TABLE OF CONTENTS
-
-
-
Installation
-
CLI Usage
- Running Table Differ
- Connecting to a Database
- Use Cases of Table Differ
-
Notebook Usage
- Running Table Differ
- Example Usage
- Default Notebook Arguments
-
Data-Profiler
-
Arguments and Configs
-
diff_table capabilities
-
-
Usage
Important Notes
While running Table Differ is very straightforward, there are a few important notes to understand about how it works.
-
Table Differ will attempt to create a new table within the database that it is pulling the two comparison tables from. Depending on your pipeline this could potentially cause issues.
-
When Table Differ runs, it will attempt to replace any table under the same schema with the name of the specified diff table prefix (defaults to __diff_table__
) + the name of the secondary table.
-
After Table Differ is finished running, it does not drop the table that it has created within your database, and only replaces that table at the beginning of the next run so that the diff table can be queried if needed between usages.
Installation
pip install table-differ
Note that from within a Databricks notebook, you will need to use a Databricks magic command to run the import statement directly (%
)
Command Line Usage
Connecting to a Database
Depending on the database that Table-Differ will attempt to connect to, the process of connection may be slightly different.
If using PostgreSQL or MySQL , Table-Differ will automatically search for a .pgpass or .my.cnf file for the required connection details.
If using a database locally, Table-Differ accepts an additional optional requirement to use a directory pathing that is pulled from the config.yaml file.
Notebook Usage
Connecting to table-differ
After installing table-differ
through pip directly into your notebook environment, you will need to import the required class to be able to make calls.
from Table_Differ.table_differ_notebook import TableDiffer
After importing, table-differ
can be called by supplying the TableDiffer
class with two arguments:
-
conn
which requires some sort of connection object. This is initially built out in a DataBricks environment, so a SparkSession object would be passed into this field
-
args
requires a dictionary of arguments. This dictionary can be sourced from wherever you like, as long as it contains the required fields.
These arguments are outlined below in the Arguments and Configs
section of this document.
Example Usage
%python
%pip install table-differ
from Table_Differ.table_differ_notebook import TableDiffer
from pyspark.sql import SparkSession
spark = (
SparkSession.builder
.getOrCreate()
)
args = {
"table_initial": "catalog.schema.table_name",
"table_secondary": "catalog.schema.table_name",
"key_cols": ["key_col_1"],
"comp_cols": ["comp_col_1", "comp_col_2", "comp_col_3",]
}
TableDiffer(conn = spark, args = args)
It's important to note that any field within Arguments
or Configs
can be added to this arg dictionary, but it needs to be the fully spelled out name and not just the prefix.
Default Arguments for Notebook Usage
# Database
db_host: str = None -> db arguments are not used in notebook and not required
db_port: str = None
db_name: str = None
db_user: str = None
db_path: str = None
db_type: str = "databricks"
# Table Info
table_initial: str -> table_initial/secondary are Required
table_secondary: str
table_diff: str = "__diff_table__" + str(date.today()) + "__"
schema_name: str = "*******_dev" -> Actual value hidden. Contact owner for more information
key_cols: list[str] -> key_cols are Required
comp_cols: list[str] = None -> comp_cols are Required
ignore_cols: list[str] = None
initial_table_alias: str = "origin"
secondary_table_alias: str = "comparison"
except_rows: list[str] = None
# System
local_db: bool = False
print_tables: bool = False
report_mode: bool = False
print_query: bool = True -> When set to 'True' table-differ will sys.exit(0) after printing query
gnome: bool = False
log_level: str = "warning"
Data-Profiler
Note: This is a new feature and currently being developed further
Default Arguments for Data Profiler
# System
table_name: -> Required
table_schema: -> Required
table_catalog: str -> Required
where_clause: str = None
Arguments and Configs
The developement of Table Differ focuses on making a product that is as flexible and adaptable as possible, and because of this uses several potential arguments.
Example Run of Table-Differ
./main.py -c <col_1> <col_2> <col_n> -d <database> -t <first_table> <second_table> -k <key_col> -l <log_level>
Required Arguments
-k --key_cols Key Columns to track rows by within **both** tables.
This field takes n number of arguments
Optional Arguments
-c --comparison_cols columns to be specifically compared.
(columns not specified will **not** be looked at when creating the diff table).
This field takes n number of arguments.
**OR**
-i --ignore_cols columns to be specifically ignored.
(columns not specified **will** be looked at when creating the diff table).
This field takes n number of arguments.
--config-file Sources arguments from specified file
|
| Note that while --config-file and -c / -i are Optional Arguments, one must be present
|
--db_type The type of database that Table Differ will attempt to connect to.
Supported DBs: sqlite, postgres, duckdb, databricks
--except_rows Rows to be excluded from the creation of diff_table by their key.
This exists in case you would like to skip over a specific row.
This field take n number of arguments.
-l --logging_level Sets the logging level of Table Differ.|debug, info, warning, error, critical|warning|
-p --print_tables Prints the tables used in the creation of the diff_table to the console using Rich
This is useful in testing and should be used only on very small tables.
--local-db Specifies that the db path is to a local database on the running machine
--report-mode Flag to only run reporting on most recent __diff_table__.
This will not create or replace a new __diff_table__, but will still query the original
2 tables for some reports.
Configs (stored within the configs.yaml file)
db_host Hosting URL for database being used.
db_port Port number for database connection.
db_name Name of database being connected to.
db_user Username of user for database.
db_type Type of database being connected to.
db_path Path of database
schema_name Name of schema being used.
Default is set to 'public'.
table_initial Name of the first table that will be used to create the diff_table.
This can be supplied either here in the config file or as an argument if the --configs argument is set to 'n'.
table_secondary Name of the second table that will be used to create the diff_table.
This can be supplied either here in the config file or as an argument if the --configs argument is set to 'n'.
diff_table_prefix Value that Table-Differ will name the created diff table as. Default is __diff_table__.
key_columns Key columns that Table Differ will query the selected tables by.
This can be supplied either here in the config file or as an argument if the --configs argument is set to 'n'.
This field accepts n number of fields
compare_cols Columns to compare
ignore_cols Columns to ignore
initial_table_alias Placeholder name of the first table being queried in creation of the diff_table.
Default is set to 'origin'.
secondary_table_alias Placeholder name of the second table being queried in creation of the diff_table.
Default is set to 'comparison'.
Use Cases of Table Differ
While Table Differ obviously works very well at comparing a history of a single table, it is not limited to just that.
Because of the emphasis on flexibility and usability, Table Differ is designed to be used in any case where
you need to see the specific differences between two tables within a database.
diff_table capabilities
Rows that are included into the diff_table have to meet specific conditions:
- Rows that have a changed value from the initial to secondary table
- Rows that exist in the initial table but not in the secondary
- Rows that exist in the secondary table but not in the initial
Note that all rows are compared based on given key columns, and only columns specified (or in the case of ignored columns, not specified) will be looked into for changed.
If a change on a row exists but in a column not specified, it will not be added to the diff_table based on that change.
CREDITS
- Benjamin Farmer - San Juan Data LLC. (2023)
License
Copyright (c) [2023] [Benjamin Farmer]
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.