
Security News
New CVE Forecasting Tool Predicts 47,000 Disclosures in 2025
CVEForecast.org uses machine learning to project a record-breaking surge in vulnerability disclosures in 2025.
Table Differ is a table comparison utility that aims to assist in validating changes made to large tables within several supported databases. Table Differ provides a table containing the changed and modified rows between two large tables called diff_table. This diff_table is then used to perform simple reporting on the type of changes that were made between the two tables. The primary goal of the Table Differ is to be as maleable as possible in order to be usable in as many situations as possible.
Table Differ is not necessarily intented to be used on a history of two tables, but rather to be used on any two tables with an emphasis on being able to accept any number of key columns to track rows by, along with any number of comparison columns to look for differences within. The columns to compare can also be supplied into the Table Differ by way of 'ignored columns' where any column besides the ones specified will be looked at for when dealing with wide tables.
As of right now the following databases supported by Table Differ are:
The following databases are in the process of being supported:
Installation
CLI Usage
Notebook Usage
Data-Profiler
Arguments and Configs
diff_table capabilities
While running Table Differ is very straightforward, there are a few important notes to understand about how it works.
Table Differ will attempt to create a new table within the database that it is pulling the two comparison tables from. Depending on your pipeline this could potentially cause issues.
When Table Differ runs, it will attempt to replace any table under the same schema with the name of the specified diff table prefix (defaults to __diff_table__
) + the name of the secondary table.
After Table Differ is finished running, it does not drop the table that it has created within your database, and only replaces that table at the beginning of the next run so that the diff table can be queried if needed between usages.
pip install table-differ
Note that from within a Databricks notebook, you will need to use a Databricks magic command to run the import statement directly (%
)
Depending on the database that Table-Differ will attempt to connect to, the process of connection may be slightly different. If using PostgreSQL or MySQL , Table-Differ will automatically search for a .pgpass or .my.cnf file for the required connection details. If using a database locally, Table-Differ accepts an additional optional requirement to use a directory pathing that is pulled from the config.yaml file.
After installing table-differ
through pip directly into your notebook environment, you will need to import the required class to be able to make calls.
from Table_Differ.table_differ_notebook import TableDiffer
After importing, table-differ
can be called by supplying the TableDiffer
class with two arguments:
conn
which requires some sort of connection object. This is initially built out in a DataBricks environment, so a SparkSession object would be passed into this field
args
requires a dictionary of arguments. This dictionary can be sourced from wherever you like, as long as it contains the required fields.
These arguments are outlined below in the Arguments and Configs
section of this document.
%python
%pip install table-differ
from Table_Differ.table_differ_notebook import TableDiffer
from pyspark.sql import SparkSession
spark = (
SparkSession.builder
.getOrCreate()
)
args = {
"table_initial": "catalog.schema.table_name",
"table_secondary": "catalog.schema.table_name",
"key_cols": ["key_col_1"],
"comp_cols": ["comp_col_1", "comp_col_2", "comp_col_3",]
}
TableDiffer(conn = spark, args = args)
It's important to note that any field within Arguments
or Configs
can be added to this arg dictionary, but it needs to be the fully spelled out name and not just the prefix.
# Database
db_host: str = None -> db arguments are not used in notebook and not required
db_port: str = None
db_name: str = None
db_user: str = None
db_path: str = None
db_type: str = "databricks"
# Table Info
table_initial: str -> table_initial/secondary are Required
table_secondary: str
table_diff: str = "__diff_table__" + str(date.today()) + "__"
schema_name: str = "*******_dev" -> Actual value hidden. Contact owner for more information
key_cols: list[str] -> key_cols are Required
comp_cols: list[str] = None -> comp_cols are Required
ignore_cols: list[str] = None
initial_table_alias: str = "origin"
secondary_table_alias: str = "comparison"
except_rows: list[str] = None
# System
local_db: bool = False
print_tables: bool = False
report_mode: bool = False
print_query: bool = True -> When set to 'True' table-differ will sys.exit(0) after printing query
gnome: bool = False
log_level: str = "warning"
Note: This is a new feature and currently being developed further
# System
table_name: -> Required
table_schema: -> Required
table_catalog: str -> Required
where_clause: str = None
The developement of Table Differ focuses on making a product that is as flexible and adaptable as possible, and because of this uses several potential arguments.
Example Run of Table-Differ
./main.py -c <col_1> <col_2> <col_n> -d <database> -t <first_table> <second_table> -k <key_col> -l <log_level>
Required Arguments
-k --key_cols Key Columns to track rows by within **both** tables.
This field takes n number of arguments
Optional Arguments
-c --comparison_cols columns to be specifically compared.
(columns not specified will **not** be looked at when creating the diff table).
This field takes n number of arguments.
**OR**
-i --ignore_cols columns to be specifically ignored.
(columns not specified **will** be looked at when creating the diff table).
This field takes n number of arguments.
--config-file Sources arguments from specified file
|
| Note that while --config-file and -c / -i are Optional Arguments, one must be present
|
--db_type The type of database that Table Differ will attempt to connect to.
Supported DBs: sqlite, postgres, duckdb, databricks
--except_rows Rows to be excluded from the creation of diff_table by their key.
This exists in case you would like to skip over a specific row.
This field take n number of arguments.
-l --logging_level Sets the logging level of Table Differ.|debug, info, warning, error, critical|warning|
-p --print_tables Prints the tables used in the creation of the diff_table to the console using Rich
This is useful in testing and should be used only on very small tables.
--local-db Specifies that the db path is to a local database on the running machine
--report-mode Flag to only run reporting on most recent __diff_table__.
This will not create or replace a new __diff_table__, but will still query the original
2 tables for some reports.
Configs (stored within the configs.yaml file)
db_host Hosting URL for database being used.
db_port Port number for database connection.
db_name Name of database being connected to.
db_user Username of user for database.
db_type Type of database being connected to.
db_path Path of database
schema_name Name of schema being used.
Default is set to 'public'.
table_initial Name of the first table that will be used to create the diff_table.
This can be supplied either here in the config file or as an argument if the --configs argument is set to 'n'.
table_secondary Name of the second table that will be used to create the diff_table.
This can be supplied either here in the config file or as an argument if the --configs argument is set to 'n'.
diff_table_prefix Value that Table-Differ will name the created diff table as. Default is __diff_table__.
key_columns Key columns that Table Differ will query the selected tables by.
This can be supplied either here in the config file or as an argument if the --configs argument is set to 'n'.
This field accepts n number of fields
compare_cols Columns to compare
ignore_cols Columns to ignore
initial_table_alias Placeholder name of the first table being queried in creation of the diff_table.
Default is set to 'origin'.
secondary_table_alias Placeholder name of the second table being queried in creation of the diff_table.
Default is set to 'comparison'.
While Table Differ obviously works very well at comparing a history of a single table, it is not limited to just that. Because of the emphasis on flexibility and usability, Table Differ is designed to be used in any case where you need to see the specific differences between two tables within a database.
Rows that are included into the diff_table have to meet specific conditions:
Note that all rows are compared based on given key columns, and only columns specified (or in the case of ignored columns, not specified) will be looked into for changed. If a change on a row exists but in a column not specified, it will not be added to the diff_table based on that change.
Copyright (c) [2023] [Benjamin Farmer]
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
FAQs
cross-platform table-diffing utility
We found that table-differ demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.
Did you know?
Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.
Security News
CVEForecast.org uses machine learning to project a record-breaking surge in vulnerability disclosures in 2025.
Security News
Browserslist-rs now uses static data to reduce binary size by over 1MB, improving memory use and performance for Rust-based frontend tools.
Research
Security News
Eight new malicious Firefox extensions impersonate games, steal OAuth tokens, hijack sessions, and exploit browser permissions to spy on users.