Huge News!Announcing our $40M Series B led by Abstract Ventures.Learn More
Socket
Sign inDemoInstall
Socket

table-differ

Package Overview
Dependencies
Maintainers
1
Alerts
File Explorer

Advanced tools

Socket logo

Install Socket

Detect and block malicious and high-risk dependencies

Install

table-differ

cross-platform table-diffing utility

  • 0.2.2
  • PyPI
  • Socket score

Maintainers
1

Table-Differ: Multi-Database Supported Table Comparison Utility

Overview

Table Differ is a table comparison utility that aims to assist in validating changes made to large tables within several supported databases. Table Differ provides a table containing the changed and modified rows between two large tables called diff_table. This diff_table is then used to perform simple reporting on the type of changes that were made between the two tables. The primary goal of the Table Differ is to be as maleable as possible in order to be usable in as many situations as possible.

Table Differ is not necessarily intented to be used on a history of two tables, but rather to be used on any two tables with an emphasis on being able to accept any number of key columns to track rows by, along with any number of comparison columns to look for differences within. The columns to compare can also be supplied into the Table Differ by way of 'ignored columns' where any column besides the ones specified will be looked at for when dealing with wide tables.

As of right now the following databases supported by Table Differ are:

  • SQLite
  • PostgreSQL
  • DuckDB
  • DataBricks -> only usable through notebooks and not the command line

The following databases are in the process of being supported:

  • MySql

TABLE OF CONTENTS

  1. Overview

  2. Usage

    • Installation

    • CLI Usage

      • Running Table Differ
      • Connecting to a Database
      • Use Cases of Table Differ
    • Notebook Usage

      • Running Table Differ
      • Example Usage
      • Default Notebook Arguments
    • Data-Profiler

    • Arguments and Configs

    • diff_table capabilities

  3. Credits

  4. License


Usage

Important Notes

While running Table Differ is very straightforward, there are a few important notes to understand about how it works.

  1. Table Differ will attempt to create a new table within the database that it is pulling the two comparison tables from. Depending on your pipeline this could potentially cause issues.

  2. When Table Differ runs, it will attempt to replace any table under the same schema with the name of the specified diff table prefix (defaults to __diff_table__) + the name of the secondary table.

  3. After Table Differ is finished running, it does not drop the table that it has created within your database, and only replaces that table at the beginning of the next run so that the diff table can be queried if needed between usages.

Installation

    pip install table-differ

Note that from within a Databricks notebook, you will need to use a Databricks magic command to run the import statement directly (%)

Command Line Usage

Connecting to a Database

Depending on the database that Table-Differ will attempt to connect to, the process of connection may be slightly different. If using PostgreSQL or MySQL , Table-Differ will automatically search for a .pgpass or .my.cnf file for the required connection details. If using a database locally, Table-Differ accepts an additional optional requirement to use a directory pathing that is pulled from the config.yaml file.


Notebook Usage

Connecting to table-differ

After installing table-differ through pip directly into your notebook environment, you will need to import the required class to be able to make calls.

from Table_Differ.table_differ_notebook import TableDiffer

After importing, table-differ can be called by supplying the TableDiffer class with two arguments:

  1. conn which requires some sort of connection object. This is initially built out in a DataBricks environment, so a SparkSession object would be passed into this field

  2. args requires a dictionary of arguments. This dictionary can be sourced from wherever you like, as long as it contains the required fields. These arguments are outlined below in the Arguments and Configs section of this document.

Example Usage

%python
%pip install table-differ

from Table_Differ.table_differ_notebook import TableDiffer
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .getOrCreate()
)

args = {
    "table_initial": "catalog.schema.table_name",
    "table_secondary": "catalog.schema.table_name",
    "key_cols": ["key_col_1"],
    "comp_cols": ["comp_col_1", "comp_col_2", "comp_col_3",]
}

TableDiffer(conn = spark, args = args)

It's important to note that any field within Arguments or Configs can be added to this arg dictionary, but it needs to be the fully spelled out name and not just the prefix.

Default Arguments for Notebook Usage

# Database
    db_host:                str         = None              -> db arguments are not used in notebook and not required
    db_port:                str         = None
    db_name:                str         = None
    db_user:                str         = None
    db_path:                str         = None
    db_type:                str         = "databricks"

# Table Info
    table_initial:          str                             -> table_initial/secondary are Required
    table_secondary:        str
    table_diff:             str         = "__diff_table__" + str(date.today()) + "__"
    schema_name:            str         = "*******_dev"     -> Actual value hidden. Contact owner for more information
    key_cols:               list[str]                       -> key_cols are Required
    comp_cols:              list[str]   = None              -> comp_cols are Required
    ignore_cols:            list[str]   = None
    initial_table_alias:    str         = "origin"
    secondary_table_alias:  str         = "comparison"
    except_rows:            list[str]   = None

# System
    local_db:               bool        = False
    print_tables:           bool        = False
    report_mode:            bool        = False
    print_query:            bool        = True              -> When set to 'True' table-differ will sys.exit(0) after printing query
    gnome:                  bool        = False
    log_level:              str         = "warning"

Data-Profiler

    Note: This is a new feature and currently being developed further

Default Arguments for Data Profiler

# System
    table_name:                         -> Required
    table_schema:                       -> Required
    table_catalog:          str         -> Required

    where_clause:           str         = None

Arguments and Configs

The developement of Table Differ focuses on making a product that is as flexible and adaptable as possible, and because of this uses several potential arguments.

Example Run of Table-Differ

./main.py -c <col_1> <col_2> <col_n> -d <database> -t <first_table> <second_table> -k <key_col> -l <log_level>

Required Arguments

-k   --key_cols             Key Columns to track rows by within **both** tables.
                            This field takes n number of arguments

Optional Arguments

-c --comparison_cols        columns to be specifically compared.
                            (columns not specified will **not** be looked at when creating the diff table).
                            This field takes n number of arguments.

**OR**

 -i  --ignore_cols          columns to be specifically ignored.
                            (columns not specified **will** be looked at when creating the diff table).
                            This field takes n number of arguments.

     --config-file          Sources arguments from specified file


|
| Note that while --config-file and -c / -i are Optional Arguments, one must be present
|


     --db_type            The type of database that Table Differ will attempt to connect to.
                          Supported DBs: sqlite, postgres, duckdb, databricks

     --except_rows        Rows to be excluded from the creation of diff_table by their key.
                          This exists in case you would like to skip over a specific row.
                          This field take n number of arguments.

-l   --logging_level      Sets the logging level of Table Differ.|debug, info, warning, error, critical|warning|

-p   --print_tables       Prints the tables used in the creation of the diff_table to the console using Rich
                          This is useful in testing and should be used only on very small tables.

     --local-db           Specifies that the db path is to a local database on the running machine

     --report-mode        Flag to only run reporting on most recent __diff_table__.
                          This will not create or replace a new __diff_table__, but will still query the original
                          2 tables for some reports.

Configs (stored within the configs.yaml file)

db_host                   Hosting URL for database being used.

db_port                   Port number for database connection.

db_name                   Name of database being connected to.

db_user                   Username of user for database.

db_type                   Type of database being connected to.

db_path                   Path of database

schema_name               Name of schema being used.
                          Default is set to 'public'.

table_initial             Name of the first table that will be used to create the diff_table.
                          This can be supplied either here in the config file or as an argument if the --configs argument is set to 'n'.

table_secondary           Name of the second table that will be used to create the diff_table.
                          This can be supplied either here in the config file or as an argument if the --configs argument is set to 'n'.

diff_table_prefix         Value that Table-Differ will name the created diff table as. Default is __diff_table__.

key_columns               Key columns that Table Differ will query the selected tables by.
                          This can be supplied either here in the config file or as an argument if the --configs argument is set to 'n'.
                          This field accepts n number of fields

compare_cols              Columns to compare

ignore_cols               Columns to ignore

initial_table_alias        Placeholder name of the first table being queried in creation of the diff_table.
                          Default is set to 'origin'.

secondary_table_alias      Placeholder name of the second table being queried in creation of the diff_table.
                          Default is set to 'comparison'.

Use Cases of Table Differ

While Table Differ obviously works very well at comparing a history of a single table, it is not limited to just that. Because of the emphasis on flexibility and usability, Table Differ is designed to be used in any case where you need to see the specific differences between two tables within a database.


diff_table capabilities

Rows that are included into the diff_table have to meet specific conditions:

  1. Rows that have a changed value from the initial to secondary table
  2. Rows that exist in the initial table but not in the secondary
  3. Rows that exist in the secondary table but not in the initial

Note that all rows are compared based on given key columns, and only columns specified (or in the case of ignored columns, not specified) will be looked into for changed. If a change on a row exists but in a column not specified, it will not be added to the diff_table based on that change.


CREDITS

  1. Benjamin Farmer - San Juan Data LLC. (2023)

License

Copyright (c) [2023] [Benjamin Farmer]

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

FAQs


Did you know?

Socket

Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.

Install

Related posts

SocketSocket SOC 2 Logo

Product

  • Package Alerts
  • Integrations
  • Docs
  • Pricing
  • FAQ
  • Roadmap
  • Changelog

Packages

npm

Stay in touch

Get open source security insights delivered straight into your inbox.


  • Terms
  • Privacy
  • Security

Made with ⚡️ by Socket Inc