The data validation toolkit for teams that care about building better data
Introduction
Recce is a data validation toolkit for pull request (PR) review in dbt projects. Get enhanced visibility into how your team's dbt modeling changes impact data by comparing your dev branch with stable production data. Run manual data checks during development, and automate checks in CI for PR review.
Quick Start
Get up and running quickly by prepping your dev and prod environments. The key is building prod into the target-base folder to use as the base for the data comparison.
# Build prod and generate dbt docs into ./target-base
dbt seed --target prod
dbt run --target prod
dbt docs generate --target prod --target-path ./target-base
# Switch to your dev branch
git switch my-awesome-branch
# Build your dev environment
dbt seed
dbt run
dbt docs generate
# Start a Recce Instance
recce server
We provide three online Recce demos based on Jaffle Shop, each tied to a specific pull request. Use these demos to inspect the data impact caused by the modeling changes in the PR.
For each demo, review the following:
The pull request comment
The code changes
How the lineage and data have changed in Recce
This lets you validate whether the PR achieves its intent without unintended impact.
[!TIP]
Don't forget to click the Checks tab to view the Recce Checklist, and perform your own Checks for further investigation.
Demo 1: Calculation logic change
This pull request adjusts the logic for how customer lifetime value is calculated.
This pull request performs some refactoring on the customers model by turning two CTEs into intermediate models, enhancing readability and maintainability.
dbt has brought many software best practices to data projects, such as:
Version controlled code
Modular SQL
Reproducible pipelines
Even so, 'bad merges' still happen, and erroneous data and silent errors make their way into production. As self-serve analytics opens dbt projects to many roles and dbt projects grow in size, reviewing data modeling changes becomes even more critical.
The only way to understand the impact of code changes on data is to compare the data before-and-after the changes.
Features
Recce provides a data review environment for data teams to check their work during development, and then again as part of PR review. The suite of tools and diffs in Recce is specifically geared towards surfacing, understanding, and recording data impact from code changes.
Lineage Diff
Lineage Diff is the main interface to Recce and shows which nodes in the lineage have been added, removed, or modified.
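As a rough illustration of the idea behind a lineage diff (not Recce's actual implementation), the sketch below classifies nodes as added, removed, or modified by comparing two mappings of node ID to content checksum. The node IDs and checksum strings are made up for the example; in a real project they would come from the base and current dbt artifacts.

```python
def lineage_diff(base_nodes, current_nodes):
    """Classify nodes by comparing two {node_id: checksum} mappings."""
    base_ids = set(base_nodes)
    current_ids = set(current_nodes)
    return {
        "added": sorted(current_ids - base_ids),
        "removed": sorted(base_ids - current_ids),
        # A node present on both sides counts as modified when its checksum differs.
        "modified": sorted(
            node_id
            for node_id in base_ids & current_ids
            if base_nodes[node_id] != current_nodes[node_id]
        ),
    }


# Hypothetical node IDs and checksums, for illustration only.
base = {
    "model.jaffle_shop.customers": "abc",
    "model.jaffle_shop.orders": "def",
}
current = {
    "model.jaffle_shop.customers": "xyz",
    "model.jaffle_shop.orders": "def",
    "model.jaffle_shop.int_payments": "123",
}

lineage_result = lineage_diff(base, current)
print(lineage_result)
```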
Structural Diffs
Schema Diff: Shows the structure of the table, including added or removed columns.
Advanced Diffs provide high-level statistics about data change:
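Conceptually, a schema diff compares the column lists of the same table in two environments. The following is a minimal sketch under that assumption, using plain dictionaries of column name to type; the column names are hypothetical, and reporting type changes is an addition made for this example rather than something the text above states.

```python
def schema_diff(base_columns, current_columns):
    """Compare two {column_name: column_type} mappings for one table."""
    added = [c for c in current_columns if c not in base_columns]
    removed = [c for c in base_columns if c not in current_columns]
    # Extra for illustration: columns kept on both sides whose type changed.
    type_changed = [
        c
        for c in base_columns
        if c in current_columns and base_columns[c] != current_columns[c]
    ]
    return {"added": added, "removed": removed, "type_changed": type_changed}


# Hypothetical columns for a customers table.
base_schema = {"customer_id": "INTEGER", "first_order": "DATE", "total": "INTEGER"}
current_schema = {"customer_id": "INTEGER", "total": "FLOAT", "net_total": "FLOAT"}

schema_result = schema_diff(base_schema, current_schema)
print(schema_result)
```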
Profile Diff: Compares stats such as count, distinct count, min, max, average.
Value Diff: Shows the matched count and percentage for each column in the table.
Top-K Diff: Compares the distribution of a categorical column.
Histogram Diff: Compares the distribution of a numeric column in an overlay histogram chart.
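To make two of these diffs concrete, here is a small sketch of a value diff and a top-K diff over in-memory rows. It assumes rows are matched on a primary key and that top-K categories are ranked by combined frequency; both are illustrative choices, and the function names and sample data are hypothetical rather than Recce's API.

```python
from collections import Counter


def value_diff(base_rows, current_rows, key):
    """Per column, count rows (matched on `key`) whose values are equal."""
    base_by_key = {row[key]: row for row in base_rows}
    current_by_key = {row[key]: row for row in current_rows}
    common_keys = base_by_key.keys() & current_by_key.keys()
    columns = [c for c in (base_rows[0] if base_rows else {}) if c != key]
    result = {}
    for column in columns:
        matched = sum(
            1
            for k in common_keys
            if base_by_key[k][column] == current_by_key[k].get(column)
        )
        percent = 100.0 * matched / len(common_keys) if common_keys else 0.0
        result[column] = (matched, percent)
    return result


def top_k_diff(base_values, current_values, k=3):
    """Compare category counts for the k most frequent categories overall."""
    base_counts = Counter(base_values)
    current_counts = Counter(current_values)
    top = [c for c, _ in (base_counts + current_counts).most_common(k)]
    return {c: (base_counts[c], current_counts[c]) for c in top}


# Hypothetical sample data.
base_rows = [
    {"customer_id": 1, "customer_lifetime_value": 10},
    {"customer_id": 2, "customer_lifetime_value": 20},
]
current_rows = [
    {"customer_id": 1, "customer_lifetime_value": 10},
    {"customer_id": 2, "customer_lifetime_value": 25},
    {"customer_id": 3, "customer_lifetime_value": 5},
]

value_result = value_diff(base_rows, current_rows, key="customer_id")
top_k_result = top_k_diff(["a", "a", "b"], ["a", "b", "b", "c"], k=2)
print(value_result)
print(top_k_result)
```

In this sketch, rows that exist on only one side (customer 3 here) are left out of the match percentage; a real tool may report them separately.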
Query Diff
Query Diff compares the results of any ad-hoc query, and supports the use of dbt macros.
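The idea of running one query against two environments and comparing the results can be sketched with Python's built-in `sqlite3` module. This is only an illustration: the two in-memory databases stand in for the base and current environments, the table and values are hypothetical, and dbt macro support is not modeled.

```python
import sqlite3


def run_query(rows, sql):
    """Load rows into a throwaway in-memory database and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers (customer_id INTEGER, lifetime_value REAL)"
    )
    conn.executemany("INSERT INTO customers VALUES (?, ?)", rows)
    result = conn.execute(sql).fetchall()
    conn.close()
    return result


sql = "SELECT customer_id, lifetime_value FROM customers ORDER BY customer_id"

# Stand-ins for the base (prod) and current (dev) environments.
base_result = run_query([(1, 10.0), (2, 20.0)], sql)
current_result = run_query([(1, 10.0), (2, 25.0)], sql)

# Pair rows positionally; this assumes both results have the same row order.
changed_rows = [(b, c) for b, c in zip(base_result, current_result) if b != c]
print(changed_rows)
```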
Checklist
The checklist provides a way to record the results of your data validation process.
Save the results of checks
Re-run checks
Annotate checks to add context
Share the results of checks
(Recce Cloud) Sync checks and check results across Recce instances
(Recce Cloud) Block PR merging until checks have been approved
Who's using Recce?
Recce is useful for validating your own work or the work of others, and can also be used to share data impact with non-technical stakeholders to approve data checks.
Data engineers can use Recce to ensure the structural integrity of the data and understand the scope of impact before merging.
Analysts can use Recce to self-review and understand how data modeling changes have changed the data.
Stakeholders can use Recce to sign off on data after updates have been made.