Data Contract CLI
The datacontract CLI is an open-source command-line tool for working with data contracts.
It natively supports the Open Data Contract Standard to lint data contracts, connect to data sources and execute schema and quality tests, and export to different formats.
The tool is written in Python.
It can be used as a standalone CLI tool, in a CI/CD pipeline, or directly as a Python library.

Getting started
Let's look at this data contract:
https://datacontract.com/orders-v1.odcs.yaml
It has a servers section with connection details for a Postgres database, a schema section that describes the structure and semantics of the data, and service levels and quality attributes that define the expected freshness and number of rows.
This data contract contains all the information needed to connect to the database and check that the actual data meets the defined schema specification and quality expectations.
We can use this information to test whether the actual data product complies with the data contract.
Let's use uv to install the CLI (or use the Docker image):
$ uv tool install --python python3.11 --upgrade 'datacontract-cli[all]'
Now, let's run the tests:
$ export DATACONTRACT_POSTGRES_USERNAME=datacontract_cli.egzhawjonpfweuutedfy
$ export DATACONTRACT_POSTGRES_PASSWORD=jio10JuQfDfl9JCCPdaCCpuZ1YO
$ datacontract test https://datacontract.com/orders-v1.odcs.yaml
Testing https://datacontract.com/orders-v1.odcs.yaml
Server: production (type=postgres, host=aws-1-eu-central-2.pooler.supabase.com, port=6543, database=postgres, schema=dp_orders_v1)
| Result | Check | Field | Details |
| passed | Check that field 'line_item_id' is present | line_items.line_item_id | |
| passed | Check that field line_item_id has type UUID | line_items.line_item_id | |
| passed | Check that field line_item_id has no missing values | line_items.line_item_id | |
| passed | Check that field 'order_id' is present | line_items.order_id | |
| passed | Check that field order_id has type UUID | line_items.order_id | |
| passed | Check that field 'price' is present | line_items.price | |
| passed | Check that field price has type INTEGER | line_items.price | |
| passed | Check that field price has no missing values | line_items.price | |
| passed | Check that field 'sku' is present | line_items.sku | |
| passed | Check that field sku has type TEXT | line_items.sku | |
| passed | Check that field sku has no missing values | line_items.sku | |
| passed | Check that field 'customer_id' is present | orders.customer_id | |
| passed | Check that field customer_id has type TEXT | orders.customer_id | |
| passed | Check that field customer_id has no missing values | orders.customer_id | |
| passed | Check that field 'order_id' is present | orders.order_id | |
| passed | Check that field order_id has type UUID | orders.order_id | |
| passed | Check that field order_id has no missing values | orders.order_id | |
| passed | Check that unique field order_id has no duplicate values | orders.order_id | |
| passed | Check that field 'order_status' is present | orders.order_status | |
| passed | Check that field order_status has type TEXT | orders.order_status | |
| passed | Check that field 'order_timestamp' is present | orders.order_timestamp | |
| passed | Check that field order_timestamp has type TIMESTAMPTZ | orders.order_timestamp | |
| passed | Check that field 'order_total' is present | orders.order_total | |
| passed | Check that field order_total has type INTEGER | orders.order_total | |
| passed | Check that field order_total has no missing values | orders.order_total | |
🟢 data contract is valid. Run 25 checks. Took 3.938887 seconds.
Voilà, the CLI tested that the YAML itself is valid, all records comply with the schema, and all quality attributes are met.
We can also use the data contract metadata to export in many formats, e.g., to generate a SQL DDL:
$ datacontract export --format sql https://datacontract.com/orders-v1.odcs.yaml
-- Data Contract: orders
-- SQL Dialect: postgres
CREATE TABLE orders (
order_id None not null primary key,
customer_id text not null,
order_total integer not null,
order_timestamp None,
order_status text
);
CREATE TABLE line_items (
line_item_id None not null primary key,
sku text not null,
price integer not null,
order_id None
);
Or generate an HTML export:
$ datacontract export --format html --output orders-v1.odcs.html https://datacontract.com/orders-v1.odcs.yaml
Usage
$ datacontract init odcs.yaml
$ datacontract lint odcs.yaml
$ datacontract test odcs.yaml
$ datacontract export --format html datacontract.yaml --output odcs.html
$ datacontract import --format sql --source my-ddl.sql --dialect postgres --output odcs.yaml
$ datacontract import --format excel --source odcs.xlsx --output odcs.yaml
$ datacontract export --format excel --output odcs.xlsx odcs.yaml
Programmatic (Python)
from datacontract.data_contract import DataContract
data_contract = DataContract(data_contract_file="odcs.yaml")
run = data_contract.test()
if not run.has_passed():
    print("Data quality validation failed.")
How to
Installation
Choose the most appropriate installation method for your needs:
uv
The preferred way to install is uv:
uv tool install --python python3.11 --upgrade 'datacontract-cli[all]'
uvx
If you have uv installed, you can run datacontract-cli directly without installing:
uv run --with 'datacontract-cli[all]' datacontract --version
pip
Python 3.10, 3.11, and 3.12 are supported. We recommend using Python 3.11.
python3 -m pip install 'datacontract-cli[all]'
datacontract --version
pip with venv
Typically it is better to install the application in a virtual environment for your projects:
cd my-project
python3.11 -m venv venv
source venv/bin/activate
pip install 'datacontract-cli[all]'
datacontract --version
pipx
pipx installs into an isolated environment.
pipx install 'datacontract-cli[all]'
datacontract --version
Docker
You can also use our Docker image to run the CLI tool, which is convenient for CI/CD pipelines.
docker pull datacontract/cli
docker run --rm -v ${PWD}:/home/datacontract datacontract/cli
You can create an alias for the Docker command to make it easier to use:
alias datacontract='docker run --rm -v "${PWD}:/home/datacontract" datacontract/cli:latest'
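In a CI/CD pipeline, the same image can run the tests; a minimal sketch, reusing the mount from above (the contract file name is an assumption):
docker run --rm -v ${PWD}:/home/datacontract datacontract/cli test datacontract.yaml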
Note: The output of Docker command-line messages is limited to 80 columns and may include line breaks. Don't pipe the Docker output to files if you want to export content; use the --output option instead.
The CLI tool defines several optional dependencies (also known as extras) that can be installed for use with specific server types.
With all, all server dependencies are included.
uv tool install --python python3.11 --upgrade 'datacontract-cli[all]'
A list of available extras:
| Dependency | Installation |
| Amazon Athena | pip install datacontract-cli[athena] |
| Avro Support | pip install datacontract-cli[avro] |
| Google BigQuery | pip install datacontract-cli[bigquery] |
| Databricks Integration | pip install datacontract-cli[databricks] |
| Iceberg | pip install datacontract-cli[iceberg] |
| Kafka Integration | pip install datacontract-cli[kafka] |
| PostgreSQL Integration | pip install datacontract-cli[postgres] |
| S3 Integration | pip install datacontract-cli[s3] |
| Snowflake Integration | pip install datacontract-cli[snowflake] |
| Microsoft SQL Server | pip install datacontract-cli[sqlserver] |
| Trino | pip install datacontract-cli[trino] |
| Impala | pip install datacontract-cli[impala] |
| dbt | pip install datacontract-cli[dbt] |
| DBML | pip install datacontract-cli[dbml] |
| Parquet | pip install datacontract-cli[parquet] |
| RDF | pip install datacontract-cli[rdf] |
| API (run as web server) | pip install datacontract-cli[api] |
| protobuf | pip install datacontract-cli[protobuf] |
Documentation
Commands
init
Usage: datacontract init [OPTIONS] [LOCATION]
Create an empty data contract.
Arguments:
  location  [LOCATION]  The location of the data contract file to create. [default: datacontract.yaml]

Options:
  --template TEXT                URL of a template or data contract [default: None]
  --overwrite / --no-overwrite   Replace the existing datacontract.yaml [default: no-overwrite]
  --debug / --no-debug           Enable debug logging [default: no-debug]
  --help                         Show this message and exit.
lint
Usage: datacontract lint [OPTIONS] [LOCATION]
Validate that the datacontract.yaml is correctly formatted.
Arguments:
  location  [LOCATION]  The location (url or path) of the data contract yaml. [default: datacontract.yaml]

Options:
  --schema TEXT            The location (url or path) of the ODCS JSON Schema [default: None]
  --output PATH            Specify the file path where the test results should be written to (e.g., './test-results/TEST-datacontract.xml'). If no path is provided, the output will be printed to stdout. [default: None]
  --output-format [junit]  The target format for the test results. [default: None]
  --debug / --no-debug     Enable debug logging [default: no-debug]
  --help                   Show this message and exit.
test
Usage: datacontract test [OPTIONS] [LOCATION]
Run schema and quality tests on configured servers.
Arguments:
  location  [LOCATION]  The location (url or path) of the data contract yaml. [default: datacontract.yaml]

Options:
  --schema TEXT                  The location (url or path) of the ODCS JSON Schema [default: None]
  --server TEXT                  The server configuration to run the schema and quality tests. Use the key of the server object in the data contract yaml file to refer to a server, e.g., `production`, or `all` for all servers (default). [default: all]
  --publish-test-results / --no-publish-test-results
                                 Deprecated. Use the publish parameter. Publish the results after the test. [default: no-publish-test-results]
  --publish TEXT                 The url to publish the results after the test. [default: None]
  --output PATH                  Specify the file path where the test results should be written to (e.g., './test-results/TEST-datacontract.xml'). [default: None]
  --output-format [junit]        The target format for the test results. [default: None]
  --logs / --no-logs             Print logs [default: no-logs]
  --ssl-verification / --no-ssl-verification
                                 SSL verification when publishing the data contract. [default: ssl-verification]
  --debug / --no-debug           Enable debug logging [default: no-debug]
  --help                         Show this message and exit.
Data Contract CLI connects to a data source and runs schema and quality tests to verify that the data contract is valid.
$ datacontract test --server production datacontract.yaml
To connect to a database, the server block in the datacontract.yaml is used to set up the connection.
In addition, credentials, such as usernames and passwords, are provided via environment variables.
The application uses different engines, based on the server type.
Internally, it connects with DuckDB, Spark, or a native connection and executes most tests with soda-core and fastjsonschema.
Supported server types (each described in its own section below):
- s3
- athena
- bigquery
- azure
- sqlserver
- oracle
- databricks
- dataframe
- snowflake
- kafka
- postgres
- trino
- impala
- api
- local
Supported formats:
- parquet
- json
- csv
- delta
- iceberg (coming soon)
Feel free to create an issue if you need support for additional types or formats.
S3
Data Contract CLI can test data that is stored in S3 buckets or any S3-compliant endpoints in various formats.
- CSV
- JSON
- Delta
- Parquet
- Iceberg (coming soon)
Examples
JSON
datacontract.yaml
servers:
  production:
    type: s3
    endpointUrl: https://minio.example.com
    location: s3://bucket-name/path/*/*.json
    format: json
    delimiter: new_line
Delta Tables
datacontract.yaml
servers:
  production:
    type: s3
    endpointUrl: https://minio.example.com
    location: s3://bucket-name/path/table.delta
    format: delta
Environment Variables
DATACONTRACT_S3_REGION | eu-central-1 | Region of S3 bucket |
DATACONTRACT_S3_ACCESS_KEY_ID | AKIAXV5Q5QABCDEFGH | AWS Access Key ID |
DATACONTRACT_S3_SECRET_ACCESS_KEY | 93S7LRrJcqLaaaa/XXXXXXXXXXXXX | AWS Secret Access Key |
DATACONTRACT_S3_SESSION_TOKEN | AQoDYXdzEJr... | AWS temporary session token (optional) |
Athena
Data Contract CLI can test data in AWS Athena stored in S3.
Supports different file formats, such as Iceberg, Parquet, JSON, CSV...
Example
datacontract.yaml
servers:
  athena:
    type: athena
    catalog: awsdatacatalog
    schema: icebergdemodb
    regionName: eu-central-1
    stagingDir: s3://my-bucket/athena-results/
models:
  my_table:
    type: table
    fields:
      my_column_1:
        type: string
        config:
          physicalType: varchar
Environment Variables
DATACONTRACT_S3_REGION | eu-central-1 | Region of Athena service |
DATACONTRACT_S3_ACCESS_KEY_ID | AKIAXV5Q5QABCDEFGH | AWS Access Key ID |
DATACONTRACT_S3_SECRET_ACCESS_KEY | 93S7LRrJcqLaaaa/XXXXXXXXXXXXX | AWS Secret Access Key |
DATACONTRACT_S3_SESSION_TOKEN | AQoDYXdzEJr... | AWS temporary session token (optional) |
Google Cloud Storage (GCS)
The S3 integration also works with files on Google Cloud Storage through its S3 interoperability.
Use https://storage.googleapis.com as the endpoint URL.
Example
datacontract.yaml
servers:
  production:
    type: s3
    endpointUrl: https://storage.googleapis.com
    location: s3://bucket-name/path/*/*.json
    format: json
    delimiter: new_line
Environment Variables
DATACONTRACT_S3_ACCESS_KEY_ID | GOOG1EZZZ... | The GCS HMAC Key Key ID |
DATACONTRACT_S3_SECRET_ACCESS_KEY | PDWWpb... | The GCS HMAC Key Secret |
BigQuery
We support authentication to BigQuery using a Service Account Key. The Service Account used should have the following roles:
- BigQuery Job User
- BigQuery Data Viewer
Example
datacontract.yaml
servers:
  production:
    type: bigquery
    project: datameshexample-product
    dataset: datacontract_cli_test_dataset
models:
  datacontract_cli_test_table:
    type: table
    fields: ...
Environment Variables
DATACONTRACT_BIGQUERY_ACCOUNT_INFO_JSON_PATH | ~/service-access-key.json | Service Access key as saved on key creation by BigQuery. If this environment variable isn't set, the cli tries to use GOOGLE_APPLICATION_CREDENTIALS as a fallback, so if you have that set for using their Python library anyway, it should work seamlessly. |
Azure
Data Contract CLI can test data that is stored in Azure Blob storage or Azure Data Lake Storage (Gen2) (ADLS) in various formats.
Example
datacontract.yaml
servers:
  production:
    type: azure
    storageAccount: datameshdatabricksdemo
    location: abfss://dataproducts/inventory_events/*.parquet
    format: parquet
Environment Variables
Authentication works with an Azure Service Principal (SPN) aka App Registration with a secret.
DATACONTRACT_AZURE_TENANT_ID | 79f5b80f-10ff-40b9-9d1f-774b42d605fc | The Azure Tenant ID |
DATACONTRACT_AZURE_CLIENT_ID | 3cf7ce49-e2e9-4cbc-a922-4328d4a58622 | The ApplicationID / ClientID of the app registration |
DATACONTRACT_AZURE_CLIENT_SECRET | yZK8Q~GWO1MMXXXXXXXXXXXXX | The Client Secret value |
Sqlserver
Data Contract CLI can test data in MS SQL Server (including Azure SQL, Synapse Analytics SQL Pool).
Example
datacontract.yaml
servers:
  production:
    type: sqlserver
    host: localhost
    port: 5432
    database: tempdb
    schema: dbo
    driver: ODBC Driver 18 for SQL Server
models:
  my_table_1:
    type: table
    fields:
      my_column_1:
        type: varchar
Environment Variables
DATACONTRACT_SQLSERVER_USERNAME | root | Username |
DATACONTRACT_SQLSERVER_PASSWORD | toor | Password |
DATACONTRACT_SQLSERVER_TRUSTED_CONNECTION | True | Use Windows authentication instead of username/password login |
DATACONTRACT_SQLSERVER_TRUST_SERVER_CERTIFICATE | True | Trust self-signed certificate |
DATACONTRACT_SQLSERVER_ENCRYPTED_CONNECTION | True | Use SSL |
DATACONTRACT_SQLSERVER_DRIVER | ODBC Driver 18 for SQL Server | ODBC driver name |
Oracle
Data Contract CLI can test data in Oracle Database.
Example
datacontract.yaml
servers:
  oracle:
    type: oracle
    host: localhost
    port: 1521
    service_name: ORCL
    schema: ADMIN
models:
  my_table_1:
    type: table
    fields:
      my_column_1:
        type: decimal
        description: Decimal number
      my_column_2:
        type: text
        description: Unicode text string
        config:
          oracleType: NVARCHAR2
Environment Variables
These environment variables specify the credentials used by the datacontract tool to connect to the database.
If you've started the database from a container, e.g. oracle-free, these should match either system and what you specified as ORACLE_PASSWORD on the container, or alternatively what you've specified under APP_USER and APP_USER_PASSWORD.
If you require thick mode to connect to the database, you need an Oracle Instant Client installed on the system and must specify the path to the installation in the environment variable DATACONTRACT_ORACLE_CLIENT_DIR.
DATACONTRACT_ORACLE_USERNAME | system | Username |
DATACONTRACT_ORACLE_PASSWORD | 0x162e53 | Password |
DATACONTRACT_ORACLE_CLIENT_DIR | C:\oracle\client | Path to Oracle Instant Client installation |
Databricks
Works with Unity Catalog and Hive metastore.
Needs a running SQL warehouse or compute cluster.
Example
datacontract.yaml
servers:
  production:
    type: databricks
    catalog: acme_catalog_prod
    schema: orders_latest
models:
  orders:
    type: table
    fields: ...
Environment Variables
DATACONTRACT_DATABRICKS_TOKEN | dapia00000000000000000000000000000 | The personal access token to authenticate |
DATACONTRACT_DATABRICKS_HTTP_PATH | /sql/1.0/warehouses/b053a3ffffffff | The HTTP path to the SQL warehouse or compute cluster |
DATACONTRACT_DATABRICKS_SERVER_HOSTNAME | dbc-abcdefgh-1234.cloud.databricks.com | The host name of the SQL warehouse or compute cluster |
Databricks (programmatic)
Works with Unity Catalog and Hive metastore.
When running in a notebook or pipeline, the provided Spark session can be used; additional authentication is not required.
Requires a Databricks Runtime with Python >= 3.10.
Example
datacontract.yaml
servers:
  production:
    type: databricks
    host: dbc-abcdefgh-1234.cloud.databricks.com
    catalog: acme_catalog_prod
    schema: orders_latest
models:
  orders:
    type: table
    fields: ...
Installing on Databricks Compute
Important: When using Databricks LTS ML runtimes (15.4, 16.4), installing via %pip install in notebooks can cause issues.
Recommended approach: Use Databricks' native library management instead:
1. Create or configure your compute cluster:
   - Navigate to Compute in the Databricks workspace
   - Create a new cluster or select an existing one
   - Go to the Libraries tab
2. Add the datacontract-cli library:
   - Click Install new
   - Select PyPI as the library source
   - Enter the package name: datacontract-cli[databricks]
   - Click Install
3. Restart the cluster to apply the library installation
4. Use in your notebook without additional installation:

from datacontract.data_contract import DataContract

data_contract = DataContract(
    data_contract_file="/Volumes/acme_catalog_prod/orders_latest/datacontract/datacontract.yaml",
    spark=spark)
run = data_contract.test()
run.result
Databricks' library management properly resolves dependencies during cluster initialization, rather than at runtime in the notebook.
Dataframe (programmatic)
Works with Spark DataFrames.
DataFrames need to be created as named temporary views.
Multiple temporary views are supported if your data contract contains multiple models.
Testing DataFrames is useful to test your datasets in a pipeline before writing them to a data source.
Example
datacontract.yaml
servers:
  production:
    type: dataframe
models:
  my_table:
    type: table
    fields: ...
Example code
from datacontract.data_contract import DataContract
df.createOrReplaceTempView("my_table")
data_contract = DataContract(
    data_contract_file="datacontract.yaml",
    spark=spark,
)
run = data_contract.test()
assert run.result == "passed"
Snowflake
Data Contract CLI can test data in Snowflake.
Example
datacontract.yaml
servers:
  snowflake:
    type: snowflake
    account: abcdefg-xn12345
    database: ORDER_DB
    schema: ORDERS_PII_V2
models:
  my_table_1:
    type: table
    fields:
      my_column_1:
        type: varchar
Environment Variables
All connection parameters supported by Soda are read from environment variables, uppercased and prefixed with DATACONTRACT_SNOWFLAKE_.
For example:
username | DATACONTRACT_SNOWFLAKE_USERNAME |
password | DATACONTRACT_SNOWFLAKE_PASSWORD |
warehouse | DATACONTRACT_SNOWFLAKE_WAREHOUSE |
role | DATACONTRACT_SNOWFLAKE_ROLE |
connection_timeout | DATACONTRACT_SNOWFLAKE_CONNECTION_TIMEOUT |
Note that the parameters account, database, and schema are taken from the servers section of the YAML file, e.g., from the example above:
servers:
  snowflake:
    account: abcdefg-xn12345
    database: ORDER_DB
    schema: ORDERS_PII_V2
Kafka
Kafka support is currently considered experimental.
Example
datacontract.yaml
servers:
  production:
    type: kafka
    host: abc-12345.eu-central-1.aws.confluent.cloud:9092
    topic: my-topic-name
    format: json
Environment Variables
DATACONTRACT_KAFKA_SASL_USERNAME | xxx | The SASL username (key). |
DATACONTRACT_KAFKA_SASL_PASSWORD | xxx | The SASL password (secret). |
DATACONTRACT_KAFKA_SASL_MECHANISM | PLAIN | Default PLAIN. Other supported mechanisms: SCRAM-SHA-256 and SCRAM-SHA-512 |
Postgres
Data Contract CLI can test data in Postgres or Postgres-compliant databases (e.g., RisingWave).
Example
datacontract.yaml
servers:
  postgres:
    type: postgres
    host: localhost
    port: 5432
    database: postgres
    schema: public
models:
  my_table_1:
    type: table
    fields:
      my_column_1:
        type: varchar
Environment Variables
DATACONTRACT_POSTGRES_USERNAME | postgres | Username |
DATACONTRACT_POSTGRES_PASSWORD | mysecretpassword | Password |
Trino
Data Contract CLI can test data in Trino.
Example
datacontract.yaml
servers:
  trino:
    type: trino
    host: localhost
    port: 8080
    catalog: my_catalog
    schema: my_schema
models:
  my_table_1:
    type: table
    fields:
      my_column_1:
        type: varchar
      my_column_2:
        type: object
        config:
          trinoType: row(en_us varchar, pt_br varchar)
Environment Variables
DATACONTRACT_TRINO_USERNAME | trino | Username |
DATACONTRACT_TRINO_PASSWORD | mysecretpassword | Password |
Impala
Data Contract CLI can run Soda checks against an Apache Impala cluster.
Example
datacontract.yaml
servers:
  impala:
    type: impala
    host: my-impala-host
    port: 443
    database: my_database
models:
  my_table_1:
    type: table
Environment Variables
DATACONTRACT_IMPALA_USERNAME | analytics_user | Username used to connect to Impala |
DATACONTRACT_IMPALA_PASSWORD | mysecretpassword | Password for the Impala user |
DATACONTRACT_IMPALA_USE_SSL | true | Whether to use SSL; defaults to true if unset |
DATACONTRACT_IMPALA_AUTH_MECHANISM | LDAP | Authentication mechanism; defaults to LDAP |
DATACONTRACT_IMPALA_USE_HTTP_TRANSPORT | true | Whether to use the HTTP transport; defaults to true |
DATACONTRACT_IMPALA_HTTP_PATH | cliservice | HTTP path for the Impala service; defaults to cliservice |
Type-mapping note (logicalType → Impala type)
If physicalType is not specified in the schema, we recommend the following mapping from logicalType to Impala column types:
integer | INT or BIGINT |
number | DOUBLE/decimal(..) |
string | STRING or VARCHAR |
boolean | BOOLEAN |
date | DATE |
datetime | TIMESTAMP |
This keeps the Impala schema compatible with the expectations of the Soda checks generated by datacontract-cli.
API
Data Contract CLI can test APIs that return data in JSON format.
Currently, only GET requests are supported.
Example
datacontract.yaml
servers:
  api:
    type: "api"
    location: "https://api.example.com/path"
    delimiter: none
models:
  my_object:
    type: object
    fields:
      field1:
        type: string
      fields2:
        type: number
Environment Variables
DATACONTRACT_API_HEADER_AUTHORIZATION | Bearer <token> | The value for the authorization header. Optional. |
Local
Data Contract CLI can test local files in parquet, json, csv, or delta format.
Example
datacontract.yaml
servers:
  local:
    type: local
    path: ./*.parquet
    format: parquet
models:
  my_table_1:
    type: table
    fields:
      my_column_1:
        type: varchar
      my_column_2:
        type: string
export
Usage: datacontract export [OPTIONS] [LOCATION]
Convert data contract to a specific format. Saves to file specified by `output` option if present,
otherwise prints to stdout.
Arguments:
  location  [LOCATION]  The location (url or path) of the data contract yaml. [default: datacontract.yaml]

Options:
  --format [jsonschema|pydantic-model|sodacl|dbt|dbt-sources|dbt-staging-sql|odcs|rdf|avro|protobuf|great-expectations|avro-idl|sql|sql-query|mermaid|html|go|bigquery|dbml|spark|sqlalchemy|data-caterer|dcs|markdown|iceberg|custom|excel|dqx]
                          The export format. [default: None] [required]
  --output PATH           Specify the file path where the exported data will be saved. If no path is provided, the output will be printed to stdout. [default: None]
  --server TEXT           The server name to export. [default: None]
  --schema-name TEXT      The name of the schema to export, e.g., `orders`, or `all` for all schemas (default). [default: all]
  --schema TEXT           The location (url or path) of the ODCS JSON Schema [default: None]
  --engine TEXT           [engine] The engine used for the great expectations run. [default: None]
  --template PATH         The file path or URL of a template. For Excel format: path/URL to a custom Excel template. For custom format: path to a Jinja template. [default: None]
  --debug / --no-debug    Enable debug logging [default: no-debug]
  --help                  Show this message and exit.

RDF Options:
  --rdf-base TEXT         [rdf] The base URI used to generate the RDF graph. [default: None]

SQL Options:
  --sql-server-type TEXT  [sql] The server type to determine the sql dialect. By default, it uses 'auto' to automatically detect the sql dialect via the specified servers in the data contract. [default: auto]
datacontract export --format html --output datacontract.html
Available export options:
| Format | Description | Status |
| html | Export to HTML | ✅ |
| jsonschema | Export to JSON Schema | ✅ |
| odcs | Export to Open Data Contract Standard (ODCS) V3 | ✅ |
| sodacl | Export to SodaCL quality checks in YAML format | ✅ |
| dbt | Export to dbt models in YAML format | ✅ |
| dbt-sources | Export to dbt sources in YAML format | ✅ |
| dbt-staging-sql | Export to dbt staging SQL models | ✅ |
| rdf | Export data contract to RDF representation in N3 format | ✅ |
| avro | Export to AVRO models | ✅ |
| protobuf | Export to Protobuf | ✅ |
| terraform | Export to terraform resources | ✅ |
| sql | Export to SQL DDL | ✅ |
| sql-query | Export to SQL Query | ✅ |
| great-expectations | Export to Great Expectations Suites in JSON Format | ✅ |
| bigquery | Export to BigQuery Schemas | ✅ |
| go | Export to Go types | ✅ |
| pydantic-model | Export to pydantic models | ✅ |
| DBML | Export to a DBML Diagram description | ✅ |
| spark | Export to a Spark StructType | ✅ |
| sqlalchemy | Export to SQLAlchemy Models | ✅ |
| data-caterer | Export to Data Caterer in YAML format | ✅ |
| dcs | Export to Data Contract Specification in YAML format | ✅ |
| markdown | Export to Markdown | ✅ |
| iceberg | Export to an Iceberg JSON Schema Definition | partial |
| excel | Export to ODCS Excel Template | ✅ |
| custom | Export to Custom format with Jinja | ✅ |
| dqx | Export to DQX in YAML format | ✅ |
| Missing something? | Please create an issue on GitHub | TBD |
SQL
The export function converts a given data contract into a SQL data definition language (DDL).
datacontract export datacontract.yaml --format sql --output output.sql
If you are using Databricks and an error is thrown when trying to deploy the SQL DDLs with variant columns, set the following property:
spark.conf.set("spark.databricks.delta.schema.typeCheck.enabled", "false")
Great Expectations
The export function transforms a specified data contract into a comprehensive Great Expectations JSON suite.
If the contract includes multiple models, you need to specify the names of the schema/models you wish to export.
datacontract export datacontract.yaml --format great-expectations --model orders
The export creates a list of expectations by utilizing:
- The data from the Model definition with a fixed mapping
- The expectations provided in the quality field for each model (see the Great Expectations Gallery for available expectations)
Additional Arguments
To further customize the export, the following optional arguments are available (a programmatic sketch follows the list):
- suite_name: The name of the expectation suite. This suite groups all generated expectations and provides a convenient identifier within Great Expectations. If not provided, a default suite name will be generated based on the model name(s).
- engine: Specifies the engine used to run Great Expectations checks. Accepted values are:
  - pandas: Use this when working with in-memory data frames through the Pandas library.
  - spark: Use this for working with Spark dataframes.
  - sql: Use this for working with SQL databases.
- sql_server_type: Specifies the type of SQL server to connect with when engine is set to sql. Providing sql_server_type ensures that the appropriate SQL dialect and connection settings are applied during the expectation validation.
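A minimal programmatic sketch, assuming these arguments are forwarded as keyword arguments to DataContract.export() (as in the custom exporter example later in this document):

from datacontract.data_contract import DataContract

data_contract = DataContract(data_contract_file="datacontract.yaml")
# suite_name and engine are the optional arguments described above; passing them
# as keyword arguments is an assumption based on the export API
suite = data_contract.export(
    export_format="great-expectations",
    model="orders",
    suite_name="orders_suite",
    engine="pandas",
)
print(suite)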
RDF
The export function converts a given data contract into an RDF representation. You have the option to add a base_url which will be used as the default prefix to resolve relative IRIs inside the document.
datacontract export --format rdf --rdf-base https://www.example.com/ datacontract.yaml
The data contract is mapped onto the following concepts of a yet-to-be-defined Data Contract Ontology named https://datacontract.com/DataContractSpecification/:
- DataContract
- Server
- Model
Having the data contract inside an RDF graph gives us access to the following use cases:
- Interoperability with other data contract specification formats
- Store data contracts inside a knowledge graph
- Enhance a semantic search to find and retrieve data contracts
- Linking model elements to already established ontologies and knowledge
- Using full power of OWL to reason about the graph structure of data contracts
- Apply graph algorithms on multiple data contracts (find similar data contracts, find "gatekeeper" data products, find the true domain owner of a field attribute)
DBML
The export function converts the logical data types of the data contract into the specific types of a concrete database if a server is selected via the --server option (based on the type of that server). If no server is selected, the logical data types are exported.
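For example, to export with the data types of a concrete database (the server name production is an assumption):
datacontract export --format dbml --server production datacontract.yaml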
DBT & DBT-SOURCES
The export function converts the data contract to dbt models in YAML format, with support for SQL dialects.
If a server is selected via the --server option (based on the type of that server), the dbt column data_types match the expected data types of that server.
If no server is selected, it defaults to Snowflake.
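For example, to generate dbt models with the data types of a configured server (the server name production is an assumption):
datacontract export --format dbt --server production datacontract.yaml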
Spark
The export function converts the data contract specification into a Spark StructType schema. The returned value is Python code that defines the schema of each model.
A Spark DataFrame schema is defined as a StructType. For more details about Spark data types, please see the Spark documentation.
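For example, to print the generated StructType code to stdout:
datacontract export --format spark datacontract.yaml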
Avro
The export function converts the data contract specification into an Avro schema. It supports specifying custom Avro properties for logicalTypes and default values.
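For example, to print the generated Avro schema to stdout:
datacontract export --format avro datacontract.yaml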
Custom Avro Properties
We support a config map on field level. A config map may include any additional key-value pairs and supports multiple server type bindings.
To specify custom Avro properties in your data contract, define them within the config section of your field definition. Below is an example of how to structure your YAML configuration to include custom Avro properties, such as avroLogicalType and avroDefault.
NOTE: At the moment, only logicalType and default are supported.
Example Configuration
models:
  orders:
    fields:
      my_field_1:
        description: Example for AVRO with Timestamp (microsecond precision) https://avro.apache.org/docs/current/spec.html#Local+timestamp+%28microsecond+precision%29
        type: long
        example: 1672534861000000
        required: true
        config:
          avroLogicalType: local-timestamp-micros
          avroDefault: 1672534861000000
Explanation
- models: The top-level key that contains different models (tables or objects) in your data contract.
- orders: A specific model name. Replace this with the name of your model.
- fields: The fields within the model. Each field can have various properties defined.
- my_field_1: The name of a specific field. Replace this with your field name.
- description: A textual description of the field.
- type: The data type of the field. In this example, it is long.
- example: An example value for the field.
- required: Is this a required field (as opposed to optional/nullable).
- config: Section to specify custom Avro properties.
- avroLogicalType: Specifies the logical type of the field in Avro. In this example, it is local-timestamp-micros.
- avroDefault: Specifies the default value for the field in Avro. In this example, it is 1672534861000000, which corresponds to 2023-01-01 01:01:01 UTC.
Data Caterer
The export function converts the data contract to a data generation task in YAML format that can be ingested by Data Caterer. This gives you the ability to generate production-like data in any environment based on your data contract.
datacontract export datacontract.yaml --format data-caterer --model orders
You can further customise the way data is generated by adding additional metadata in the YAML to suit your needs.
Iceberg
Exports to an Iceberg Table JSON Schema Definition.
This export only supports a single model at a time because Iceberg's schema definition describes a single table and the exporter maps one model to one table. Use the --model flag to limit your contract export to a single model.
$ datacontract export --format iceberg --model orders https://datacontract.com/examples/orders-latest/datacontract.yaml --output /tmp/orders_iceberg.json
$ cat /tmp/orders_iceberg.json | jq '.'
{
"type": "struct",
"fields": [
{
"id": 1,
"name": "order_id",
"type": "string",
"required": true
},
{
"id": 2,
"name": "order_timestamp",
"type": "timestamptz",
"required": true
},
{
"id": 3,
"name": "order_total",
"type": "long",
"required": true
},
{
"id": 4,
"name": "customer_id",
"type": "string",
"required": false
},
{
"id": 5,
"name": "customer_email_address",
"type": "string",
"required": true
},
{
"id": 6,
"name": "processed_timestamp",
"type": "timestamptz",
"required": true
}
],
"schema-id": 0,
"identifier-field-ids": [
1
]
}
Custom
The export function converts the data contract specification into the custom format with Jinja. You can specify the path to a Jinja template with the --template argument, allowing you to output files in any format.
datacontract export --format custom --template template.txt datacontract.yaml
Jinja variables
You can directly use the Data Contract Specification as template variables.
$ cat template.txt
title: {{ data_contract.info.title }}
$ datacontract export --format custom --template template.txt datacontract.yaml
title: Orders Latest
Example Jinja Templates
Customized dbt model
You can export dbt models containing any logic.
Below is an example of a dbt staging layer that converts a field of type: timestamp to a DATETIME type with time zone conversion.
template.sql
{% raw %}
{%- for model_name, model in data_contract.models.items() %}
{#- Export only the first model #}
{%- if loop.first -%}
SELECT
{%- for field_name, field in model.fields.items() %}
{%- if field.type == "timestamp" %}
DATETIME({{ field_name }}, "Asia/Tokyo") AS {{ field_name }},
{%- else %}
{{ field_name }} AS {{ field_name }},
{%- endif %}
{%- endfor %}
FROM
{{ "{{" }} ref('{{ model_name }}') {{ "}}" }}
{%- endif %}
{%- endfor %}
{% endraw %}
command
datacontract export --format custom --template template.sql --output output.sql datacontract.yaml
output.sql
SELECT
order_id AS order_id,
DATETIME(order_timestamp, "Asia/Tokyo") AS order_timestamp,
order_total AS order_total,
customer_id AS customer_id,
customer_email_address AS customer_email_address,
DATETIME(processed_timestamp, "Asia/Tokyo") AS processed_timestamp,
FROM
{{ ref('orders') }}
ODCS Excel Template
The export function converts a data contract into an ODCS (Open Data Contract Standard) Excel template. This creates a user-friendly Excel spreadsheet that can be used for authoring, sharing, and managing data contracts using the familiar Excel interface.
datacontract export --format excel --output datacontract.xlsx datacontract.yaml
The Excel format enables:
- User-friendly authoring: Create and edit data contracts in Excel's familiar interface
- Easy sharing: Distribute data contracts as standard Excel files
- Collaboration: Enable non-technical stakeholders to contribute to data contract definitions
- Round-trip conversion: Import Excel templates back to YAML data contracts
For more information about the Excel template structure, visit the ODCS Excel Template repository.
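A round-trip sketch, using the export and import commands from the Usage section above (the file names are assumptions):
datacontract export --format excel --output odcs.xlsx odcs.yaml
datacontract import --format excel --source odcs.xlsx --output odcs_roundtrip.yaml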
import
Usage: datacontract import [OPTIONS]
Create a data contract from the given source location. Saves to file specified by `output` option
if present, otherwise prints to stdout.
Options:
  --format [sql|avro|dbt|dbml|glue|jsonschema|json|bigquery|odcs|unity|spark|iceberg|parquet|csv|protobuf|excel]
                                The format of the source file. [default: None] [required]
  --output PATH                 Specify the file path where the Data Contract will be saved. If no path is provided, the output will be printed to stdout. [default: None]
  --source TEXT                 The path to the file that should be imported. [default: None]
  --dialect TEXT                The SQL dialect to use when importing SQL files, e.g., postgres, tsql, bigquery. [default: None]
  --glue-table TEXT             List of table ids to import from the Glue Database (repeat for multiple table ids, leave empty for all tables in the dataset). [default: None]
  --bigquery-project TEXT       The bigquery project id. [default: None]
  --bigquery-dataset TEXT       The bigquery dataset id. [default: None]
  --bigquery-table TEXT         List of table ids to import from the bigquery API (repeat for multiple table ids, leave empty for all tables in the dataset). [default: None]
  --unity-table-full-name TEXT  Full name of a table in the unity catalog [default: None]
  --dbt-model TEXT              List of model names to import from the dbt manifest file (repeat for multiple model names, leave empty for all models in the dataset). [default: None]
  --dbml-schema TEXT            List of schema names to import from the DBML file (repeat for multiple schema names, leave empty for all tables in the file). [default: None]
  --dbml-table TEXT             List of table names to import from the DBML file (repeat for multiple table names, leave empty for all tables in the file). [default: None]
  --iceberg-table TEXT          Table name to assign to the model created from the Iceberg schema. [default: None]
  --template TEXT               The location (url or path) of the ODCS template [default: None]
  --schema TEXT                 The location (url or path) of the ODCS JSON Schema [default: None]
  --owner TEXT                  The owner or team responsible for managing the data contract. [default: None]
  --id TEXT                     The identifier for the data contract. [default: None]
  --debug / --no-debug          Enable debug logging [default: no-debug]
  --help                        Show this message and exit.
Example:
datacontract import --format sql --source my_ddl.sql --dialect postgres
datacontract import --format sql --source my_ddl.sql --dialect postgres --output datacontract.yaml
Available import options:
| Format | Description | Status |
| avro | Import from AVRO schemas | ✅ |
| bigquery | Import from BigQuery Schemas | ✅ |
| csv | Import from CSV File | ✅ |
| dbml | Import from DBML models | ✅ |
| dbt | Import from dbt models | ✅ |
| excel | Import from ODCS Excel Template | ✅ |
| glue | Import from AWS Glue DataCatalog | ✅ |
| iceberg | Import from an Iceberg JSON Schema Definition | partial |
| jsonschema | Import from JSON Schemas | ✅ |
| parquet | Import from Parquet File Metadata | ✅ |
| protobuf | Import from Protobuf schemas | ✅ |
| spark | Import from Spark StructTypes, Variant | ✅ |
| sql | Import from SQL DDL | ✅ |
| unity | Import from Databricks Unity Catalog | partial |
| Missing something? | Please create an issue on GitHub | TBD |
BigQuery
BigQuery data can either be imported from JSON files generated from the table descriptions or directly from the BigQuery API. If you want to use JSON files, specify the source parameter with a path to the JSON file.
To import from the BigQuery API, omit source and instead provide bigquery-project and bigquery-dataset. Additionally, you may specify bigquery-table to enumerate the tables that should be imported. If no tables are given, all available tables of the dataset will be imported.
For authenticating the client, please see the Google documentation or the one about authorizing client libraries.
Examples:
datacontract import --format bigquery --source my_bigquery_table.json
datacontract import --format bigquery --bigquery-project <project_id> --bigquery-dataset <dataset_id> --bigquery-table <tableid_1> --bigquery-table <tableid_2> --bigquery-table <tableid_3>
datacontract import --format bigquery --bigquery-project <project_id> --bigquery-dataset <dataset_id>
Unity Catalog
Importing from a JSON file that contains the table definition:
datacontract import --format unity --source my_unity_table.json
Importing directly from the Unity Catalog API:
export DATACONTRACT_DATABRICKS_SERVER_HOSTNAME="https://xyz.cloud.databricks.com"
export DATACONTRACT_DATABRICKS_TOKEN=<token>
datacontract import --format unity --unity-table-full-name <table_full_name>
Alternatively, you can use a Databricks configuration profile; please refer to the Databricks documentation on how to set up a profile:
export DATACONTRACT_DATABRICKS_PROFILE="my-profile"
datacontract import --format unity --unity-table-full-name <table_full_name>
dbt
Importing from a dbt manifest file.
You may give the dbt-model parameter to enumerate the models that should be imported. If no models are given, all available models in the manifest will be imported.
Examples:
datacontract import --format dbt --source <manifest_path> --dbt-model <model_name_1> --dbt-model <model_name_2> --dbt-model <model_name_3>
datacontract import --format dbt --source <manifest_path>
Excel
Importing from ODCS Excel Template.
Examples:
datacontract import --format excel --source odcs.xlsx
Glue
Importing from Glue reads the necessary data directly from the AWS API.
You may give the glue-table parameter to enumerate the tables that should be imported. If no tables are given, all available tables of the database will be imported.
Examples:
datacontract import --format glue --source <database_name> --glue-table <table_name_1> --glue-table <table_name_2> --glue-table <table_name_3>
datacontract import --format glue --source <database_name>
Spark
When importing from a Spark table or view, it must be created or accessible in the Spark context. Specify the tables as a comma-separated list in the source parameter. If the source tables are registered as tables in Databricks and they have table-level descriptions, these will also be added to the Data Contract Specification.
datacontract import --format spark --source "users,orders"
DataContract.import_from_source("spark", "users")
DataContract.import_from_source(format = "spark", source = "users")
DataContract.import_from_source("spark", "users", dataframe = df_user)
DataContract.import_from_source(format = "spark", source = "users", dataframe = df_user)
DataContract.import_from_source("spark", "users", description = "description")
DataContract.import_from_source(format = "spark", source = "users", description = "description")
DataContract.import_from_source("spark", "users", dataframe = df_user, description = "description")
DataContract.import_from_source(format = "spark", source = "users", dataframe = df_user, description = "description")
DBML
Importing from DBML documents.
NOTE: Since DBML does not have strict requirements on column types, this import may create invalid data contracts, as not all field types can be properly mapped. In this case you will have to adapt the generated document manually.
We also assume that the description for models and fields is stored in a Note within the DBML model.
You may give the dbml-table or dbml-schema parameter to enumerate the tables or schemas that should be imported.
If no tables are given, all available tables of the source will be imported. Likewise, if no schemas are given, all schemas are imported.
Examples:
datacontract import --format dbml --source <file_path>
datacontract import --format dbml --source <file_path> --dbml-schema <schema_1> --dbml-schema <schema_2>
datacontract import --format dbml --source <file_path> --dbml-table <table_name_1> --dbml-table <table_name_2>
datacontract import --format dbml --source <file_path> --dbml-table <table_name_1> --dbml-schema <schema_1>
Iceberg
Importing from an Iceberg Table JSON Schema Definition. Specify the location of the JSON file using the source parameter.
Examples:
datacontract import --format iceberg --source ./tests/fixtures/iceberg/simple_schema.json --iceberg-table test-table
CSV
Importing from a CSV file. Specify the file in the source parameter. Encoding and CSV dialect are auto-detected.
Example:
datacontract import --format csv --source "test.csv"
protobuf
Importing from a Protobuf file. Specify the file in the source parameter.
Example:
datacontract import --format protobuf --source "test.proto"
catalog
Usage: datacontract catalog [OPTIONS]
Create an HTML catalog of data contracts.

Options:
  --files TEXT          Glob pattern for the data contract files to include in the catalog. Applies recursively to any subfolders. [default: *.yaml]
  --output TEXT         Output directory for the catalog html files. [default: catalog/]
  --schema TEXT         The location (url or path) of the ODCS JSON Schema [default: None]
  --debug / --no-debug  Enable debug logging [default: no-debug]
  --help                Show this message and exit.
Examples:
# create a catalog right in the current folder
datacontract catalog --output "."
# Create a catalog based on a filename convention
datacontract catalog --files "*.odcs.yaml"
publish
Usage: datacontract publish [OPTIONS] [LOCATION]
Publish the data contract to Entropy Data.

Arguments:
  location  [LOCATION]  The location (url or path) of the data contract yaml. [default: datacontract.yaml]

Options:
  --schema TEXT         The location (url or path) of the ODCS JSON Schema [default: None]
  --ssl-verification / --no-ssl-verification
                        SSL verification when publishing the data contract. [default: ssl-verification]
  --debug / --no-debug  Enable debug logging [default: no-debug]
  --help                Show this message and exit.
api
Usage: datacontract api [OPTIONS]
Start the datacontract CLI as server application with REST API.
The OpenAPI documentation as Swagger UI is available on http://localhost:4242. You can execute the
commands directly from the Swagger UI.
To protect the API, you can set the environment variable DATACONTRACT_CLI_API_KEY to a secret API
key. To authenticate, requests must include the header 'x-api-key' with the correct API key. This
is highly recommended, as data contract tests may be subject to SQL injections or leak sensitive
information.
To connect to servers (such as a Snowflake data source), set the credentials as environment
variables as documented in https://cli.datacontract.com/#test
It is possible to run the API with extra arguments for `uvicorn.run()` as keyword arguments, e.g.:
`datacontract api --port 1234 --root_path /datacontract`.
Options:
  --port INTEGER        Bind socket to this port. [default: 4242]
  --host TEXT           Bind socket to this host. Hint: For running in docker, set it to 0.0.0.0 [default: 127.0.0.1]
  --debug / --no-debug  Enable debug logging [default: no-debug]
  --help                Show this message and exit.
Integrations
Integration with Entropy Data
If you use Entropy Data, you can use the data contract URL to reference the contract and append the --publish option to send and display the test results. Set an environment variable for your API key.
$ export ENTROPY_DATA_API_KEY=xxx
$ datacontract test https://demo.entropy-data.com/demo279750347121/datacontracts/4df9d6ee-e55d-4088-9598-b635b2fdcbbc/datacontract.yaml \
--server production \
--publish https://api.entropy-data.com/api/test-results
Best Practices
We share best practices for using the Data Contract CLI.
Data-first Approach
Create a data contract based on the actual data. This is the fastest way to get started and to get feedback from the data consumers.
1. Use an existing physical schema (e.g., SQL DDL) as a starting point to define your logical data model in the contract. Double-check right after the import whether the actual data meets the imported logical data model. Just to be sure.
   $ datacontract import --format sql --source ddl.sql
   $ datacontract test
2. Add quality checks and additional type constraints one by one to the contract and make sure the data still adheres to the contract.
   $ datacontract test
3. Validate that the datacontract.yaml is correctly formatted and adheres to the Data Contract Specification.
   $ datacontract lint
4. Set up a CI pipeline that executes daily for continuous quality checks. You can also report the test results to tools like Data Mesh Manager.
   $ datacontract test --publish https://api.datamesh-manager.com/api/test-results
Contract-First
Create a data contract based on the requirements from use cases.
1. Start with a datacontract.yaml template.
   $ datacontract init
2. Create the model and quality guarantees based on your business requirements. Fill in the terms, descriptions, etc. Validate that your datacontract.yaml is correctly formatted.
   $ datacontract lint
3. Use the export function to start building the providing data product as well as the integration into the consuming data products.
   $ datacontract export --format dbt
   $ datacontract export --format dbt-sources
   $ datacontract export --format dbt-staging-sql
4. Test that your data product implementation adheres to the contract.
   $ datacontract test
Customizing Exporters and Importers
Custom Exporter
Using the exporter factory to add a new custom exporter:

from datacontract.data_contract import DataContract
from datacontract.export.exporter import Exporter
from datacontract.export.exporter_factory import exporter_factory


class CustomExporter(Exporter):
    def export(self, data_contract, model, server, sql_server_type, export_args) -> dict:
        # Build a dictionary from the data contract metadata and the export arguments
        result = {
            "title": data_contract.info.title,
            "version": data_contract.info.version,
            "description": data_contract.info.description,
            "email": data_contract.info.contact.email,
            "url": data_contract.info.contact.url,
            "model": model,
            "model_columns": ", ".join(list(data_contract.models.get(model).fields.keys())),
            "export_args": export_args,
            "custom_args": export_args.get("custom_arg", ""),
        }
        return result


# Register the exporter under a custom format name
exporter_factory.register_exporter("custom_exporter", CustomExporter)

if __name__ == "__main__":
    data_contract = DataContract(
        data_contract_file="/path/datacontract.yaml"
    )
    result = data_contract.export(
        export_format="custom_exporter", model="orders", server="production", custom_arg="my_custom_arg"
    )
    print(result)
Output
{
'title': 'Orders Unit Test',
'version': '1.0.0',
'description': 'The orders data contract',
'email': 'team-orders@example.com',
'url': 'https://wiki.example.com/teams/checkout',
'model': 'orders',
'model_columns': 'order_id, order_total, order_status',
'export_args': {'server': 'production', 'custom_arg': 'my_custom_arg'},
'custom_args': 'my_custom_arg'
}
Custom Importer
Using the importer factory to add a new custom importer:

from datacontract.model.data_contract_specification import DataContractSpecification, Field, Model
from datacontract.data_contract import DataContract
from datacontract.imports.importer import Importer
from datacontract.imports.importer_factory import importer_factory
import json


class CustomImporter(Importer):
    def import_source(
        self, data_contract_specification: DataContractSpecification, source: str, import_args: dict
    ) -> dict:
        # Parse the proprietary JSON document and map it onto the data contract specification
        source_dict = json.loads(source)
        data_contract_specification.id = source_dict.get("id_custom")
        data_contract_specification.info.title = source_dict.get("title")
        data_contract_specification.info.version = source_dict.get("version")
        data_contract_specification.info.description = source_dict.get("description_from_app")
        for model in source_dict.get("models", []):
            fields = {}
            for column in model.get('columns'):
                field = Field(
                    description=column.get('column_description'),
                    type=column.get('type')
                )
                fields[column.get('name')] = field
            dc_model = Model(
                description=model.get('description'),
                fields=fields
            )
            data_contract_specification.models[model.get('name')] = dc_model
        return data_contract_specification


# Register the importer under a custom format name
importer_factory.register_importer("custom_company_importer", CustomImporter)

if __name__ == "__main__":
    json_from_custom_app = '''
    {
      "id_custom": "uuid-custom",
      "version": "0.0.2",
      "title": "my_custom_imported_data",
      "description_from_app": "Custom contract description",
      "models": [
        {
          "name": "model1",
          "description": "model description from app",
          "columns": [
            {
              "name": "columnA",
              "type": "varchar",
              "column_description": "my_column description"
            },
            {
              "name": "columnB",
              "type": "varchar",
              "column_description": "my_columnB description"
            }
          ]
        }
      ]
    }
    '''
    data_contract = DataContract()
    result = data_contract.import_from_source(
        format="custom_company_importer",
        data_contract_specification=DataContract.init(),
        source=json_from_custom_app
    )
    print(result.to_yaml())
Output
dataContractSpecification: 1.2.1
id: uuid-custom
info:
  title: my_custom_imported_data
  version: 0.0.2
  description: Custom contract description
models:
  model1:
    fields:
      columnA:
        type: varchar
        description: my_column description
      columnB:
        type: varchar
        description: my_columnB description
Development Setup
- Install uv
- Python base interpreter should be 3.11.x.
- Docker engine must be running to execute the tests.
uv python pin 3.11
uv venv
uv pip install -e '.[dev]'
uv run ruff check
uv run pytest
Troubleshooting
Windows: Some tests fail
Run the tests in WSL. (We need to fix the paths in the tests so that they work on native Windows; contributions are appreciated.)
PyCharm does not pick up the .venv
This uv issue might be relevant.
Try to sync all groups:
uv sync --all-groups --all-extras
Errors in tests that use PySpark (e.g. test_test_kafka.py)
Ensure you have a JDK 17 or 21 installed. Java 25 causes issues.
java --version
Docker Build
docker build -t datacontract/cli .
docker run --rm -v ${PWD}:/home/datacontract datacontract/cli
Docker compose integration
We've included a docker-compose.yml configuration to simplify the build, test, and deployment of the image.
Building the Image with Docker Compose
To build the Docker image using Docker Compose, run the following command:
docker compose build
This command utilizes the docker-compose.yml to build the image, leveraging predefined settings such as the build context and Dockerfile location. This approach streamlines the image creation process, avoiding the need for manual build specifications each time.
Testing the Image
After building the image, you can test it directly with Docker Compose:
docker compose run --rm datacontract --version
This command runs the container momentarily to check the version of the datacontract CLI. The --rm flag ensures that the container is automatically removed after the command executes, keeping your environment clean.
Release Steps
- Update the version in pyproject.toml
- Have a look at the CHANGELOG.md
- Create the release commit manually
- Execute ./release
- Wait until the GitHub Release is created
- Add the release notes to the GitHub Release
Contribution
We are happy to receive your contributions. Propose your change in an issue or directly create a pull request with your improvements.
Related Tools
- Entropy Data is a commercial tool to manage data contracts. It contains a web UI, access management, and data governance for a data product marketplace based on data contracts.
- Data Contract Editor is an editor for Data Contracts, including a live html preview.
- Data Contract Playground allows you to validate and export your data contract to different formats within your browser.
License
MIT License
Credits
Created by Stefan Negele, Jochen Christ, and Simon Harrer.