
Data testing, monitoring, and profiling for SQL-accessible data.
What does Soda SQL do?
Soda SQL allows you to
Why Soda SQL?
To protect the consumers of your data against silent data issues, it's best practice to profile and test your data:
This way you will prevent delivery of bad data to downstream consumers. You will spend less time firefighting and gain a better reputation.
How does Soda SQL work?
Soda SQL is a Command Line Interface (CLI) and a Python library to measure and test your data using SQL.
As input, Soda SQL uses YAML configuration files that include:
Based on those configuration files, Soda SQL will perform scans. A scan performs all measurements and runs all tests associated with one table. Typically a scan is executed after new data has arrived. All soda-sql configuration files can be checked into your version control system as part of your pipeline code.
Want to try Soda SQL? Head over to our 'Quick start tutorial' and get started straight away!
Let's walk through an example. Simple metrics and tests can be configured in scan YAML configuration files. An example of the contents of such a file:
metrics:
    - row_count
    - missing_count
    - missing_percentage
    - values_count
    - values_percentage
    - valid_count
    - valid_percentage
    - invalid_count
    - invalid_percentage
    - min
    - max
    - avg
    - sum
    - min_length
    - max_length
    - avg_length
    - distinct
    - unique_count
    - duplicate_count
    - uniqueness
    - maxs
    - mins
    - frequent_values
    - histogram
columns:
    ID:
        metrics:
            - distinct
            - duplicate_count
        valid_format: uuid
        tests:
            duplicate_count == 0
    CATEGORY:
        missing_values:
            - N/A
            - No category
        tests:
            missing_percentage < 3
    SIZE:
        tests:
            max - min < 20
sql_metrics:
    - sql: |
        SELECT sum(volume) as total_volume_us
        FROM CUSTOMER_TRANSACTIONS
        WHERE country = 'US'
      tests:
        - total_volume_us > 5000
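To make the relationship between metrics and tests concrete, here is a minimal Python sketch of how a test expression such as `max - min < 20` could be checked against a set of measured values. This is an illustration only, not Soda SQL's actual evaluator: the `run_tests` helper and the measurement values are hypothetical, and `eval` is used purely for brevity (it should not be applied to untrusted expressions).

```python
# Hypothetical sketch: NOT Soda SQL's actual test evaluator.

def run_tests(measurements, tests):
    """Evaluate each test expression against a dict of metric values.

    Returns a dict mapping each expression to True (passed) or False (failed).
    """
    results = {}
    for expression in tests:
        # Builtins are disabled so that names such as `max` and `min` resolve
        # to the measured metric values rather than to Python functions.
        outcome = eval(expression, {"__builtins__": {}}, dict(measurements))
        results[expression] = bool(outcome)
    return results


# Made-up measurements for the SIZE and CATEGORY columns configured above.
size_results = run_tests({"min": 3, "max": 18}, ["max - min < 20"])
category_results = run_tests({"missing_percentage": 4.2}, ["missing_percentage < 3"])

print(size_results)      # {'max - min < 20': True}
print(category_results)  # {'missing_percentage < 3': False}
```

In this sketch a failing test is simply reported as `False`; a pipeline would typically stop or alert when any value in the result is `False`.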
Based on these configuration files, Soda SQL scans your data each time new data arrives, like this:
$ soda scan ./soda/metrics my_warehouse my_dataset
Soda 1.0 scan for dataset my_dataset on prod my_warehouse
| SELECT column_name, data_type, is_nullable
| FROM information_schema.columns
| WHERE lower(table_name) = 'customers'
| AND table_catalog = 'datasource.database'
| AND table_schema = 'datasource.schema'
- 0.256 seconds
Found 4 columns: ID, NAME, CREATE_DATE, COUNTRY
| SELECT
| COUNT(*),
| COUNT(CASE WHEN ID IS NULL THEN 1 END),
| COUNT(CASE WHEN ID IS NOT NULL AND ID regexp '\b[0-9a-f]{8}\b-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-\b[0-9a-f]{12}\b' THEN 1 END),
| MIN(LENGTH(ID)),
| AVG(LENGTH(ID)),
| MAX(LENGTH(ID))
| FROM customers
- 0.557 seconds
row_count : 23543
missing : 23
invalid : 0
min_length: 9
avg_length: 9
max_length: 9
...more queries...
47 measurements computed
23 tests executed
All is good. No tests failed. Scan took 23.307 seconds
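The aggregate query in the scan output above can be reproduced in miniature with Python's built-in `sqlite3` module. The sketch below is an illustration under stated assumptions, not Soda SQL's implementation: the in-memory `customers` table, its sample rows, and the metric names in `names` are made up for the example.

```python
import sqlite3

# Build a tiny in-memory table standing in for a warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (ID TEXT, NAME TEXT)")
conn.executemany(
    "INSERT INTO customers (ID, NAME) VALUES (?, ?)",
    [("a1", "Ann"), ("b22", "Bob"), (None, "Cy"), ("c333", "Di")],
)

# One query computes several column metrics at once, as in the scan log.
row = conn.execute(
    """
    SELECT
      COUNT(*),
      COUNT(CASE WHEN ID IS NULL THEN 1 END),
      MIN(LENGTH(ID)),
      AVG(LENGTH(ID)),
      MAX(LENGTH(ID))
    FROM customers
    """
).fetchone()

# Pair each aggregate with a metric name (names chosen for this example).
names = ["row_count", "missing_count", "min_length", "avg_length", "max_length"]
measurements = dict(zip(names, row))

for name, value in measurements.items():
    print(f"{name}: {value}")
```

Computing all aggregates in a single pass keeps the number of queries per column low, which matters when a scan runs against a large warehouse table.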
The next step is to add Soda SQL scans in your favorite data pipeline orchestration solution like:
If you like the goals of this project, encourage us! Star sodadata/soda-sql on GitHub.
Next, head over to our 'Quick start tutorial' and get your first project going!
FAQs
Soda SQL library & CLI
We found that soda-sql-core demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 2 open source maintainers collaborating on the project.