
baselinr
Baselinr is an open-source data quality and observability platform for SQL data warehouses. It automatically recommends which tables and columns to monitor, sets up data quality checks, profiles your data, detects drift and anomalies, and provides AI-powered root cause analysis—all with a self-improving feedback loop that learns from your data patterns. Use automation for zero-touch setup, or configure everything manually—you have full control and transparency.
Quality Studio - Baselinr includes a powerful no-code web interface for setting up and managing your entire data quality configuration. Configure connections, tables, profiling settings, validation rules, drift detection, and more—all through an intuitive visual interface without writing YAML or JSON. The UI provides real-time validation, smart recommendations, and a visual editor with YAML preview, making it easy to get started while maintaining full transparency and control.
🎮 Try Quality Studio Demo → - Experience the Quality Studio with realistic sample data showcasing all features.
Install Baselinr directly from PyPI:
pip install baselinr
Baselinr supports optional dependencies for enhanced functionality:
Snowflake Support:
pip install baselinr[snowflake]
Dagster Integration:
pip install baselinr[dagster]
Airflow Integration:
pip install baselinr[airflow]
All Features:
pip install baselinr[all]
For local development, clone the repository and install in editable mode:
git clone https://github.com/baselinrhq/baselinr.git
cd baselinr
pip install -e ".[dev]"
All documentation has been organized into the docs/ directory:
See docs/README.md for the complete documentation index.
Create a config.yml file:
environment: development

source:
  type: postgres
  host: localhost
  port: 5432
  database: mydb
  username: user
  password: password
  schema: public

storage:
  connection:
    type: postgres
    host: localhost
    port: 5432
    database: mydb
    username: user
    password: password
  results_table: baselinr_results
  runs_table: baselinr_runs
  create_tables: true
  enable_expectation_learning: true  # Learn expected ranges automatically
  learning_window_days: 30           # Use last 30 days of data
  min_samples: 5                     # Require at least 5 historical runs
  enable_anomaly_detection: true     # Detect anomalies using learned expectations

profiling:
  tables:
    # Explicit table selection (highest priority)
    - table: customers
      schema: public

    # Pattern-based selection (wildcard)
    - pattern: "user_*"
      schema: public
      # Matches: user_profile, user_settings, user_preferences, etc.

    # Schema-based selection (all tables in schema)
    - select_schema: true
      schema: analytics
      exclude_patterns:
        - "*_temp"
        - "*_backup"

    # Regex pattern matching
    - pattern: "^(customer|order)_\\d{4}$"
      pattern_type: regex
      schema: public
      # Matches: customer_2024, order_2024, etc.

    # Multi-database profiling (optional database field)
    # - table: users
    #   schema: public
    #   database: analytics_db   # Profile from analytics_db instead of source.database
    # - pattern: "order_*"
    #   schema: public
    #   database: warehouse_db   # Profile matching tables from warehouse_db
    # - select_schema: true
    #   schema: analytics
    #   database: production_db  # Profile all tables in analytics schema from production_db

  # Discovery options for pattern-based selection
  discovery_options:
    max_tables_per_pattern: 1000
    max_schemas_per_database: 100
    cache_discovery: true
    validate_regex: true

  default_sample_ratio: 1.0
  compute_histograms: true
  histogram_bins: 10
baselinr plan --config config.yml
This shows you what tables will be profiled without actually running the profiler.
baselinr profile --config config.yml
After running profiling multiple times:
baselinr drift --config config.yml --dataset customers
Execute validation rules to check data quality:
# Run all validation rules
baselinr validate --config config.yml
# Validate specific table
baselinr validate --config config.yml --table customers
# Save results to JSON file
baselinr validate --config config.yml --output validation_results.json
Query your profiling history and drift events:
# List recent profiling runs
baselinr query runs --config config.yml --limit 10
# Query drift events
baselinr query drift --config config.yml --table customers --days 7
# Get detailed run information
baselinr query run --config config.yml --run-id <run-id>
# View table profiling history
baselinr query table --config config.yml --table customers --days 30
Get a quick overview of recent runs and active drift:
# Show status dashboard
baselinr status --config config.yml
# Show only drift summary
baselinr status --config config.yml --drift-only
# Watch mode (auto-refresh)
baselinr status --config config.yml --watch
# JSON output for scripting
baselinr status --config config.yml --json
Launch the Quality Studio web interface to configure your data quality setup, view profiling runs, drift alerts, and metrics:
🎮 Try the Demo → - Experience Quality Studio with sample data (no installation required)
# Start Quality Studio (foreground mode)
baselinr ui --config config.yml
# Custom ports
baselinr ui --config config.yml --port-backend 8080 --port-frontend 3001
# Localhost only
baselinr ui --config config.yml --host 127.0.0.1
Press Ctrl+C to stop the Quality Studio. See docs/schemas/UI_COMMAND.md for more details.
The Quality Studio provides a no-code interface for configuring connections, tables, profiling settings, validation rules, and drift detection.
Check and apply schema migrations:
# Check schema version status
baselinr migrate status --config config.yml
# Apply migrations to latest version
baselinr migrate apply --config config.yml --target 1
# Validate schema integrity
baselinr migrate validate --config config.yml
Baselinr includes a complete Docker environment for local development and testing.
cd docker
docker-compose up -d
This starts the services defined in docker/docker-compose.yml.

To stop the environment:

cd docker
docker-compose down
Baselinr computes standard column-level profiling metrics, including counts, null counts and ratios, distinct counts, min/max, mean, standard deviation, and histograms.
See docs/guides/PROFILING_ENRICHMENT.md for detailed documentation on enrichment features.
Baselinr can automatically learn expected metric ranges from historical profiling data, creating statistical models that help identify outliers without explicit thresholds.
Expectation learning analyzes historical profiling data over a configurable window (default: 30 days) to compute expected values and control limits for each metric.
These learned expectations are automatically updated after each profiling run, providing an evolving model of what "normal" looks like for your data.
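For intuition, control limits like these can be derived from a metric's history with an exponentially weighted moving average. The sketch below is illustrative only (the function name and limit formula are assumptions, not Baselinr's internal model); it mirrors the ewma_lambda smoothing parameter from the configuration:

```python
def ewma_control_limits(history, lam=0.2, width=3.0):
    """Compute an EWMA center line and control limits from a metric history.

    Illustrative sketch only -- Baselinr's learned model may differ.
    lam: smoothing parameter (lower = more smoothing).
    width: limit width in standard deviations.
    """
    if len(history) < 2:
        raise ValueError("need at least 2 samples")
    # Exponentially weighted moving average over the history
    ewma = history[0]
    for x in history[1:]:
        ewma = lam * x + (1 - lam) * ewma
    # Sample standard deviation of the raw history
    mean = sum(history) / len(history)
    std = (sum((x - mean) ** 2 for x in history) / (len(history) - 1)) ** 0.5
    return {"center": ewma, "lower": ewma - width * std, "upper": ewma + width * std}

# A sudden jump (150 after values near 100) pulls the center line upward
limits = ewma_control_limits([100, 102, 98, 101, 99, 150], lam=0.2)
```

A new profiling run whose metric falls outside [lower, upper] would then be a candidate anomaly.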
Enable expectation learning in your config.yml:
storage:
  enable_expectation_learning: true
  learning_window_days: 30  # Historical window in days
  min_samples: 5            # Minimum runs required for learning
  ewma_lambda: 0.2          # EWMA smoothing parameter (0 < lambda <= 1)
See docs/guides/EXPECTATION_LEARNING.md for comprehensive documentation on expectation learning.
Baselinr can create Dagster assets dynamically from your configuration:
from baselinr.integrations.dagster import build_baselinr_definitions

defs = build_baselinr_definitions(
    config_path="config.yml",
    asset_prefix="baselinr",
    job_name="baselinr_profile_all",
    enable_sensor=True,  # optional
)
Baselinr provides comprehensive integration with dbt for scalable profiling and drift detection.
Reference dbt models directly in your baselinr configuration:

# baselinr_config.yml
profiling:
  tables:
    - dbt_ref: customers
      dbt_project_path: ./dbt_project
    - dbt_selector: tag:critical
      dbt_project_path: ./dbt_project

To get started:

1. Install Baselinr: pip install baselinr
2. Add dbt references to your baselinr config as shown above
3. Run dbt compile or dbt run to generate manifest.json
4. Run baselinr profile --config baselinr_config.yml

Note: dbt hooks can only execute SQL, not Python scripts. Run profiling after dbt run using an orchestrator or manually.
See dbt Integration Guide for complete documentation.
Baselinr provides comprehensive integration with Apache Airflow 2.x for orchestration and scheduling.
from airflow import DAG
from baselinr.integrations.airflow import BaselinrProfileOperator

dag = DAG("baselinr_profiling", ...)

profile_task = BaselinrProfileOperator(
    task_id="profile_tables",
    config_path="/path/to/config.yml",
    dag=dag,
)
See Airflow Integration Guide and Quick Start for complete documentation.
Baselinr provides a high-level Python SDK for programmatic access to all functionality.
from baselinr import BaselinrClient

# Initialize client
client = BaselinrClient(config_path="config.yml")

# Build execution plan
plan = client.plan()
print(f"Will profile {plan.total_tables} tables")

# Profile tables
results = client.profile()
for result in results:
    print(f"Profiled {result.dataset_name}: {len(result.columns)} columns")

# Detect drift
drift_report = client.detect_drift("customers")
print(f"Found {len(drift_report.column_drifts)} column drifts")

# Query recent runs
runs = client.query_runs(days=7, limit=10)

# Get status summary
status = client.get_status()
print(f"Active drift events: {len(status['drift_summary'])}")
For complete documentation of the BaselinrClient class, including all methods, parameters, and advanced patterns, see the Python SDK Guide.
baselinr/
├── baselinr/ # Main package
│ ├── config/ # Configuration management
│ ├── connectors/ # Database connectors
│ ├── profiling/ # Profiling engine
│ ├── storage/ # Results storage
│ ├── drift/ # Drift detection
│ ├── learning/ # Expectation learning
│ ├── anomaly/ # Anomaly detection
│ ├── integrations/
│ │ └── dagster/ # Dagster assets & sensors
│ └── cli.py # CLI interface
├── examples/ # Example configurations
│ ├── config.yml # PostgreSQL example
│ ├── config_sqlite.yml # SQLite example
│ ├── config_mysql.yml # MySQL example
│ ├── config_bigquery.yml # BigQuery example
│ ├── config_redshift.yml # Redshift example
│ ├── config_with_metrics.yml # Metrics example
│ ├── config_slack_alerts.yml # Slack alerts example
│ ├── dagster_repository.py
│ └── quickstart.py
├── docker/ # Docker environment
│ ├── docker-compose.yml
│ ├── Dockerfile
│ ├── init_postgres.sql
│ ├── dagster.yaml
│ └── workspace.yaml
├── setup.py
├── requirements.txt
└── README.md
python examples/quickstart.py
# View profiling plan (dry-run)
baselinr plan --config examples/config.yml
# View plan in JSON format
baselinr plan --config examples/config.yml --output json
# View plan with verbose details
baselinr plan --config examples/config.yml --verbose
# Profile all tables in config
baselinr profile --config examples/config.yml
# Profile with output to JSON
baselinr profile --config examples/config.yml --output results.json
# Dry run (don't write to storage)
baselinr profile --config examples/config.yml --dry-run
# Detect drift
baselinr drift --config examples/config.yml --dataset customers
# Detect drift with specific runs
baselinr drift --config examples/config.yml \
--dataset customers \
--baseline <run-id-1> \
--current <run-id-2>
# Fail on critical drift (useful for CI/CD)
baselinr drift --config examples/config.yml \
--dataset customers \
--fail-on-drift
# Use statistical tests for advanced drift detection
# (configure in config.yml: strategy: statistical)
# Query profiling runs
baselinr query runs --config examples/config.yml --limit 10
# Query drift events for a table
baselinr query drift --config examples/config.yml \
--table customers \
--severity high \
--days 7
# Get detailed run information
baselinr query run --config examples/config.yml \
--run-id <run-id> \
--format json
# View table profiling history
baselinr query table --config examples/config.yml \
--table customers \
--days 30 \
--format csv \
--output history.csv
# Check system status
baselinr status --config examples/config.yml
# Watch status (auto-refresh)
baselinr status --config examples/config.yml --watch
# Status with JSON output
baselinr status --config examples/config.yml --json
# Start Quality Studio
baselinr ui --config examples/config.yml
# Check schema migration status
baselinr migrate status --config examples/config.yml
# Apply schema migrations
baselinr migrate apply --config examples/config.yml --target 1
# Validate schema integrity
baselinr migrate validate --config examples/config.yml
Baselinr provides multiple drift detection strategies and intelligent baseline selection:
- Absolute Threshold (default): simple percentage-based thresholds
- Standard Deviation: statistical significance based on standard deviations
- Statistical Tests (advanced): multiple statistical tests for rigorous detection

Baselinr automatically selects the optimal baseline for drift detection based on column characteristics.
Thresholds and baseline selection are fully configurable via the drift_detection configuration. See docs/guides/DRIFT_DETECTION.md for general drift detection and docs/guides/STATISTICAL_DRIFT_DETECTION.md for statistical tests.
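As a rough sketch (the function name is hypothetical, not Baselinr's API), the default absolute-threshold strategy amounts to classifying the percent change between a baseline and current metric value against the low/medium/high thresholds shown in the drift_detection configuration:

```python
def classify_drift(baseline, current, low=5.0, medium=15.0, high=30.0):
    """Classify the percent change between baseline and current metric values.

    Illustrative sketch only; thresholds are percentages, matching the
    absolute_threshold config block (5.0 / 15.0 / 30.0 by default).
    """
    if baseline == 0:
        # No meaningful percentage when the baseline is zero
        return "high" if current != 0 else "none"
    change_pct = abs(current - baseline) / abs(baseline) * 100
    if change_pct >= high:
        return "high"
    if change_pct >= medium:
        return "medium"
    if change_pct >= low:
        return "low"
    return "none"

classify_drift(1000, 1200)  # 20% change falls in the medium band
```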
Baselinr includes a pluggable event system that emits events for drift detection, schema changes, and profiling lifecycle events. You can register hooks to process these events for logging, persistence, or alerting.
hooks:
  enabled: true
  hooks:
    # Log all events
    - type: logging
      log_level: INFO

    # Persist to database
    - type: sql
      table_name: baselinr_events
      connection:
        type: postgres
        host: localhost
        database: monitoring
        username: user
        password: pass
Create custom hooks by implementing the AlertHook protocol:
from baselinr.events import BaseEvent

class MyCustomHook:
    def handle_event(self, event: BaseEvent) -> None:
        # Process the event
        print(f"Event: {event.event_type}")
Configure custom hooks:
hooks:
  enabled: true
  hooks:
    - type: custom
      module: my_hooks
      class_name: MyCustomHook
      params:
        webhook_url: https://api.example.com/alerts
See docs/architecture/EVENTS_AND_HOOKS.md for comprehensive documentation and examples.
Baselinr includes a built-in schema versioning system to manage database schema evolution safely.
# Check current schema version status
baselinr migrate status --config config.yml
# Apply migrations to a specific version
baselinr migrate apply --config config.yml --target 1
# Preview migrations (dry run)
baselinr migrate apply --config config.yml --target 1 --dry-run
# Validate schema integrity
baselinr migrate validate --config config.yml
Schema versions and applied migrations are tracked in the baselinr_schema_version table.

Baselinr provides powerful querying capabilities to explore your profiling history and drift events.
# Query profiling runs with filters
baselinr query runs --config config.yml \
--table customers \
--status completed \
--days 30 \
--limit 20 \
--format table
# Query drift events
baselinr query drift --config config.yml \
--table customers \
--severity high \
--days 7 \
--format json
# Get detailed information about a specific run
baselinr query run --config config.yml \
--run-id abc123-def456 \
--format json
# View table profiling history over time
baselinr query table --config config.yml \
--table customers \
--schema public \
--days 90 \
--format csv \
--output history.csv
All query commands support multiple output formats:
source:
  type: postgres | snowflake | sqlite | mysql | bigquery | redshift
  host: hostname
  port: 5432
  database: database_name
  username: user
  password: password
  schema: schema_name  # Optional

  # Snowflake-specific
  account: snowflake_account
  warehouse: warehouse_name
  role: role_name

  # SQLite-specific
  filepath: /path/to/database.db

  # BigQuery-specific (credentials via extra_params)
  extra_params:
    credentials_path: /path/to/service-account-key.json
  # Or use GOOGLE_APPLICATION_CREDENTIALS environment variable

  # MySQL-specific
  # Uses standard host/port/database/username/password

  # Redshift-specific
  # Uses standard host/port/database/username/password
  # Default port: 5439
profiling:
  # Table discovery and pattern-based selection
  table_discovery: true  # Enable automatic table discovery
  discovery_options:
    max_tables_per_pattern: 1000   # Limit matches per pattern
    max_schemas_per_database: 100  # Limit schemas to scan
    validate_regex: true           # Validate regex patterns at config load time
  tag_provider: auto  # Tag metadata provider: auto, snowflake, bigquery, postgres, mysql, redshift, sqlite, dbt

  tables:
    # Explicit table selection (highest priority)
    - table: table_name
      schema: schema_name  # Optional

    # Pattern-based selection (wildcard)
    - pattern: "user_*"
      schema: public
      # Matches all tables starting with "user_"

    # Regex pattern matching
    - pattern: "^(customer|order)_\\d{4}$"
      pattern_type: regex
      schema: public
      # Matches: customer_2024, order_2024, etc.

    # Schema-based selection (all tables in schema)
    - select_schema: true
      schema: analytics
      exclude_patterns:
        - "*_temp"
        - "*_backup"

    # Database-level selection (all schemas)
    - select_all_schemas: true
      exclude_schemas:
        - "information_schema"
        - "pg_catalog"

    # Multi-database profiling (optional database field)
    # When database is specified, the pattern operates on that database
    # When omitted, uses config.source.database (backward compatible)
    # - table: customers
    #   schema: public
    #   database: analytics_db
    # - select_all_schemas: true
    #   database: staging_db  # Profile all schemas in staging_db

    # Tag-based selection
    - tags:
        - "data_quality:critical"
        - "domain:customer"
      schema: public

    # Precedence override (explicit table overrides pattern)
    - pattern: "events_*"
      schema: analytics
      override_priority: 10
    - table: events_critical
      schema: analytics
      override_priority: 100  # Higher priority overrides pattern

  default_sample_ratio: 1.0
  max_distinct_values: 1000
  compute_histograms: true  # Enable for statistical tests
  histogram_bins: 10
  metrics:
    - count
    - null_count
    - null_ratio
    - distinct_count
    - unique_ratio
    - approx_distinct_count
    - min
    - max
    - mean
    - stddev
    - histogram
    - data_type_inferred
drift_detection:
  # Strategy: absolute_threshold | standard_deviation | statistical
  strategy: absolute_threshold

  # Absolute threshold (default)
  absolute_threshold:
    low_threshold: 5.0
    medium_threshold: 15.0
    high_threshold: 30.0

  # Baseline auto-selection configuration
  baselines:
    strategy: auto  # auto | last_run | moving_average | prior_period | stable_window
    windows:
      moving_average: 7  # Number of runs for moving average
      prior_period: 7    # Days for prior period (1=day, 7=week, 30=month)
    min_runs: 3          # Minimum runs required for auto-selection

  # Statistical tests (advanced)
  # statistical:
  #   tests:
  #     - ks_test
  #     - psi
  #     - z_score
  #     - chi_square
  #     - entropy
  #     - top_k
  #   sensitivity: medium
  #   test_params:
  #     ks_test:
  #       alpha: 0.05
  #     psi:
  #       buckets: 10
  #       threshold: 0.2
storage:
  # Enable automatic learning of expected metric ranges
  enable_expectation_learning: true

  # Historical window in days for learning expectations
  learning_window_days: 30

  # Minimum number of historical runs required for learning
  min_samples: 5

  # EWMA smoothing parameter for control limits (0 < lambda <= 1)
  # Lower values = more smoothing (0.1-0.3 recommended)
  ewma_lambda: 0.2
storage:
  # Enable automatic anomaly detection using learned expectations
  enable_anomaly_detection: true

  # List of enabled detection methods (default: all methods)
  anomaly_enabled_methods:
    - control_limits
    - iqr
    - mad
    - ewma
    - seasonality
    - regime_shift

  # IQR multiplier threshold for outlier detection
  anomaly_iqr_threshold: 1.5

  # MAD threshold (modified z-score) for outlier detection
  anomaly_mad_threshold: 3.0

  # EWMA deviation threshold (number of stddevs)
  anomaly_ewma_deviation_threshold: 2.0

  # Enable trend and seasonality detection
  anomaly_seasonality_enabled: true

  # Enable regime shift detection
  anomaly_regime_shift_enabled: true

  # Number of recent runs for regime shift comparison
  anomaly_regime_shift_window: 3

  # P-value threshold for regime shift detection
  anomaly_regime_shift_sensitivity: 0.05
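For intuition about what the IQR and MAD methods check, here is a minimal sketch using standard robust statistics (illustrative only; function names are hypothetical and Baselinr's implementation may differ). The k and threshold defaults mirror anomaly_iqr_threshold and anomaly_mad_threshold above:

```python
import statistics

def iqr_outlier(history, value, k=1.5):
    """Flag value if it falls outside [Q1 - k*IQR, Q3 + k*IQR].

    k corresponds to anomaly_iqr_threshold. Illustrative sketch only.
    """
    q = statistics.quantiles(sorted(history), n=4)  # [Q1, Q2, Q3]
    q1, q3 = q[0], q[2]
    iqr = q3 - q1
    return value < q1 - k * iqr or value > q3 + k * iqr

def mad_outlier(history, value, threshold=3.0):
    """Flag value if its modified z-score exceeds threshold.

    threshold corresponds to anomaly_mad_threshold. Illustrative sketch only.
    """
    med = statistics.median(history)
    mad = statistics.median([abs(x - med) for x in history])
    if mad == 0:
        # Degenerate history: any deviation counts as an outlier
        return value != med
    modified_z = 0.6745 * (value - med) / mad  # 0.6745 rescales MAD to ~stddev
    return abs(modified_z) > threshold
```

Both methods are robust to the occasional extreme value in the history itself, which is why they complement plain control limits.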
Baselinr supports environment variable overrides:
# Override source connection
export BASELINR_SOURCE__HOST=prod-db.example.com
export BASELINR_SOURCE__PASSWORD=secret
# Override environment
export BASELINR_ENVIRONMENT=production
# Run profiling
baselinr profile --config config.yml
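The double-underscore convention maps an environment variable suffix to a nested configuration path (e.g. BASELINR_SOURCE__HOST overrides source.host). A hypothetical sketch of that mapping, assuming Baselinr's actual loader behaves similarly:

```python
def collect_overrides(environ, prefix="BASELINR_"):
    """Turn BASELINR_SECTION__KEY variables into nested config overrides.

    Illustrative sketch of the double-underscore convention; pass a
    mapping such as os.environ. Not Baselinr's actual loader.
    """
    overrides = {}
    for name, value in environ.items():
        if not name.startswith(prefix):
            continue
        # Strip the prefix, lowercase, and split on "__" into a key path
        path = name[len(prefix):].lower().split("__")
        node = overrides
        for key in path[:-1]:
            node = node.setdefault(key, {})
        node[path[-1]] = value
    return overrides

env = {"BASELINR_SOURCE__HOST": "prod-db.example.com",
       "BASELINR_ENVIRONMENT": "production"}
collect_overrides(env)
# → {'source': {'host': 'prod-db.example.com'}, 'environment': 'production'}
```

The resulting dictionary would then be deep-merged over the values loaded from config.yml.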
pytest
black baselinr/
isort baselinr/
mypy baselinr/
Apache License 2.0 - see LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request.
For questions and support, please open an issue on GitHub.
Baselinr - Modern data profiling made simple 🧩