truthound (pypi package): comparing version 1.3.2 to 1.5.0
docs/cli/core/read.md
# truthound read
Read and preview data from files or database connections. Supports row/column selection, multiple output formats, and schema inspection.
## Synopsis
```bash
truthound read [FILE] [OPTIONS]
```
## Arguments
| Argument | Required | Description |
|----------|----------|-------------|
| `file` | No | Path to the data file (CSV, JSON, Parquet, NDJSON) |
## Data Source Options
| Option | Short | Default | Description |
|--------|-------|---------|-------------|
| `--connection` | `--conn` | None | Database connection string |
| `--table` | | None | Database table name |
| `--query` | | None | SQL query (alternative to `--table`) |
| `--source-config` | `--sc` | None | Path to data source config file (JSON/YAML) |
| `--source-name` | | None | Custom label for the data source |
## Selection Options
| Option | Short | Default | Description |
|--------|-------|---------|-------------|
| `--head` | `-n` | None | Show only the first N rows |
| `--sample` | `-s` | None | Random sample of N rows |
| `--columns` | `-c` | None | Columns to include (comma-separated) |
## Output Options
| Option | Short | Default | Description |
|--------|-------|---------|-------------|
| `--format` | `-f` | `table` | Output format (table, csv, json, parquet, ndjson) |
| `--output` | `-o` | None | Output file path |
## Inspection Options
| Option | Short | Default | Description |
|--------|-------|---------|-------------|
| `--schema-only` | | `false` | Show only column names and types |
| `--count-only` | | `false` | Show only the row count |
## Examples
### Basic Reading
```bash
truthound read data.csv
truthound read data.parquet --head 20
truthound read data.csv --columns id,name,age
```
### Database Reading
```bash
truthound read --connection "postgresql://user:pass@host/db" --table users
truthound read --connection "sqlite:///data.db" --table orders --head 10
truthound read --source-config db.yaml --sample 1000
```
### Schema Inspection
```bash
truthound read data.csv --schema-only
truthound read --connection "postgresql://host/db" --table users --schema-only
```
### Format Conversion
```bash
truthound read data.csv --format json -o output.json
truthound read data.csv --format parquet -o output.parquet
truthound read data.csv --format csv --head 100
```
### Row Count
```bash
truthound read data.csv --count-only
```
## Related Commands
- [`check`](check.md) - Validate data quality
- [`profile`](profile.md) - Generate data profile
- [`learn`](learn.md) - Learn schema from data
## See Also
- [Python API: th.read()](../../python-api/core-functions.md#thread)
- [Data Source Options](../../guides/datasources/cli-datasource-guide.md)
# CLI Data Source Guide
All Truthound CLI commands support reading data from databases and external sources in addition to local files. This guide covers the shared data source options available across all core commands.
## Overview
Truthound CLI commands accept data from three input modes:
1. **File mode** (default): Pass a file path as a positional argument
2. **Connection string mode**: Use `--connection` and `--table` (or `--query`) to connect to a database
3. **Source config mode**: Use `--source-config` to load connection details from a JSON or YAML file
These modes are mutually exclusive. Supplying a file argument together with `--connection` or `--source-config` is treated as a conflict and the command exits with an error.
## Data Source Options
The following options are available on all core commands (`check`, `scan`, `mask`, `profile`, `learn`, `compare`, `read`):
| Option | Short | Description |
|--------|-------|-------------|
| `--connection` | `--conn` | Database connection string (see formats below) |
| `--table` | | Database table name to read |
| `--query` | | SQL query to execute (alternative to `--table`) |
| `--source-config` | `--sc` | Path to a data source config file (JSON or YAML) |
| `--source-name` | | Custom label for the data source (used in reports) |
## Connection String Formats
### PostgreSQL
```
postgresql://user:password@host:5432/dbname
```
Install the PostgreSQL backend:
```bash
pip install truthound[postgresql]
```
### MySQL
```
mysql://user:password@host:3306/dbname
```
Install the MySQL backend:
```bash
pip install truthound[mysql]
```
### SQLite
```
sqlite:///path/to/database.db
sqlite:///./relative/path.db
```
SQLite is included by default; no extra install is needed.
### DuckDB
```
duckdb:///path/to/database.duckdb
duckdb:///:memory:
```
Install the DuckDB backend:
```bash
pip install truthound[duckdb]
```
### Microsoft SQL Server
```
mssql://user:password@host:1433/dbname
```
Install the SQL Server backend:
```bash
pip install truthound[mssql]
```
## Source Config File Format
For repeatable or complex connection setups, use a source config file with `--source-config`.
### JSON Example
```json
{
  "type": "postgresql",
  "connection": "postgresql://user:password@host:5432/dbname",
  "table": "users",
  "source_name": "production-users"
}
```
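Programmatically, a JSON config like this can be parsed and sanity-checked with the standard library. A simplified sketch of the rules the CLI enforces (a `connection` or `type` field must be present, plus `table` or `query`; the real implementation raises `DataSourceError` rather than `ValueError`):

```python
import json

def load_source_config(text: str) -> dict:
    """Parse a JSON source config and check the fields the CLI requires."""
    config = json.loads(text)
    if not isinstance(config, dict):
        raise ValueError("source config must be a JSON/YAML object")
    if "connection" not in config and "type" not in config:
        raise ValueError("config must have either 'connection' or 'type'")
    if "table" not in config and "query" not in config:
        raise ValueError("config requires 'table' or 'query'")
    return config

cfg = load_source_config(
    '{"type": "postgresql",'
    ' "connection": "postgresql://user:password@host:5432/dbname",'
    ' "table": "users", "source_name": "production-users"}'
)
```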
### YAML Example
```yaml
type: postgresql
connection: postgresql://user:password@host:5432/dbname
table: users
source_name: production-users
```
### Using a SQL Query
```yaml
type: postgresql
connection: postgresql://user:password@host:5432/dbname
query: "SELECT id, name, email FROM users WHERE active = true"
source_name: active-users
```
## Dual-Source Config (for `compare`)
The `compare` command accepts two data sources. You can provide a source config file that defines both `baseline` and `current`:
```yaml
baseline:
  type: postgresql
  connection: postgresql://user:pass@host/db
  table: users_baseline
current:
  type: postgresql
  connection: postgresql://user:pass@host/db
  table: users_current
```
Usage:
```bash
truthound compare --source-config compare_sources.yaml --method psi
```
Alternatively, you can specify individual files or connections for each source on the command line.
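After parsing, extracting the two sections reduces to a simple lookup with a presence check. A minimal sketch (the CLI's resolver raises `DataSourceError` with an example-config hint instead of `ValueError`):

```python
def split_compare_config(config: dict) -> tuple[dict, dict]:
    """Extract the baseline/current sections, as compare's resolver does."""
    baseline = config.get("baseline")
    current = config.get("current")
    if not baseline or not current:
        raise ValueError("compare config needs 'baseline' and 'current' sections")
    return baseline, current

cfg = {
    "baseline": {"connection": "postgresql://user:pass@host/db", "table": "users_baseline"},
    "current": {"connection": "postgresql://user:pass@host/db", "table": "users_current"},
}
b, c = split_compare_config(cfg)
```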
## Per-Backend Install Hints
Truthound uses optional dependency groups for database backends. Install only what you need:
| Backend | Install Command |
|---------|----------------|
| PostgreSQL | `pip install truthound[postgresql]` |
| MySQL | `pip install truthound[mysql]` |
| DuckDB | `pip install truthound[duckdb]` |
| SQL Server | `pip install truthound[mssql]` |
| BigQuery | `pip install truthound[bigquery]` |
| Snowflake | `pip install truthound[snowflake]` |
| All databases | `pip install truthound[databases]` |
SQLite support is included in the base install.
## Security Considerations
**Do not put passwords directly in CLI history.** Connection strings with embedded credentials are visible in shell history and process listings.
Recommended practices:
1. **Use environment variables:**
```bash
export DB_CONN="postgresql://user:password@host/db"
truthound check --connection "$DB_CONN" --table users
```
2. **Use source config files** with restricted file permissions:
```bash
chmod 600 db_config.yaml
truthound check --source-config db_config.yaml
```
3. **Use `.pgpass` or equivalent** credential files supported by your database client.
4. **Avoid inline passwords** in CI/CD pipelines. Use secrets management (GitHub Secrets, Vault, etc.) and inject via environment variables.
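Practice 1 can also be done from Python when scripting around the CLI: assemble the connection string from environment variables so credentials never appear in shell history, and percent-escape the password in case it contains special characters. A sketch under stated assumptions (`conn_from_env` is a hypothetical helper, not part of Truthound):

```python
import os
from urllib.parse import quote_plus

def conn_from_env(prefix: str = "DB") -> str:
    """Assemble a PostgreSQL connection string from environment variables."""
    user = os.environ[f"{prefix}_USER"]
    password = quote_plus(os.environ[f"{prefix}_PASSWORD"])  # escape @, :, spaces
    host = os.environ.get(f"{prefix}_HOST", "localhost")
    name = os.environ[f"{prefix}_NAME"]
    return f"postgresql://{user}:{password}@{host}/{name}"

# Demo values only; in practice these come from the shell or a secrets manager.
os.environ.update({"DB_USER": "app", "DB_PASSWORD": "p@ss word", "DB_NAME": "prod"})
conn = conn_from_env()
```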
## Examples for Each Command
### check
```bash
# Validate a PostgreSQL table
truthound check --connection "postgresql://user:pass@host/db" --table orders
# Validate with source config
truthound check --source-config prod_db.yaml --strict
```
### scan
```bash
# Scan a database table for PII
truthound scan --connection "postgresql://user:pass@host/db" --table customers
```
### mask
```bash
# Mask PII in a database table and write to a file
truthound mask --connection "sqlite:///data.db" --table users -o masked_users.csv
```
### profile
```bash
# Profile a database table
truthound profile --connection "postgresql://user:pass@host/db" --table transactions
```
### learn
```bash
# Learn schema from a database table
truthound learn --connection "postgresql://user:pass@host/db" --table products -o schema.yaml
```
### compare
```bash
# Compare two database tables
truthound compare --source-config compare_sources.yaml --method psi --strict
```
### read
```bash
# Preview a database table
truthound read --connection "postgresql://user:pass@host/db" --table users --head 20
# Run a SQL query and export as CSV
truthound read --connection "sqlite:///data.db" --query "SELECT * FROM orders WHERE total > 100" --format csv -o high_orders.csv
```
## See Also
- [Data Sources Overview](index.md)
- [Database Connections](databases.md)
- [CLI Core Commands](../../cli/core/index.md)
"""Shared DataSource resolution for CLI commands.

This module provides a unified abstraction layer that resolves CLI
options (file path, connection string, or config file) into either
a file path string or a BaseDataSource instance. All core CLI commands
use this layer for consistent data source handling.

Architecture:
    CLI options → resolve_datasource() → (file_path | None, source | None)
                          ↓                           ↓
                   api.func(data=...)          api.func(source=...)
"""
from __future__ import annotations

import json
import logging
from pathlib import Path
from typing import TYPE_CHECKING, Annotated, Any, Optional

import typer

from truthound.cli_modules.common.errors import (
    CLIError,
    DataSourceError,
    ErrorCode,
    FileNotFoundError,
    require_file,
)

if TYPE_CHECKING:
    from truthound.datasources.base import BaseDataSource

logger = logging.getLogger(__name__)

# =============================================================================
# Reusable Annotated CLI Options
# =============================================================================

ConnectionOpt = Annotated[
    Optional[str],
    typer.Option(
        "--connection",
        "--conn",
        help=(
            "Database connection string. "
            "Examples: postgresql://user:pass@host:5432/db, "
            "mysql://user:pass@host/db, sqlite:///path/to.db"
        ),
    ),
]

TableOpt = Annotated[
    Optional[str],
    typer.Option(
        "--table",
        help="Database table name (required with --connection for SQL sources)",
    ),
]

QueryOpt = Annotated[
    Optional[str],
    typer.Option(
        "--query",
        help="SQL query to validate (alternative to --table)",
    ),
]

SourceConfigOpt = Annotated[
    Optional[Path],
    typer.Option(
        "--source-config",
        "--sc",
        help=(
            "Path to data source configuration file (JSON/YAML). "
            "See docs for config file format."
        ),
    ),
]

SourceNameOpt = Annotated[
    Optional[str],
    typer.Option(
        "--source-name",
        help="Custom name for the data source (used in report labels)",
    ),
]

# =============================================================================
# DataSource Resolution
# =============================================================================

def resolve_datasource(
    file: Path | None = None,
    connection: str | None = None,
    table: str | None = None,
    query: str | None = None,
    source_config: Path | None = None,
    source_name: str | None = None,
) -> tuple[str | None, "BaseDataSource | None"]:
    """Resolve CLI options into a file path or BaseDataSource instance.

    This is the central resolution function used by all CLI commands.
    It enforces mutual exclusivity between input modes and validates
    required parameters for each mode.

    Args:
        file: Path to a data file (CSV, JSON, Parquet, etc.)
        connection: Database connection string
        table: Database table name (for SQL sources)
        query: SQL query string (alternative to table)
        source_config: Path to a JSON/YAML data source config file
        source_name: Custom label for the data source

    Returns:
        A tuple of (file_path, source) where exactly one is non-None.

    Raises:
        DataSourceError: If inputs are invalid or conflicting.
        FileNotFoundError: If the specified file does not exist.
    """
    _validate_input_exclusivity(file, connection, source_config)

    # Mode 1: Source config file
    if source_config is not None:
        require_file(source_config, "Source config file")
        config = parse_source_config(source_config)
        source = create_datasource_from_config(config)
        if source_name:
            _set_source_name(source, source_name)
        return None, source

    # Mode 2: Connection string
    if connection is not None:
        source = _create_from_connection(connection, table, query, source_name)
        return None, source

    # Mode 3: File path (legacy, default)
    if file is not None:
        require_file(file)
        return str(file), None

    # No input provided
    raise DataSourceError(
        "No data input specified.",
        hint=(
            "Provide one of:\n"
            "  - A file path: truthound <command> data.csv\n"
            "  - A connection: truthound <command> --connection 'postgresql://...' --table users\n"
            "  - A config file: truthound <command> --source-config db.yaml"
        ),
    )

def resolve_compare_sources(
    baseline: Path | None = None,
    current: Path | None = None,
    source_config: Path | None = None,
) -> tuple[
    tuple[str | None, "BaseDataSource | None"],
    tuple[str | None, "BaseDataSource | None"],
]:
    """Resolve inputs for the compare command (dual-source).

    Args:
        baseline: Baseline file path
        current: Current file path
        source_config: Config file with baseline/current sections

    Returns:
        Tuple of (baseline_resolution, current_resolution),
        each a (file_path | None, source | None) pair.

    Raises:
        DataSourceError: If inputs are invalid or conflicting.
    """
    if source_config is not None:
        if baseline is not None or current is not None:
            raise DataSourceError(
                "Cannot specify both file paths and --source-config for compare.",
                hint="Use either positional file args OR --source-config, not both.",
            )
        require_file(source_config, "Source config file")
        config = parse_source_config(source_config)
        baseline_cfg = config.get("baseline")
        current_cfg = config.get("current")
        if not baseline_cfg or not current_cfg:
            raise DataSourceError(
                "Compare source config must have 'baseline' and 'current' sections.",
                hint=(
                    "Example config:\n"
                    "  baseline:\n"
                    "    connection: postgresql://...\n"
                    "    table: train_data\n"
                    "  current:\n"
                    "    connection: postgresql://...\n"
                    "    table: prod_data"
                ),
            )
        baseline_source = create_datasource_from_config(baseline_cfg)
        current_source = create_datasource_from_config(current_cfg)
        return (None, baseline_source), (None, current_source)

    # File-based path
    if baseline is None or current is None:
        raise DataSourceError(
            "Both baseline and current data must be specified.",
            hint=(
                "Provide two file paths:\n"
                "  truthound compare baseline.csv current.csv\n"
                "Or use --source-config with baseline/current sections."
            ),
        )
    require_file(baseline, "Baseline file")
    require_file(current, "Current file")
    return (str(baseline), None), (str(current), None)

# =============================================================================
# Config File Parsing
# =============================================================================

def parse_source_config(config_path: Path) -> dict[str, Any]:
    """Parse a data source configuration file (JSON or YAML).

    Supported formats:
      - JSON (.json)
      - YAML (.yaml, .yml)

    Config schema for single source:
        type: postgresql
        connection: "postgresql://user:pass@host:5432/db"
        table: users

    Config schema for compare (dual source):
        baseline:
          connection: "postgresql://..."
          table: train_data
        current:
          connection: "postgresql://..."
          table: prod_data

    Args:
        config_path: Path to the configuration file.

    Returns:
        Parsed configuration dictionary.

    Raises:
        DataSourceError: If the file cannot be parsed.
    """
    content = config_path.read_text(encoding="utf-8")
    suffix = config_path.suffix.lower()
    if suffix == ".json":
        try:
            config = json.loads(content)
        except json.JSONDecodeError as e:
            raise DataSourceError(
                f"Invalid JSON in source config: {e}",
                hint=f"Check the syntax of {config_path}",
            )
    elif suffix in (".yaml", ".yml"):
        try:
            import yaml

            config = yaml.safe_load(content)
        except ImportError:
            raise DataSourceError(
                "YAML config requires PyYAML.",
                hint="Install with: pip install pyyaml",
            )
        except Exception as e:
            raise DataSourceError(
                f"Invalid YAML in source config: {e}",
                hint=f"Check the syntax of {config_path}",
            )
    else:
        raise DataSourceError(
            f"Unsupported config file format: {suffix}",
            hint="Use .json, .yaml, or .yml",
        )
    if not isinstance(config, dict):
        raise DataSourceError(
            "Source config must be a JSON/YAML object (dictionary).",
            hint=f"Check {config_path}",
        )
    return config

def create_datasource_from_config(config: dict[str, Any]) -> "BaseDataSource":
    """Create a BaseDataSource from a parsed configuration dictionary.

    Supports two config styles:

    1. Connection string style:
        {"connection": "postgresql://...", "table": "users"}

    2. Individual parameters style:
        {"type": "postgresql", "host": "localhost", "database": "mydb",
         "user": "postgres", "password": "...", "table": "users"}

    Args:
        config: Configuration dictionary.

    Returns:
        Configured BaseDataSource instance.

    Raises:
        DataSourceError: If the config is invalid or the backend is unavailable.
    """
    from truthound.datasources.factory import get_sql_datasource
    from truthound.datasources.sql import get_available_sources

    connection = config.get("connection")
    table = config.get("table")
    query = config.get("query")
    source_type = config.get("type")

    # Style 1: Connection string
    if connection:
        if not table and not query:
            raise DataSourceError(
                "Config with 'connection' requires 'table' or 'query'.",
                hint="Add a 'table' or 'query' field to your config file.",
            )
        try:
            return get_sql_datasource(
                connection, table=table or "__query__", query=query
            )
        except Exception as e:
            raise DataSourceError(
                f"Failed to create data source from connection string: {e}",
                source_type=source_type,
            )

    # Style 2: Individual parameters with type
    if not source_type:
        raise DataSourceError(
            "Config must have either 'connection' or 'type' field.",
            hint=(
                "Example:\n"
                "  connection: postgresql://user:pass@host:5432/db\n"
                "  table: users\n"
                "Or:\n"
                "  type: postgresql\n"
                "  host: localhost\n"
                "  database: mydb\n"
                "  table: users"
            ),
        )
    if not table and not query:
        raise DataSourceError(
            f"Config for type '{source_type}' requires 'table' or 'query'.",
        )

    available = get_available_sources()
    source_cls = available.get(source_type)
    if source_cls is None:
        available_names = [k for k, v in available.items() if v is not None]
        raise DataSourceError(
            f"Data source type '{source_type}' is not available.",
            source_type=source_type,
            hint=(
                f"Available types: {', '.join(available_names)}. "
                f"You may need to install the required driver."
            ),
        )

    # Build constructor kwargs from config (exclude meta keys)
    meta_keys = {"type", "table", "query", "name"}
    kwargs: dict[str, Any] = {}
    if table:
        kwargs["table"] = table
    if query:
        kwargs["query"] = query
    for key, value in config.items():
        if key not in meta_keys:
            kwargs[key] = value

    try:
        return source_cls(**kwargs)
    except TypeError as e:
        raise DataSourceError(
            f"Invalid config for '{source_type}': {e}",
            source_type=source_type,
            hint=f"Check the supported parameters for {source_type} data source.",
        )
    except Exception as e:
        raise DataSourceError(
            f"Failed to create '{source_type}' data source: {e}",
            source_type=source_type,
        )

# =============================================================================
# Internal Helpers
# =============================================================================

def _validate_input_exclusivity(
    file: Path | None,
    connection: str | None,
    source_config: Path | None,
) -> None:
    """Validate that at most one data input mode is specified."""
    modes = []
    if file is not None:
        modes.append("file argument")
    if connection is not None:
        modes.append("--connection")
    if source_config is not None:
        modes.append("--source-config")
    if len(modes) > 1:
        raise DataSourceError(
            f"Conflicting data inputs: {' and '.join(modes)}.",
            hint="Specify only one: a file path, --connection, or --source-config.",
        )

def _create_from_connection(
    connection: str,
    table: str | None,
    query: str | None,
    source_name: str | None,
) -> "BaseDataSource":
    """Create a BaseDataSource from a connection string."""
    from truthound.datasources.factory import get_sql_datasource

    if not table and not query:
        raise DataSourceError(
            "--table or --query is required with --connection.",
            hint=(
                "Example:\n"
                "  --connection 'postgresql://user:pass@host/db' --table users\n"
                "  --connection 'sqlite:///data.db' --query 'SELECT * FROM orders'"
            ),
        )
    try:
        target = table or "__query__"
        source = get_sql_datasource(connection, table=target, query=query)
    except ImportError as e:
        _raise_driver_hint(connection, e)
    except Exception as e:
        raise DataSourceError(
            f"Failed to connect: {e}",
            hint="Check the connection string format and database availability.",
        )
    if source_name:
        _set_source_name(source, source_name)
    return source

def _set_source_name(source: "BaseDataSource", name: str) -> None:
    """Attempt to set a custom name on a data source."""
    if hasattr(source, "config") and hasattr(source.config, "name"):
        try:
            source.config.name = name
        except (AttributeError, TypeError):
            pass

def _raise_driver_hint(connection: str, error: ImportError) -> None:
    """Raise a DataSourceError with install hints based on connection string."""
    conn_lower = connection.lower()
    hints = {
        "postgresql": ("psycopg2-binary", "pip install truthound[postgresql]"),
        "postgres": ("psycopg2-binary", "pip install truthound[postgresql]"),
        "mysql": ("pymysql", "pip install truthound[mysql]"),
        "oracle": ("oracledb", "pip install oracledb"),
        "mssql": ("pyodbc", "pip install pyodbc"),
        "sqlserver": ("pyodbc", "pip install pyodbc"),
        "bigquery": ("google-cloud-bigquery", "pip install truthound[bigquery]"),
        "snowflake": ("snowflake-connector-python", "pip install truthound[snowflake]"),
        "redshift": ("redshift-connector", "pip install truthound[redshift]"),
        "databricks": ("databricks-sql-connector", "pip install truthound[databricks]"),
        "duckdb": ("duckdb", "pip install duckdb"),
    }
    for prefix, (pkg, install_cmd) in hints.items():
        if prefix in conn_lower:
            raise DataSourceError(
                f"Missing driver for {prefix}: {error}",
                source_type=prefix,
                hint=f"Install with: {install_cmd}",
            )
    raise DataSourceError(
        f"Missing driver: {error}",
        hint="Check that the required database driver is installed.",
    )
"""Read command - Read and preview data from various sources.

This module implements the ``truthound read`` command for loading,
inspecting, and exporting data from files and database connections.
"""
from __future__ import annotations

from pathlib import Path
from typing import TYPE_CHECKING, Annotated, Optional

import typer

from truthound.cli_modules.common.datasource import (
    ConnectionOpt,
    QueryOpt,
    SourceConfigOpt,
    SourceNameOpt,
    TableOpt,
    resolve_datasource,
)
from truthound.cli_modules.common.errors import error_boundary
from truthound.cli_modules.common.options import parse_list_callback

if TYPE_CHECKING:
    import polars as pl

@error_boundary
def read_cmd(
    file: Annotated[
        Optional[Path],
        typer.Argument(
            help="Path to the data file (CSV, JSON, Parquet, NDJSON)",
        ),
    ] = None,
    # -- DataSource Options --
    connection: ConnectionOpt = None,
    table: TableOpt = None,
    query: QueryOpt = None,
    source_config: SourceConfigOpt = None,
    source_name: SourceNameOpt = None,
    # -- Row Selection --
    sample: Annotated[
        Optional[int],
        typer.Option(
            "--sample",
            "-s",
            help="Return a random sample of N rows",
            min=1,
        ),
    ] = None,
    head: Annotated[
        Optional[int],
        typer.Option(
            "--head",
            "-n",
            help="Show only the first N rows",
            min=1,
        ),
    ] = None,
    # -- Column Selection --
    columns: Annotated[
        Optional[list[str]],
        typer.Option(
            "--columns",
            "-c",
            help="Columns to include (comma-separated)",
        ),
    ] = None,
    # -- Output Options --
    format: Annotated[
        str,
        typer.Option(
            "--format",
            "-f",
            help="Output format (table, csv, json, parquet, ndjson)",
        ),
    ] = "table",
    output: Annotated[
        Optional[Path],
        typer.Option("--output", "-o", help="Output file path"),
    ] = None,
    # -- Inspection Modes --
    schema_only: Annotated[
        bool,
        typer.Option(
            "--schema-only",
            help="Show only column names and types (no data loaded)",
        ),
    ] = False,
    count_only: Annotated[
        bool,
        typer.Option(
            "--count-only",
            help="Show only the row count",
        ),
    ] = False,
) -> None:
    """Read and preview data from files or databases.

    Load data from various sources and display a preview, export to
    another format, or inspect the schema. Supports files (CSV, Parquet,
    JSON) and SQL databases via --connection.

    Examples:
        truthound read data.csv
        truthound read data.parquet --head 20
        truthound read data.csv --format json -o output.json
        truthound read data.csv --columns id,name,age
        truthound read --connection "postgresql://user:pass@host/db" --table users
        truthound read --connection "sqlite:///data.db" --table orders --head 10
        truthound read --source-config db.yaml --sample 1000
        truthound read data.csv --schema-only
        truthound read data.csv --count-only
    """
    import polars as pl

    # Resolve data source
    data_path, source = resolve_datasource(
        file=file,
        connection=connection,
        table=table,
        query=query,
        source_config=source_config,
        source_name=source_name,
    )

    # Load data as LazyFrame
    if source is not None:
        lf = source.to_polars_lazyframe()
        label = source.name
    else:
        from truthound.adapters import to_lazyframe

        lf = to_lazyframe(data_path)
        label = data_path

    # Schema-only mode: no data collection needed
    if schema_only:
        schema = lf.collect_schema()
        typer.echo(f"Source: {label}")
        typer.echo(f"Columns: {len(schema)}\n")
        typer.echo(f"{'Column':<40} {'Type':<20}")
        typer.echo("-" * 60)
        for col_name, col_type in schema.items():
            typer.echo(f"{col_name:<40} {str(col_type):<20}")
        return

    # Count-only mode: minimal collection
    if count_only:
        row_count = lf.select(pl.len()).collect().item()
        typer.echo(f"Source: {label}")
        typer.echo(f"Rows: {row_count:,}")
        return

    # Collect data
    df = lf.collect()

    # Column selection
    column_list = parse_list_callback(columns) if columns else None
    if column_list:
        available = set(df.columns)
        missing = [c for c in column_list if c not in available]
        if missing:
            typer.echo(
                f"Warning: columns not found: {', '.join(missing)}", err=True
            )
        valid_cols = [c for c in column_list if c in available]
        if valid_cols:
            df = df.select(valid_cols)

    # Row selection
    if sample is not None and len(df) > sample:
        df = df.sample(n=sample, seed=42)
    if head is not None:
        df = df.head(head)

    # Output
    if format == "parquet" and output is None:
        typer.echo(
            "Error: --output is required for parquet format", err=True
        )
        raise typer.Exit(1)
    if output:
        _write_output(df, output, format)
        typer.echo(f"Data written to {output} ({len(df):,} rows)")
    else:
        _print_output(df, format, label)

def _write_output(df: "pl.DataFrame", output: Path, fmt: str) -> None:
    """Write DataFrame to a file in the specified format."""
    suffix = output.suffix.lower()
    fmt_lower = fmt.lower()
    if fmt_lower == "parquet" or suffix == ".parquet":
        df.write_parquet(output)
    elif fmt_lower == "csv" or suffix == ".csv":
        df.write_csv(output)
    elif fmt_lower == "json" or suffix == ".json":
        df.write_json(output)
    elif fmt_lower == "ndjson" or suffix == ".ndjson":
        df.write_ndjson(output)
    else:
        # Default: CSV
        df.write_csv(output)

def _print_output(df: "pl.DataFrame", fmt: str, label: str | None) -> None:
    """Print DataFrame to stdout."""
    import polars as pl

    fmt_lower = fmt.lower()
    if fmt_lower == "json":
        typer.echo(df.write_json())
    elif fmt_lower == "csv":
        typer.echo(df.write_csv())
    elif fmt_lower == "ndjson":
        typer.echo(df.write_ndjson())
    else:
        # Table format: use Polars' built-in display
        if label:
            typer.echo(f"Source: {label}")
        typer.echo(f"Shape: {df.shape[0]:,} rows x {df.shape[1]} columns\n")
        with pl.Config(tbl_rows=50, tbl_cols=20, fmt_str_lengths=80):
            typer.echo(str(df))
"""Tests for DataSource support across CLI commands.

Verifies that scan, mask, profile, learn, and compare commands
correctly accept and pass through database connection options.
"""
from __future__ import annotations

import json
from pathlib import Path
from unittest.mock import MagicMock, patch

import polars as pl
import pytest
import typer
from typer.testing import CliRunner

from truthound.cli_modules.core.check import check_cmd
from truthound.cli_modules.core.scan import scan_cmd
from truthound.cli_modules.core.mask import mask_cmd
from truthound.cli_modules.core.profile import profile_cmd
from truthound.cli_modules.core.learn import learn_cmd
from truthound.cli_modules.core.compare import compare_cmd

@pytest.fixture
def runner():
    return CliRunner()

def _make_app(cmd, name):
    app = typer.Typer()
    app.command(name=name)(cmd)
    return app

@pytest.fixture
def sample_csv(tmp_path):
    csv = tmp_path / "data.csv"
    csv.write_text("id,name,age\n1,Alice,25\n2,Bob,30\n")
    return csv

def _mock_sql_source(table_name="users"):
    """Create a mock SQL data source returning a small DataFrame."""
    source = MagicMock()
    source.name = table_name
    lf = pl.LazyFrame({"id": [1, 2], "name": ["Alice", "Bob"], "age": [25, 30]})
    source.to_polars_lazyframe.return_value = lf
    return source

# =============================================================================
# Check with DataSource
# =============================================================================

class TestCheckWithDatasource:
    """Test check command accepts datasource options."""

    def test_check_with_connection(self, runner, sample_csv):
        """--connection passes source= to check API."""
        app = _make_app(check_cmd, "check")
        with (
            patch("truthound.datasources.factory.get_sql_datasource") as mock_sql,
            patch("truthound.api.check") as mock_check,
        ):
            mock_sql.return_value = _mock_sql_source()
            mock_report = MagicMock()
            mock_report.has_issues = False
            mock_report.exception_summary = None
            mock_check.return_value = mock_report
            result = runner.invoke(app, [
                "--connection", "postgresql://user:pass@host/db",
                "--table", "users",
            ])
            assert result.exit_code == 0
            # Verify source= was passed (not a data path)
            assert mock_check.call_args.kwargs.get("source") is not None

    def test_check_file_and_connection_mutually_exclusive(self, runner, sample_csv):
        """file + --connection raises error."""
        app = _make_app(check_cmd, "check")
        result = runner.invoke(app, [
            str(sample_csv),
            "--connection", "postgresql://host/db",
            "--table", "t",
        ])
        assert result.exit_code != 0

# =============================================================================
# Scan with DataSource
# =============================================================================

class TestScanWithDatasource:
    """Test scan command accepts datasource options."""

    def test_scan_with_connection(self, runner):
        """--connection passes source= to scan API."""
        app = _make_app(scan_cmd, "scan")
        with (
            patch("truthound.datasources.factory.get_sql_datasource") as mock_sql,
            patch("truthound.api.scan") as mock_scan,
        ):
            mock_sql.return_value = _mock_sql_source()
            mock_report = MagicMock()
            mock_scan.return_value = mock_report
            mock_report.print = MagicMock()
            result = runner.invoke(app, [
                "--connection", "postgresql://user:pass@host/db",
                "--table", "users",
            ])
            assert result.exit_code == 0
            mock_scan.assert_called_once()
            # scan(source=source)
            assert mock_scan.call_args.kwargs.get("source") is not None

    def test_scan_no_input_error(self, runner):
        """No input produces an error."""
        app = _make_app(scan_cmd, "scan")
        result = runner.invoke(app, [])
        assert result.exit_code != 0

# =============================================================================
# Mask with DataSource
# =============================================================================

class TestMaskWithDatasource:
    """Test mask command accepts datasource options."""

    def test_mask_with_connection(self, runner, tmp_path):
        """--connection passes source= to mask API."""
        app = _make_app(mask_cmd, "mask")
        out = tmp_path / "masked.csv"
        with (
            patch("truthound.datasources.factory.get_sql_datasource") as mock_sql,
            patch("truthound.api.mask") as mock_mask,
        ):
            mock_sql.return_value = _mock_sql_source()
            mock_df = pl.DataFrame({"id": [1, 2], "name": ["***", "***"]})
            mock_mask.return_value = mock_df
            result = runner.invoke(app, [
                "--connection", "postgresql://user:pass@host/db",
                "--table", "users",
                "--output", str(out),
            ])
            assert result.exit_code == 0
            assert out.exists()
            mock_mask.assert_called_once()

# =============================================================================
# Profile with DataSource
# =============================================================================

class TestProfileWithDatasource:
    """Test profile command accepts datasource options."""

    def test_profile_with_connection(self, runner):
        """--connection passes source= to profile API."""
        app = _make_app(profile_cmd, "profile")
        with (
            patch("truthound.datasources.factory.get_sql_datasource") as mock_sql,
            patch("truthound.api.profile") as mock_profile,
        ):
            mock_sql.return_value = _mock_sql_source()
            mock_report = MagicMock()
            mock_profile.return_value = mock_report
            mock_report.print = MagicMock()
            result = runner.invoke(app, [
                "--connection", "postgresql://user:pass@host/db",
                "--table", "users",
            ])
            assert result.exit_code == 0
            mock_profile.assert_called_once()
            assert mock_profile.call_args.kwargs.get("source") is not None

# =============================================================================
# Learn with DataSource
# =============================================================================

class TestLearnWithDatasource:
    """Test learn command accepts datasource options."""

    def test_learn_with_connection(self, runner, tmp_path):
        """--connection passes source= to learn API."""
        app = _make_app(learn_cmd, "learn")
        out = tmp_path / "schema.yaml"
        with (
            patch("truthound.datasources.factory.get_sql_datasource") as mock_sql,
            patch("truthound.schema.learn") as mock_learn,
        ):
            mock_sql.return_value = _mock_sql_source()
            mock_schema = MagicMock()
            mock_schema.columns = ["id", "name", "age"]
            mock_schema.row_count = 2
            mock_learn.return_value = mock_schema
            result = runner.invoke(app, [
                "--connection", "postgresql://user:pass@host/db",
                "--table", "users",
                "--output", str(out),
            ])
            assert result.exit_code == 0
            mock_learn.assert_called_once()
            assert mock_learn.call_args.kwargs.get("source") is not None

# =============================================================================
# Compare with DataSource Config
# =============================================================================
class TestCompareWithDatasource:
"""Test compare command accepts --source-config for dual sources."""
def test_compare_with_source_config(self, runner, tmp_path):
"""--source-config with baseline/current sections works."""
app = _make_app(compare_cmd, "compare")
cfg = tmp_path / "drift.yaml"
cfg.write_text(
"baseline:\n"
" connection: 'postgresql://host/db'\n"
" table: train\n"
"current:\n"
" connection: 'postgresql://host/db'\n"
" table: prod\n"
)
with (
patch("truthound.datasources.factory.get_sql_datasource") as mock_sql,
patch("truthound.drift.compare") as mock_compare,
):
source_b = MagicMock()
source_c = MagicMock()
lf_b = pl.LazyFrame({"x": [1, 2, 3]})
lf_c = pl.LazyFrame({"x": [4, 5, 6]})
source_b.to_polars_lazyframe.return_value = lf_b
source_c.to_polars_lazyframe.return_value = lf_c
mock_sql.side_effect = [source_b, source_c]
mock_report = MagicMock()
mock_report.has_drift = False
mock_compare.return_value = mock_report
mock_report.print = MagicMock()
result = runner.invoke(app, [
"--source-config", str(cfg),
])
assert result.exit_code == 0
mock_compare.assert_called_once()
def test_compare_files_still_works(self, runner, tmp_path):
"""Positional file arguments still work."""
app = _make_app(compare_cmd, "compare")
f1 = tmp_path / "base.csv"
f2 = tmp_path / "curr.csv"
f1.write_text("x\n1\n2\n3\n")
f2.write_text("x\n4\n5\n6\n")
with patch("truthound.drift.compare") as mock_compare:
mock_report = MagicMock()
mock_report.has_drift = False
mock_compare.return_value = mock_report
mock_report.print = MagicMock()
result = runner.invoke(app, [str(f1), str(f2)])
assert result.exit_code == 0
mock_compare.assert_called_once()
"""Tests for the shared DataSource resolution layer.
Tests cover: resolve_datasource(), resolve_compare_sources(),
parse_source_config(), create_datasource_from_config(), and
input validation logic.
"""
from __future__ import annotations
import json
import pytest
from pathlib import Path
from unittest.mock import MagicMock, patch
from truthound.cli_modules.common.datasource import (
create_datasource_from_config,
parse_source_config,
resolve_compare_sources,
resolve_datasource,
)
from truthound.cli_modules.common.errors import DataSourceError
# =============================================================================
# Fixtures
# =============================================================================
@pytest.fixture
def sample_csv(tmp_path):
"""Create a sample CSV file."""
csv = tmp_path / "data.csv"
csv.write_text("id,name\n1,Alice\n2,Bob\n")
return csv
@pytest.fixture
def source_config_json(tmp_path):
"""Create a JSON source config file."""
cfg = tmp_path / "source.json"
cfg.write_text(json.dumps({
"connection": "postgresql://user:pass@host:5432/db",
"table": "users",
}))
return cfg
@pytest.fixture
def source_config_yaml(tmp_path):
"""Create a YAML source config file."""
cfg = tmp_path / "source.yaml"
cfg.write_text(
"connection: 'postgresql://user:pass@host:5432/db'\n"
"table: users\n"
)
return cfg
@pytest.fixture
def compare_config_yaml(tmp_path):
"""Create a YAML compare config file with baseline/current sections."""
cfg = tmp_path / "compare.yaml"
cfg.write_text(
"baseline:\n"
" connection: 'postgresql://user:pass@host/db'\n"
" table: train_data\n"
"current:\n"
" connection: 'postgresql://user:pass@host/db'\n"
" table: prod_data\n"
)
return cfg
# =============================================================================
# resolve_datasource
# =============================================================================
class TestResolveDatasource:
"""Tests for resolve_datasource()."""
def test_file_only_returns_path(self, sample_csv):
"""File-only input returns (str_path, None)."""
data_path, source = resolve_datasource(file=sample_csv)
assert data_path == str(sample_csv)
assert source is None
def test_no_input_raises_error(self):
"""No input raises DataSourceError."""
with pytest.raises(DataSourceError, match="No data input specified"):
resolve_datasource()
def test_file_not_found_raises_error(self, tmp_path):
"""Non-existent file raises error."""
fake = tmp_path / "nonexistent.csv"
with pytest.raises(Exception):
resolve_datasource(file=fake)
def test_file_and_connection_mutually_exclusive(self, sample_csv):
"""Providing both file and connection raises error."""
with pytest.raises(DataSourceError, match="Conflicting"):
resolve_datasource(file=sample_csv, connection="postgresql://host/db")
def test_file_and_source_config_mutually_exclusive(self, sample_csv, source_config_json):
"""Providing both file and source_config raises error."""
with pytest.raises(DataSourceError, match="Conflicting"):
resolve_datasource(file=sample_csv, source_config=source_config_json)
def test_connection_and_source_config_mutually_exclusive(self, source_config_json):
"""Providing both connection and source_config raises error."""
with pytest.raises(DataSourceError, match="Conflicting"):
resolve_datasource(
connection="postgresql://host/db",
source_config=source_config_json,
)
def test_connection_without_table_raises_error(self):
"""Connection without table or query raises error."""
with pytest.raises(DataSourceError, match="--table or --query"):
resolve_datasource(connection="postgresql://user:pass@host/db")
@patch("truthound.datasources.factory.get_sql_datasource")
def test_connection_with_table_returns_source(self, mock_get_sql):
"""Connection + table returns (None, source)."""
mock_source = MagicMock()
mock_get_sql.return_value = mock_source
data_path, source = resolve_datasource(
connection="postgresql://user:pass@host/db",
table="users",
)
assert data_path is None
assert source is mock_source
mock_get_sql.assert_called_once_with(
"postgresql://user:pass@host/db", table="users", query=None
)
@patch("truthound.datasources.factory.get_sql_datasource")
def test_connection_with_query_returns_source(self, mock_get_sql):
"""Connection + query returns (None, source)."""
mock_source = MagicMock()
mock_get_sql.return_value = mock_source
data_path, source = resolve_datasource(
connection="postgresql://user:pass@host/db",
query="SELECT * FROM orders WHERE date > '2024-01-01'",
)
assert data_path is None
assert source is mock_source
@patch("truthound.cli_modules.common.datasource.create_datasource_from_config")
@patch("truthound.cli_modules.common.datasource.parse_source_config")
def test_source_config_returns_source(self, mock_parse, mock_create, source_config_json):
"""Source config file returns (None, source)."""
mock_config = {"connection": "postgresql://...", "table": "users"}
mock_parse.return_value = mock_config
mock_source = MagicMock()
mock_create.return_value = mock_source
data_path, source = resolve_datasource(source_config=source_config_json)
assert data_path is None
assert source is mock_source
mock_parse.assert_called_once_with(source_config_json)
mock_create.assert_called_once_with(mock_config)
@patch("truthound.datasources.factory.get_sql_datasource")
def test_source_name_applied(self, mock_get_sql):
"""--source-name is applied to the data source."""
mock_source = MagicMock()
mock_source.config = MagicMock()
mock_get_sql.return_value = mock_source
resolve_datasource(
connection="postgresql://host/db",
table="users",
source_name="my-label",
)
# source_name should have been set
assert mock_source.config.name == "my-label"
# =============================================================================
# resolve_compare_sources
# =============================================================================
class TestResolveCompareSources:
"""Tests for resolve_compare_sources()."""
def test_two_files_returns_paths(self, tmp_path):
"""Two file paths return ((path1, None), (path2, None))."""
f1 = tmp_path / "base.csv"
f2 = tmp_path / "curr.csv"
f1.write_text("a\n1\n")
f2.write_text("a\n2\n")
(bp, bs), (cp, cs) = resolve_compare_sources(baseline=f1, current=f2)
assert bp == str(f1) and bs is None
assert cp == str(f2) and cs is None
def test_missing_one_file_raises_error(self, tmp_path):
"""Only one file provided raises error."""
f1 = tmp_path / "base.csv"
f1.write_text("a\n1\n")
with pytest.raises(DataSourceError, match="Both baseline and current"):
resolve_compare_sources(baseline=f1)
def test_no_files_no_config_raises_error(self):
"""No arguments raises error."""
with pytest.raises(DataSourceError, match="Both baseline and current"):
resolve_compare_sources()
def test_files_and_config_raises_error(self, tmp_path, compare_config_yaml):
"""Files + config raises error."""
f1 = tmp_path / "base.csv"
f1.write_text("a\n1\n")
with pytest.raises(DataSourceError, match="Cannot specify both"):
resolve_compare_sources(baseline=f1, source_config=compare_config_yaml)
@patch("truthound.cli_modules.common.datasource.create_datasource_from_config")
@patch("truthound.cli_modules.common.datasource.parse_source_config")
def test_config_returns_dual_sources(self, mock_parse, mock_create, compare_config_yaml):
"""Config file with baseline/current returns two sources."""
mock_parse.return_value = {
"baseline": {"connection": "pg://...", "table": "train"},
"current": {"connection": "pg://...", "table": "prod"},
}
mock_source_b = MagicMock()
mock_source_c = MagicMock()
mock_create.side_effect = [mock_source_b, mock_source_c]
(bp, bs), (cp, cs) = resolve_compare_sources(source_config=compare_config_yaml)
assert bp is None and bs is mock_source_b
assert cp is None and cs is mock_source_c
@patch("truthound.cli_modules.common.datasource.parse_source_config")
def test_config_missing_baseline_raises_error(self, mock_parse, compare_config_yaml):
"""Config missing baseline section raises error."""
mock_parse.return_value = {"current": {"connection": "pg://...", "table": "t"}}
with pytest.raises(DataSourceError, match="baseline.*current"):
resolve_compare_sources(source_config=compare_config_yaml)
# =============================================================================
# parse_source_config
# =============================================================================
class TestParseSourceConfig:
"""Tests for parse_source_config()."""
def test_parse_json(self, tmp_path):
"""JSON config file is parsed correctly."""
cfg = tmp_path / "cfg.json"
cfg.write_text(json.dumps({"connection": "pg://host/db", "table": "t"}))
result = parse_source_config(cfg)
assert result["connection"] == "pg://host/db"
assert result["table"] == "t"
def test_parse_yaml(self, tmp_path):
"""YAML config file is parsed correctly."""
cfg = tmp_path / "cfg.yaml"
cfg.write_text("connection: 'pg://host/db'\ntable: t\n")
result = parse_source_config(cfg)
assert result["connection"] == "pg://host/db"
assert result["table"] == "t"
def test_parse_yml(self, tmp_path):
"""YML extension also works."""
cfg = tmp_path / "cfg.yml"
cfg.write_text("connection: 'pg://host/db'\ntable: t\n")
result = parse_source_config(cfg)
assert result["table"] == "t"
def test_invalid_json_raises_error(self, tmp_path):
"""Malformed JSON raises DataSourceError."""
cfg = tmp_path / "bad.json"
cfg.write_text("{invalid json}")
with pytest.raises(DataSourceError, match="Invalid JSON"):
parse_source_config(cfg)
def test_non_dict_raises_error(self, tmp_path):
"""Non-dict JSON content raises DataSourceError."""
cfg = tmp_path / "arr.json"
cfg.write_text('["a", "b"]')
with pytest.raises(DataSourceError, match="must be a JSON/YAML object"):
parse_source_config(cfg)
def test_unsupported_extension_raises_error(self, tmp_path):
"""Unsupported file extension raises DataSourceError."""
cfg = tmp_path / "cfg.toml"
cfg.write_text("[table]\nname = 'x'")
with pytest.raises(DataSourceError, match="Unsupported config file format"):
parse_source_config(cfg)
# =============================================================================
# create_datasource_from_config
# =============================================================================
class TestCreateDatasourceFromConfig:
"""Tests for create_datasource_from_config()."""
@patch("truthound.datasources.factory.get_sql_datasource")
def test_connection_string_style(self, mock_get_sql):
"""Config with 'connection' delegates to get_sql_datasource."""
mock_source = MagicMock()
mock_get_sql.return_value = mock_source
result = create_datasource_from_config({
"connection": "postgresql://host/db",
"table": "users",
})
assert result is mock_source
def test_connection_without_table_raises_error(self):
"""Config with 'connection' but no 'table' raises error."""
with pytest.raises(DataSourceError, match="requires 'table' or 'query'"):
create_datasource_from_config({"connection": "postgresql://host/db"})
def test_no_connection_no_type_raises_error(self):
"""Config without 'connection' or 'type' raises error."""
with pytest.raises(DataSourceError, match="must have either"):
create_datasource_from_config({"table": "users"})
@patch("truthound.datasources.sql.get_available_sources")
def test_type_not_available_raises_error(self, mock_available):
"""Unavailable type raises DataSourceError with available list."""
mock_available.return_value = {"postgresql": MagicMock(), "mysql": MagicMock()}
with pytest.raises(DataSourceError, match="not available"):
create_datasource_from_config({
"type": "oracle",
"table": "users",
"host": "localhost",
})
@patch("truthound.datasources.sql.get_available_sources")
def test_type_style_creates_source(self, mock_available):
"""Config with 'type' constructs from source class."""
mock_cls = MagicMock()
mock_source = MagicMock()
mock_cls.return_value = mock_source
mock_available.return_value = {"postgresql": mock_cls}
result = create_datasource_from_config({
"type": "postgresql",
"table": "users",
"host": "localhost",
"database": "mydb",
})
assert result is mock_source
mock_cls.assert_called_once_with(
table="users", host="localhost", database="mydb"
)
"""Tests for the ``truthound read`` CLI command."""
from __future__ import annotations
import json
import pytest
from pathlib import Path
from unittest.mock import patch, MagicMock
import typer
from typer.testing import CliRunner
from truthound.cli_modules.core.read import read_cmd
@pytest.fixture
def runner():
return CliRunner()
@pytest.fixture
def app():
_app = typer.Typer()
_app.command(name="read")(read_cmd)
return _app
@pytest.fixture
def sample_csv(tmp_path):
csv = tmp_path / "data.csv"
csv.write_text(
"id,name,age,city\n"
"1,Alice,25,NYC\n"
"2,Bob,30,LA\n"
"3,Charlie,35,Chicago\n"
"4,Diana,40,Boston\n"
"5,Eve,28,Seattle\n"
)
return csv
@pytest.fixture
def sample_json(tmp_path):
jf = tmp_path / "data.json"
data = [
{"id": 1, "name": "Alice"},
{"id": 2, "name": "Bob"},
]
jf.write_text(json.dumps(data))
return jf
# =============================================================================
# Basic Read
# =============================================================================
class TestReadBasic:
"""Basic file reading tests."""
def test_read_csv(self, runner, app, sample_csv):
"""Read CSV file outputs data."""
result = runner.invoke(app, [str(sample_csv)])
assert result.exit_code == 0
assert "5 rows" in result.output or "Shape" in result.output
def test_read_no_input_error(self, runner, app):
"""No input produces an error."""
result = runner.invoke(app, [])
assert result.exit_code != 0
def test_read_nonexistent_file_error(self, runner, app, tmp_path):
"""Non-existent file produces an error."""
fake = tmp_path / "missing.csv"
result = runner.invoke(app, [str(fake)])
assert result.exit_code != 0
# =============================================================================
# Row/Column Selection
# =============================================================================
class TestReadSelection:
"""Row and column selection tests."""
def test_head(self, runner, app, sample_csv):
"""--head limits rows."""
result = runner.invoke(app, [str(sample_csv), "--head", "2"])
assert result.exit_code == 0
assert "2 rows" in result.output or "Shape: 2" in result.output
def test_columns(self, runner, app, sample_csv):
"""--columns selects specific columns."""
result = runner.invoke(app, [str(sample_csv), "--columns", "id,name"])
assert result.exit_code == 0
assert "2 columns" in result.output or "x 2" in result.output
def test_columns_missing_warns(self, runner, app, sample_csv):
"""Missing columns produce a warning."""
result = runner.invoke(app, [str(sample_csv), "--columns", "id,nonexistent"])
assert result.exit_code == 0
assert "not found" in result.output
def test_head_and_columns(self, runner, app, sample_csv):
"""--head and --columns together."""
result = runner.invoke(app, [str(sample_csv), "--head", "3", "--columns", "name,age"])
assert result.exit_code == 0
def test_sample(self, runner, app, sample_csv):
"""--sample returns subset."""
result = runner.invoke(app, [str(sample_csv), "--sample", "2"])
assert result.exit_code == 0
assert "2 rows" in result.output or "Shape: 2" in result.output
# =============================================================================
# Inspection Modes
# =============================================================================
class TestReadInspection:
"""Schema-only and count-only mode tests."""
def test_schema_only(self, runner, app, sample_csv):
"""--schema-only shows column names and types."""
result = runner.invoke(app, [str(sample_csv), "--schema-only"])
assert result.exit_code == 0
assert "Column" in result.output
assert "Type" in result.output
assert "id" in result.output
assert "name" in result.output
def test_count_only(self, runner, app, sample_csv):
"""--count-only shows just the row count."""
result = runner.invoke(app, [str(sample_csv), "--count-only"])
assert result.exit_code == 0
assert "Rows:" in result.output
assert "5" in result.output
# =============================================================================
# Output Formats
# =============================================================================
class TestReadFormats:
"""Output format tests."""
def test_format_csv(self, runner, app, sample_csv):
"""--format csv outputs CSV text."""
result = runner.invoke(app, [str(sample_csv), "--format", "csv", "--head", "2"])
assert result.exit_code == 0
assert "id,name,age,city" in result.output
def test_format_json(self, runner, app, sample_csv):
"""--format json outputs valid JSON."""
result = runner.invoke(app, [str(sample_csv), "--format", "json", "--head", "2"])
assert result.exit_code == 0
data = json.loads(result.output)
# Polars write_json output is valid JSON (format may vary by version)
assert isinstance(data, (dict, list))
def test_format_ndjson(self, runner, app, sample_csv):
"""--format ndjson outputs newline-delimited JSON."""
result = runner.invoke(app, [str(sample_csv), "--format", "ndjson", "--head", "2"])
assert result.exit_code == 0
lines = [l for l in result.output.strip().split("\n") if l.strip()]
assert len(lines) == 2
def test_parquet_requires_output(self, runner, app, sample_csv):
"""--format parquet without --output is an error."""
result = runner.invoke(app, [str(sample_csv), "--format", "parquet"])
assert result.exit_code == 1
assert "required" in result.output.lower()
# =============================================================================
# Output File
# =============================================================================
class TestReadOutput:
"""Output file tests."""
def test_output_csv(self, runner, app, sample_csv, tmp_path):
"""--output writes CSV file."""
out = tmp_path / "out.csv"
result = runner.invoke(app, [str(sample_csv), "--output", str(out), "--head", "3"])
assert result.exit_code == 0
assert out.exists()
assert "written to" in result.output
content = out.read_text()
assert "id" in content
def test_output_json(self, runner, app, sample_csv, tmp_path):
"""--output with json format writes JSON file."""
out = tmp_path / "out.json"
result = runner.invoke(app, [
str(sample_csv), "--output", str(out), "--format", "json", "--head", "2",
])
assert result.exit_code == 0
assert out.exists()
def test_output_parquet(self, runner, app, sample_csv, tmp_path):
"""--output with parquet format writes Parquet file."""
out = tmp_path / "out.parquet"
result = runner.invoke(app, [
str(sample_csv), "--output", str(out), "--format", "parquet",
])
assert result.exit_code == 0
assert out.exists()
assert out.stat().st_size > 0
# =============================================================================
# DataSource Integration (mocked)
# =============================================================================
class TestReadWithConnection:
"""Test read command with mocked database connection."""
@patch("truthound.datasources.factory.get_sql_datasource")
def test_read_with_connection(self, mock_get_sql, runner, app):
"""--connection + --table uses DataSource."""
import polars as pl
mock_source = MagicMock()
mock_source.name = "test_table"
mock_lf = pl.LazyFrame({"id": [1, 2], "name": ["a", "b"]})
mock_source.to_polars_lazyframe.return_value = mock_lf
mock_get_sql.return_value = mock_source
result = runner.invoke(app, [
"--connection", "postgresql://user:pass@host/db",
"--table", "users",
])
assert result.exit_code == 0
assert "2 rows" in result.output or "Shape" in result.output
@patch("truthound.datasources.factory.get_sql_datasource")
def test_read_schema_only_with_connection(self, mock_get_sql, runner, app):
"""--schema-only works with database source."""
import polars as pl
mock_source = MagicMock()
mock_source.name = "test_table"
mock_lf = pl.LazyFrame({"id": [1], "name": ["a"]})
mock_source.to_polars_lazyframe.return_value = mock_lf
mock_get_sql.return_value = mock_source
result = runner.invoke(app, [
"--connection", "postgresql://user:pass@host/db",
"--table", "users",
"--schema-only",
])
assert result.exit_code == 0
assert "id" in result.output
assert "name" in result.output

@@ -8,3 +8,3 @@ # truthound check

```bash
truthound check <file> [OPTIONS]
truthound check [FILE] [OPTIONS]
```

@@ -16,4 +16,14 @@

|----------|----------|-------------|
| `file` | Yes | Path to the data file (CSV, JSON, Parquet, NDJSON, JSONL) |
| `file` | No | Path to the data file (CSV, JSON, Parquet, NDJSON, JSONL) |
## Data Source Options
| Option | Short | Default | Description |
|--------|-------|---------|-------------|
| `--connection` | `--conn` | None | Database connection string |
| `--table` | | None | Database table name |
| `--query` | | None | SQL query (alternative to `--table`) |
| `--source-config` | `--sc` | None | Path to data source config file (JSON/YAML) |
| `--source-name` | | None | Custom label for the data source |
## Options

@@ -42,2 +52,5 @@

| `--max-unexpected-rows` | | `1000` | Maximum number of unexpected rows to include |
| `--partial-unexpected-count` | | `20` | Maximum number of unexpected values in partial list (BASIC+) |
| `--include-unexpected-index` | | `false` | Include row index for each unexpected value in results |
| `--return-debug-query` | | `false` | Include Polars debug query expression in results (COMPLETE level) |

@@ -52,2 +65,11 @@ ### Exception Handling Options (VE-5)

### Execution Options
| Option | Short | Default | Description |
|--------|-------|---------|-------------|
| `--parallel` / `--no-parallel` | | `false` | Enable DAG-based parallel execution with dependency-aware scheduling |
| `--max-workers` | | Auto | Maximum worker threads (only with `--parallel`). Defaults to `min(32, cpu_count + 4)` |
| `--pushdown` / `--no-pushdown` | | Auto | Enable query pushdown for SQL data sources. Auto-detects by default |
| `--use-engine` / `--no-use-engine` | | `false` | Use execution engine for validation (experimental) |
## Description

@@ -149,2 +171,73 @@

### Parallel Execution
Enable DAG-based parallel execution for large validator sets:
```bash
# Enable parallel execution with automatic worker count
truthound check data.csv --parallel
# Control the number of worker threads
truthound check data.csv --parallel --max-workers 8
# Combine with other options
truthound check data.csv --parallel --max-workers 4 --rf summary --strict
```
Validators are organized into dependency levels (Schema → Completeness → Uniqueness → Distribution → Referential) and executed concurrently within each level.
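The level-based scheduling described above can be sketched in a few lines of Python. This is an illustrative model only (the level and validator names below are invented, not truthound's internals): validators within a level run concurrently, and each level acts as a barrier before the next, dependent level starts.

```python
# Hypothetical level layout mirroring the documented order:
# Schema → Completeness → Uniqueness → Distribution → Referential.
from concurrent.futures import ThreadPoolExecutor

LEVELS = [
    ["schema"],
    ["null", "completeness"],
    ["unique"],
    ["distribution"],
    ["referential"],
]

def run_validator(name: str) -> str:
    # Stand-in for a real validator run; returns a result marker.
    return f"{name}: ok"

def run_levels(levels, max_workers=4):
    results = []
    for level in levels:
        # The `with` block joins all workers, so a level completes
        # before the next (dependent) level begins.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            results.extend(pool.map(run_validator, level))
    return results

print(run_levels(LEVELS))
```

Because `pool.map` preserves input order, results come back level by level even though validators inside a level ran in parallel.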
### Advanced Result Format Control
Fine-tune the detail level of validation results:
```bash
# Control partial unexpected list size
truthound check data.csv --rf basic --partial-unexpected-count 50
# Include row indices for unexpected values
truthound check data.csv --rf summary --include-unexpected-index
# Include Polars debug query in results (for troubleshooting)
truthound check data.csv --rf complete --return-debug-query
# All fine-grained options combined
truthound check data.csv --rf complete \
--include-unexpected-rows \
--max-unexpected-rows 500 \
--partial-unexpected-count 100 \
--include-unexpected-index \
--return-debug-query
```
### Database Validation
Validate data directly from a database connection:
```bash
# Validate a PostgreSQL table
truthound check --connection "postgresql://user:pass@host/db" --table users
# Validate with a SQL query
truthound check --connection "sqlite:///data.db" --query "SELECT * FROM orders WHERE status = 'active'"
# Validate using a source config file
truthound check --source-config db_config.yaml --strict
# Combine with other options
truthound check --connection "postgresql://user:pass@host/db" --table users \
-v null,unique --rf summary --strict
```
### Query Pushdown
For SQL data sources, query pushdown runs validation on the database server instead of loading all rows locally:
```bash
# Auto-detect pushdown capability
truthound check data.csv --pushdown
# Explicitly disable pushdown
truthound check data.csv --no-pushdown
```
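To make the idea concrete, here is a minimal illustration of what pushdown buys you, using Python's built-in `sqlite3` (the table and column names are invented for the example): the null count is computed inside the database, so only a single aggregate value is transferred rather than every row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "a@example.com"), (2, None), (3, "c@example.com")],
)

# Server-side aggregation: the database evaluates the predicate,
# and only the count crosses the connection.
(null_count,) = conn.execute(
    "SELECT COUNT(*) FROM users WHERE email IS NULL"
).fetchone()
print(null_count)
```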
### Exception Handling (VE-5)

@@ -151,0 +244,0 @@

@@ -18,2 +18,9 @@ # truthound compare

## Data Source Options
| Option | Short | Default | Description |
|--------|-------|---------|-------------|
| `--source-config` | `--sc` | None | Path to data source config file (JSON/YAML) for dual-source comparison |
| `--source-name` | | None | Custom label for the data source |
## Options

@@ -26,2 +33,3 @@

| `--threshold` | `-t` | Auto | Custom drift threshold |
| `--sample-size` | `--sample` | None | Sample size for large datasets (random sampling) |
| `--format` | `-f` | `console` | Output format (console, json) |

@@ -133,2 +141,15 @@ | `--output` | `-o` | None | Output file path |

### Large Dataset Sampling
For large datasets, use sampling for faster comparison:
```bash
# Sample 10,000 rows from each dataset
truthound compare big_train.csv big_prod.csv --sample-size 10000
# Combine with method and threshold
truthound compare large_baseline.parquet large_current.parquet \
--sample-size 50000 --method psi --threshold 0.15
```
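Random sampling trades a little statistical precision for a large speedup. The sketch below shows the sampling contract using only the standard library (stand-in data, not the compare implementation): rows are drawn uniformly without replacement, so each dataset is compared on a representative subset.

```python
import random

rows = list(range(100_000))  # stand-in for one dataset's rows
random.seed(0)               # fix the seed for reproducible runs
sample = random.sample(rows, k=10_000)  # uniform, without replacement
print(len(sample))
```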
### Custom Threshold

@@ -135,0 +156,0 @@

@@ -14,2 +14,3 @@ # Core Commands

| [`profile`](profile.md) | Generate data profile | Data exploration |
| [`read`](read.md) | Read and preview data | Data inspection |
| [`compare`](compare.md) | Detect data drift | Model monitoring |

@@ -114,2 +115,27 @@

## Data Source Options
All core commands accept data source options for reading directly from databases instead of files. When using these options, the file argument becomes optional.
| Option | Short | Description |
|--------|-------|-------------|
| `--connection` | `--conn` | Database connection string (e.g., `postgresql://user:pass@host/db`) |
| `--table` | | Database table name |
| `--query` | | SQL query (alternative to `--table`) |
| `--source-config` | `--sc` | Path to a data source config file (JSON/YAML) |
| `--source-name` | | Custom label for the data source |
```bash
# Validate a database table directly
truthound check --connection "postgresql://user:pass@host/db" --table users --strict
# Profile from a source config file
truthound profile --source-config prod_db.yaml
# Read and preview database data
truthound read --connection "sqlite:///data.db" --table orders --head 20
```
For full details on connection string formats, config files, and security best practices, see the [CLI Data Source Guide](../../guides/datasources/cli-datasource-guide.md).
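The shapes accepted by `--source-config` can be summarized with two small examples. These mirror the configs exercised in the test suite (connection strings are placeholders, and the `looks_valid` helper is illustrative, not part of truthound): a single-source config pairs `connection` with `table` or `query`, while a compare config nests one such object under `baseline` and another under `current`.

```python
single = {
    "connection": "postgresql://user:pass@host:5432/db",
    "table": "users",
}
compare = {
    "baseline": {"connection": "postgresql://host/db", "table": "train_data"},
    "current": {"connection": "postgresql://host/db", "table": "prod_data"},
}

def looks_valid(cfg) -> bool:
    # A config must be a mapping; a connection-style entry also needs
    # a table or query to read from.
    if not isinstance(cfg, dict):
        return False
    if "connection" in cfg:
        return "table" in cfg or "query" in cfg
    return all(k in cfg and looks_valid(cfg[k]) for k in ("baseline", "current"))

print(looks_valid(single), looks_valid(compare))
```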
## CI/CD Integration

@@ -130,2 +156,3 @@

- [read](read.md) - Read and preview data
- [learn](learn.md) - Learn schema from data

@@ -132,0 +159,0 @@ - [check](check.md) - Validate data quality

@@ -8,3 +8,3 @@ # truthound learn

```bash
truthound learn <file> [OPTIONS]
truthound learn [FILE] [OPTIONS]
```

@@ -16,4 +16,14 @@

|----------|----------|-------------|
| `file` | Yes | Path to the data file (CSV, JSON, Parquet, NDJSON, JSONL) |
| `file` | No | Path to the data file (CSV, JSON, Parquet, NDJSON, JSONL) |
## Data Source Options
| Option | Short | Default | Description |
|--------|-------|---------|-------------|
| `--connection` | `--conn` | None | Database connection string |
| `--table` | | None | Database table name |
| `--query` | | None | SQL query (alternative to `--table`) |
| `--source-config` | `--sc` | None | Path to data source config file (JSON/YAML) |
| `--source-name` | | None | Custom label for the data source |
## Options

@@ -25,2 +35,3 @@

| `--no-constraints` | | `false` | Don't infer constraints from data |
| `--categorical-threshold` | | `20` | Maximum unique values to treat a column as categorical |

@@ -73,2 +84,19 @@ ## Description

### Categorical Threshold
Control when columns are treated as categorical:
```bash
# Default: columns with <= 20 unique values are categorical
truthound learn data.csv
# Higher threshold: treat columns with up to 50 unique values as categorical
truthound learn data.csv --categorical-threshold 50
# Lower threshold: only truly low-cardinality columns
truthound learn data.csv --categorical-threshold 5
```
Columns classified as categorical will have `allowed_values` in the generated schema, enabling strict enum validation.
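The threshold rule itself is simple enough to sketch. The snippet below is a simplified illustration, not the actual inference code: count a column's distinct values and, at or below the threshold, emit them as the allowed set.

```python
def infer_categorical(values, threshold=20):
    # Columns at or below the unique-value threshold become categorical,
    # and their distinct values form the allowed set for enum validation.
    uniques = sorted(set(values))
    if len(uniques) <= threshold:
        return {"categorical": True, "allowed_values": uniques}
    return {"categorical": False}

print(infer_categorical(["NYC", "LA", "Chicago", "NYC", "Boston"]))
print(infer_categorical(range(1000)))  # high cardinality, not categorical
```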
### From Different File Formats

@@ -75,0 +103,0 @@

@@ -8,3 +8,3 @@ # truthound mask

```bash
truthound mask <file> -o <output> [OPTIONS]
truthound mask [FILE] -o <output> [OPTIONS]
```

@@ -16,4 +16,14 @@

|----------|----------|-------------|
| `file` | Yes | Path to the data file (CSV, JSON, Parquet, NDJSON, JSONL) |
| `file` | No | Path to the data file (CSV, JSON, Parquet, NDJSON, JSONL) |
## Data Source Options
| Option | Short | Default | Description |
|--------|-------|---------|-------------|
| `--connection` | `--conn` | None | Database connection string |
| `--table` | | None | Database table name |
| `--query` | | None | SQL query (alternative to `--table`) |
| `--source-config` | `--sc` | None | Path to data source config file (JSON/YAML) |
| `--source-name` | | None | Custom label for the data source |
## Options

@@ -20,0 +30,0 @@

@@ -8,3 +8,3 @@ # truthound profile

```bash
truthound profile <file> [OPTIONS]
truthound profile [FILE] [OPTIONS]
```

@@ -16,4 +16,14 @@

|----------|----------|-------------|
| `file` | Yes | Path to the data file (CSV, JSON, Parquet, NDJSON, JSONL) |
| `file` | No | Path to the data file (CSV, JSON, Parquet, NDJSON, JSONL) |
## Data Source Options
| Option | Short | Default | Description |
|--------|-------|---------|-------------|
| `--connection` | `--conn` | None | Database connection string |
| `--table` | | None | Database table name |
| `--query` | | None | SQL query (alternative to `--table`) |
| `--source-config` | `--sc` | None | Path to data source config file (JSON/YAML) |
| `--source-name` | | None | Custom label for the data source |
## Options

@@ -20,0 +30,0 @@

@@ -8,3 +8,3 @@ # truthound scan

```bash
truthound scan <file> [OPTIONS]
truthound scan [FILE] [OPTIONS]
```

@@ -16,4 +16,14 @@

|----------|----------|-------------|
| `file` | Yes | Path to the data file (CSV, JSON, Parquet, NDJSON, JSONL) |
| `file` | No | Path to the data file (CSV, JSON, Parquet, NDJSON, JSONL) |
## Data Source Options
| Option | Short | Default | Description |
|--------|-------|---------|-------------|
| `--connection` | `--conn` | None | Database connection string |
| `--table` | | None | Database table name |
| `--query` | | None | SQL query (alternative to `--table`) |
| `--source-config` | `--sc` | None | Path to data source config file (JSON/YAML) |
| `--source-name` | | None | Custom label for the data source |
## Options

@@ -20,0 +30,0 @@

Metadata-Version: 2.4
Name: truthound
Version: 1.3.2
Version: 1.5.0
Summary: Zero-Configuration Data Quality Framework Powered by Polars

@@ -145,3 +145,3 @@ Project-URL: Homepage, https://github.com/seadonggyun4/Truthound

| Test Cases | 8,585+ |
| Validators | 264 |
| Validators | 289 |
| Validator Categories | 28 |

@@ -208,6 +208,17 @@ | VE Test Cases | 316 (Validation Engine Enhancement) |

truthound check data.csv --catch-exceptions --max-retries 2 # Resilient mode
truthound check data.csv --parallel --max-workers 8 # DAG parallel execution
truthound check data.csv --return-debug-query --rf complete # Debug query output
truthound compare baseline.csv current.csv # Drift detection
truthound compare big.csv new.csv --sample-size 10000 # Sampled comparison
truthound learn data.csv --categorical-threshold 50 # Custom threshold
truthound scan data.csv # PII scanning
truthound auto-profile data.csv # Profiling
truthound new validator my_validator # Code scaffolding
# Database connections (all core commands support --connection/--table)
truthound check --connection "postgresql://user:pass@host/db" --table users
truthound scan --connection "sqlite:///data.db" --table orders
truthound read --connection "postgresql://host/db" --table users --head 20
truthound read data.csv --schema-only # Inspect schema
truthound compare --source-config drift.yaml # Dual-source drift detection
```

@@ -223,9 +234,12 @@

|---------|-------------|-------------|
| `learn` | Learn schema from data | `--output`, `--no-constraints` |
| `check` | Validate data quality | `--validators`, `--exclude-columns`, `--validator-config`, `--min-severity`, `--schema`, `--strict`, `--format`, `--rf`, `--catch-exceptions`, `--max-retries` |
| `read` | Read and preview data | `--head`, `--sample`, `--columns`, `--schema-only`, `--count-only`, `--format` |
| `learn` | Learn schema from data | `--output`, `--no-constraints`, `--categorical-threshold` |
| `check` | Validate data quality | `--validators`, `--exclude-columns`, `--validator-config`, `--min-severity`, `--schema`, `--strict`, `--format`, `--rf`, `--catch-exceptions`, `--max-retries`, `--parallel`, `--max-workers`, `--pushdown`, `--partial-unexpected-count`, `--return-debug-query`, `--include-unexpected-index` |
| `scan` | Scan for PII | `--format`, `--output` |
| `mask` | Mask sensitive data | `--columns`, `--strategy` (redact/hash/fake), `--strict` |
| `profile` | Generate data profile | `--format`, `--output` |
| `compare` | Detect data drift | `--method` (auto/ks/psi/chi2/js), `--threshold`, `--strict` |
| `compare` | Detect data drift | `--method` (14 methods), `--threshold`, `--sample-size`, `--strict` |
All core commands accept **Data Source Options**: `--connection`/`--conn`, `--table`, `--query`, `--source-config`/`--sc`, `--source-name` for database connectivity (PostgreSQL, MySQL, SQLite, DuckDB, SQL Server, etc.).
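As a runnable stand-in for the `--connection`/`--table` workflow, the following shows what a `sqlite://` read ultimately resolves to, using only Python's stdlib `sqlite3` module. It does not call Truthound itself, and the table and data are invented for illustration:

```python
import sqlite3

# In-memory database standing in for "sqlite:///data.db"
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 12.0)])

# Rough equivalent of:
#   truthound read --connection "sqlite:///data.db" --table orders --head 20
rows = conn.execute("SELECT * FROM orders LIMIT 20").fetchall()
print(rows)  # [(1, 9.5), (2, 12.0)]
```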
### Profiler Commands

@@ -232,0 +246,0 @@

@@ -7,3 +7,3 @@ [build-system]

name = "truthound"
version = "1.3.2"
version = "1.5.0"
description = "Zero-Configuration Data Quality Framework Powered by Polars"

@@ -10,0 +10,0 @@ readme = "README.md"

@@ -48,3 +48,3 @@ <div align="center">

| Test Cases | 8,585+ |
| Validators | 264 |
| Validators | 289 |
| Validator Categories | 28 |

@@ -111,6 +111,17 @@ | VE Test Cases | 316 (Validation Engine Enhancement) |

truthound check data.csv --catch-exceptions --max-retries 2 # Resilient mode
truthound check data.csv --parallel --max-workers 8 # DAG parallel execution
truthound check data.csv --return-debug-query --rf complete # Debug query output
truthound compare baseline.csv current.csv # Drift detection
truthound compare big.csv new.csv --sample-size 10000 # Sampled comparison
truthound learn data.csv --categorical-threshold 50 # Custom threshold
truthound scan data.csv # PII scanning
truthound auto-profile data.csv # Profiling
truthound new validator my_validator # Code scaffolding
# Database connections (all core commands support --connection/--table)
truthound check --connection "postgresql://user:pass@host/db" --table users
truthound scan --connection "sqlite:///data.db" --table orders
truthound read --connection "postgresql://host/db" --table users --head 20
truthound read data.csv --schema-only # Inspect schema
truthound compare --source-config drift.yaml # Dual-source drift detection
```

@@ -126,9 +137,12 @@

|---------|-------------|-------------|
| `learn` | Learn schema from data | `--output`, `--no-constraints` |
| `check` | Validate data quality | `--validators`, `--exclude-columns`, `--validator-config`, `--min-severity`, `--schema`, `--strict`, `--format`, `--rf`, `--catch-exceptions`, `--max-retries` |
| `read` | Read and preview data | `--head`, `--sample`, `--columns`, `--schema-only`, `--count-only`, `--format` |
| `learn` | Learn schema from data | `--output`, `--no-constraints`, `--categorical-threshold` |
| `check` | Validate data quality | `--validators`, `--exclude-columns`, `--validator-config`, `--min-severity`, `--schema`, `--strict`, `--format`, `--rf`, `--catch-exceptions`, `--max-retries`, `--parallel`, `--max-workers`, `--pushdown`, `--partial-unexpected-count`, `--return-debug-query`, `--include-unexpected-index` |
| `scan` | Scan for PII | `--format`, `--output` |
| `mask` | Mask sensitive data | `--columns`, `--strategy` (redact/hash/fake), `--strict` |
| `profile` | Generate data profile | `--format`, `--output` |
| `compare` | Detect data drift | `--method` (auto/ks/psi/chi2/js), `--threshold`, `--strict` |
| `compare` | Detect data drift | `--method` (14 methods), `--threshold`, `--sample-size`, `--strict` |
All core commands accept **Data Source Options**: `--connection`/`--conn`, `--table`, `--query`, `--source-config`/`--sc`, `--source-name` for database connectivity (PostgreSQL, MySQL, SQLite, DuckDB, SQL Server, etc.).
### Profiler Commands

@@ -135,0 +149,0 @@

@@ -59,3 +59,7 @@ """CLI error handling utilities.

# DataSource errors (55-59)
DATASOURCE_ERROR = 55
DATASOURCE_CONNECTION_ERROR = 56
# =============================================================================

@@ -226,2 +230,26 @@ # Exception Classes

class DataSourceError(CLIError):
"""Error with data source connection or configuration."""
def __init__(
self,
message: str,
source_type: str | None = None,
hint: str | None = None,
) -> None:
"""Initialize data source error.
Args:
message: Error message
source_type: Type of data source (e.g., "postgresql", "mysql")
hint: Resolution hint
"""
super().__init__(
message=message,
code=ErrorCode.DATASOURCE_ERROR,
details={"source_type": source_type} if source_type else {},
hint=hint or "Check the connection string, credentials, and table name.",
)
# =============================================================================

@@ -228,0 +256,0 @@ # Error Handler

@@ -340,3 +340,92 @@ """Reusable CLI options and arguments.

# Parallel execution (DAG-based)
ParallelOpt = Annotated[
bool,
typer.Option(
"--parallel/--no-parallel",
help=(
"Enable DAG-based parallel execution. "
"Validators are grouped by dependency level and executed concurrently."
),
),
]
# Max workers for parallel execution
MaxWorkersOpt = Annotated[
int | None,
typer.Option(
"--max-workers",
help=(
"Maximum worker threads for parallel execution. "
"Only effective with --parallel. "
"Defaults to min(32, cpu_count + 4)."
),
min=1,
),
]
# Query pushdown for SQL data sources
PushdownOpt = Annotated[
bool | None,
typer.Option(
"--pushdown/--no-pushdown",
help=(
"Enable query pushdown for SQL data sources. "
"Validation logic is executed server-side when possible. "
"Default: auto-detect based on data source type."
),
),
]
# Execution engine (experimental)
UseEngineOpt = Annotated[
bool,
typer.Option(
"--use-engine/--no-use-engine",
help="Use execution engine for validation (experimental).",
),
]
# Partial unexpected count
PartialUnexpectedCountOpt = Annotated[
int,
typer.Option(
"--partial-unexpected-count",
help="Maximum number of unexpected values in partial list (BASIC+).",
min=0,
),
]
# Include unexpected index
IncludeUnexpectedIndexOpt = Annotated[
bool,
typer.Option(
"--include-unexpected-index",
help="Include row index for each unexpected value in results.",
),
]
# Return debug query
ReturnDebugQueryOpt = Annotated[
bool,
typer.Option(
"--return-debug-query",
help="Include Polars debug query expression in results (COMPLETE level).",
),
]
# Categorical threshold for schema learning
CategoricalThresholdOpt = Annotated[
int,
typer.Option(
"--categorical-threshold",
help=(
"Maximum unique values to treat a column as categorical "
"during schema inference."
),
min=1,
),
]
# =============================================================================

@@ -343,0 +432,0 @@ # Option Groups (for related options)
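The `--parallel`/`--max-workers` pair defined above describes DAG-based, level-grouped execution. A minimal sketch of that pattern with the standard library follows; the function name and validator shape are assumptions, not Truthound's internals:

```python
from concurrent.futures import ThreadPoolExecutor

def run_levels(levels, max_workers=None):
    """Run validators level by level: validators within one level are
    independent and execute concurrently; levels run in dependency order."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for level in levels:
            # map preserves input order within the level
            results.extend(pool.map(lambda v: v(), level))
    return results

levels = [
    [lambda: "schema", lambda: "not_null"],  # level 0: no dependencies
    [lambda: "unique"],                      # level 1: depends on level 0
]
print(run_levels(levels, max_workers=8))  # ['schema', 'not_null', 'unique']
```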

"""Core CLI commands for Truthound.
This package contains the fundamental CLI commands:
- read: Read and preview data
- learn: Learn schema from data files

@@ -14,2 +15,3 @@ - check: Validate data quality

from truthound.cli_modules.core.read import read_cmd
from truthound.cli_modules.core.learn import learn_cmd

@@ -32,2 +34,3 @@ from truthound.cli_modules.core.check import check_cmd

"""
parent_app.command(name="read")(read_cmd)
parent_app.command(name="learn")(learn_cmd)

@@ -44,2 +47,3 @@ parent_app.command(name="check")(check_cmd)

"register_commands",
"read_cmd",
"learn_cmd",

@@ -46,0 +50,0 @@ "check_cmd",

"""Check command - Validate data quality.
This module implements the `truthound check` command for validating
data quality in files.
This module implements the ``truthound check`` command for validating
data quality in files and database tables.
"""

@@ -14,2 +14,10 @@

from truthound.cli_modules.common.datasource import (
ConnectionOpt,
QueryOpt,
SourceConfigOpt,
SourceNameOpt,
TableOpt,
resolve_datasource,
)
from truthound.cli_modules.common.errors import error_boundary, require_file

@@ -22,5 +30,14 @@ from truthound.cli_modules.common.options import parse_list_callback

file: Annotated[
Path,
typer.Argument(help="Path to the data file"),
],
Optional[Path],
typer.Argument(
help="Path to the data file (CSV, JSON, Parquet, NDJSON)",
),
] = None,
# -- DataSource Options --
connection: ConnectionOpt = None,
table: TableOpt = None,
query: QueryOpt = None,
source_config: SourceConfigOpt = None,
source_name: SourceNameOpt = None,
# -- Validator Options --
validators: Annotated[

@@ -74,2 +91,23 @@ Optional[list[str]],

] = 1000,
partial_unexpected_count: Annotated[
int,
typer.Option(
"--partial-unexpected-count",
help="Maximum number of unexpected values in partial list (BASIC+)",
),
] = 20,
include_unexpected_index: Annotated[
bool,
typer.Option(
"--include-unexpected-index",
help="Include row index for each unexpected value in results",
),
] = False,
return_debug_query: Annotated[
bool,
typer.Option(
"--return-debug-query",
help="Include Polars debug query expression in results (COMPLETE level)",
),
] = False,
catch_exceptions: Annotated[

@@ -109,7 +147,48 @@ bool,

] = None,
# -- Execution Options --
parallel: Annotated[
bool,
typer.Option(
"--parallel/--no-parallel",
help=(
"Enable DAG-based parallel execution. "
"Validators are grouped by dependency level and executed concurrently."
),
),
] = False,
max_workers: Annotated[
Optional[int],
typer.Option(
"--max-workers",
help=(
"Maximum worker threads for parallel execution. "
"Only effective with --parallel. Defaults to min(32, cpu_count + 4)."
),
min=1,
),
] = None,
pushdown: Annotated[
Optional[bool],
typer.Option(
"--pushdown/--no-pushdown",
help=(
"Enable query pushdown for SQL data sources. "
"Validation logic is executed server-side when possible. "
"Default: auto-detect based on data source type."
),
),
] = None,
use_engine: Annotated[
bool,
typer.Option(
"--use-engine/--no-use-engine",
help="Use execution engine for validation (experimental).",
),
] = False,
) -> None:
"""Validate data quality in a file.
"""Validate data quality in a file or database table.
This command runs data quality validators on the specified file and
reports any issues found.
This command runs data quality validators on the specified data
and reports any issues found. Supports file paths, database
connections, and source config files.

@@ -123,9 +202,7 @@ Examples:

truthound check data.csv --result-format complete
truthound check data.csv --rf boolean_only
truthound check data.csv --no-catch-exceptions
truthound check data.csv --max-retries 3
truthound check data.csv --show-exceptions --format json
truthound check --connection "postgresql://user:pass@host/db" --table users
truthound check --conn "sqlite:///data.db" --table orders --pushdown
truthound check --source-config db.yaml --strict
truthound check data.csv --parallel --max-workers 8
truthound check data.csv --exclude-columns first_name,last_name
truthound check data.csv --validator-config '{"unique": {"exclude_columns": ["first_name"]}}'
truthound check data.csv --validator-config config.json
"""

@@ -135,4 +212,12 @@ from truthound.api import check

# Validate files exist
require_file(file)
# Resolve data source
data_path, source = resolve_datasource(
file=file,
connection=connection,
table=table,
query=query,
source_config=source_config,
source_name=source_name,
)
if schema_file:

@@ -192,24 +277,45 @@ require_file(schema_file, "Schema file")

# Build result_format config
rf_config: str | ResultFormatConfig = result_format
if include_unexpected_rows or max_unexpected_rows != 1000:
# Build result_format config — include all fine-grained parameters
has_custom_rf = (
include_unexpected_rows
or max_unexpected_rows != 1000
or partial_unexpected_count != 20
or include_unexpected_index
or return_debug_query
)
rf_config: str | ResultFormatConfig
if has_custom_rf:
rf_config = ResultFormatConfig(
format=ResultFormat.from_string(result_format),
partial_unexpected_count=partial_unexpected_count,
include_unexpected_rows=include_unexpected_rows,
max_unexpected_rows=max_unexpected_rows,
include_unexpected_index=include_unexpected_index,
return_debug_query=return_debug_query,
)
else:
rf_config = result_format
# Build API call kwargs
check_kwargs: dict[str, Any] = {
"validators": validator_list,
"validator_config": v_config,
"min_severity": min_severity,
"schema": schema_file,
"auto_schema": auto_schema,
"result_format": rf_config,
"catch_exceptions": catch_exceptions,
"max_retries": max_retries,
"exclude_columns": exclude_cols,
"parallel": parallel,
"max_workers": max_workers,
"pushdown": pushdown,
"use_engine": use_engine,
}
try:
report = check(
str(file),
validators=validator_list,
validator_config=v_config,
min_severity=min_severity,
schema=schema_file,
auto_schema=auto_schema,
result_format=rf_config,
catch_exceptions=catch_exceptions,
max_retries=max_retries,
exclude_columns=exclude_cols,
)
if source is not None:
report = check(source=source, **check_kwargs)
else:
report = check(data_path, **check_kwargs)
except Exception as e:

@@ -219,2 +325,5 @@ typer.echo(f"Error: {e}", err=True)

# Determine label for HTML report title
report_label = source_name or (source.name if source else str(file))
# Output the report

@@ -236,3 +345,3 @@ if format == "json":

html = generate_html_report(report, title=f"Validation Report: {file.name}")
html = generate_html_report(report, title=f"Validation Report: {report_label}")
output.write_text(html, encoding="utf-8")

@@ -239,0 +348,0 @@ typer.echo(f"HTML report written to {output}")

"""Compare command - Compare datasets for drift.
This module implements the `truthound compare` command for detecting
data drift between two datasets.
This module implements the ``truthound compare`` command for detecting
data drift between two datasets from files or database tables.
"""

@@ -14,3 +14,7 @@

from truthound.cli_modules.common.errors import error_boundary, require_file
from truthound.cli_modules.common.datasource import (
SourceConfigOpt,
resolve_compare_sources,
)
from truthound.cli_modules.common.errors import error_boundary
from truthound.cli_modules.common.options import parse_list_callback

@@ -22,9 +26,16 @@

baseline: Annotated[
Path,
typer.Argument(help="Baseline (reference) data file"),
],
Optional[Path],
typer.Argument(
help="Baseline (reference) data file",
),
] = None,
current: Annotated[
Path,
typer.Argument(help="Current data file to compare"),
],
Optional[Path],
typer.Argument(
help="Current data file to compare",
),
] = None,
# -- DataSource Config (for database-to-database comparison) --
source_config: SourceConfigOpt = None,
# -- Compare Options --
columns: Annotated[

@@ -36,3 +47,10 @@ Optional[list[str]],

str,
typer.Option("--method", "-m", help="Detection method (auto, ks, psi, chi2, js)"),
typer.Option(
"--method",
"-m",
help=(
"Detection method: auto, ks, psi, chi2, js, kl, wasserstein, "
"cvm, anderson, hellinger, bhattacharyya, tv, energy, mmd"
),
),
] = "auto",

@@ -43,2 +61,11 @@ threshold: Annotated[

] = None,
sample_size: Annotated[
Optional[int],
typer.Option(
"--sample-size",
"--sample",
help="Sample size for large datasets. Uses random sampling for faster comparison.",
min=1,
),
] = None,
format: Annotated[

@@ -60,11 +87,29 @@ str,

This command compares a baseline dataset with a current dataset and
detects statistical drift in column distributions.
detects statistical drift in column distributions. Supports file
paths or a --source-config for database-to-database comparison.
Detection Methods:
- auto: Automatically select best method per column
- auto: Automatically select best method per column (recommended)
- ks: Kolmogorov-Smirnov test (numeric)
- psi: Population Stability Index
- psi: Population Stability Index (ML monitoring)
- chi2: Chi-squared test (categorical)
- js: Jensen-Shannon divergence
- js: Jensen-Shannon divergence (any type)
- kl: Kullback-Leibler divergence (numeric)
- wasserstein: Earth Mover's distance (numeric)
- cvm: Cramer-von Mises test (numeric, tail-sensitive)
- anderson: Anderson-Darling test (numeric, extreme values)
- hellinger: Hellinger distance (bounded metric)
- bhattacharyya: Bhattacharyya distance (classification bounds)
- tv: Total Variation distance (max probability diff)
- energy: Energy distance (location/scale)
- mmd: Maximum Mean Discrepancy (high-dimensional)
Source Config Format (YAML):
baseline:
connection: "postgresql://user:pass@host/db"
table: train_data
current:
connection: "postgresql://user:pass@host/db"
table: production_data
Examples:

@@ -75,8 +120,15 @@ truthound compare baseline.csv current.csv

truthound compare old.csv new.csv --columns price,quantity
truthound compare --source-config drift_config.yaml --method ks
truthound compare big_train.csv big_prod.csv --sample-size 10000
"""
from truthound.drift import compare
# Validate files exist
require_file(baseline, "Baseline file")
require_file(current, "Current file")
# Resolve both data sources
(baseline_path, baseline_source), (current_path, current_source) = (
resolve_compare_sources(
baseline=baseline,
current=current,
source_config=source_config,
)
)

@@ -86,9 +138,22 @@ # Parse columns if provided

# Determine inputs for the compare API
baseline_input = (
baseline_source.to_polars_lazyframe().collect()
if baseline_source
else baseline_path
)
current_input = (
current_source.to_polars_lazyframe().collect()
if current_source
else current_path
)
try:
drift_report = compare(
str(baseline),
str(current),
baseline_input,
current_input,
columns=column_list,
method=method,
threshold=threshold,
sample_size=sample_size,
)

@@ -95,0 +160,0 @@ except Exception as e:
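Of the drift methods listed in the compare docstring, PSI is compact enough to sketch from its standard formula. This is an illustrative implementation, not Truthound's code:

```python
import math

def psi(expected, actual, eps=1e-6):
    """Population Stability Index over two binned distributions.
    Inputs are probability vectors of equal length; eps guards log(0)."""
    return sum(
        (a - e) * math.log((a + eps) / (e + eps))
        for e, a in zip(expected, actual)
    )

baseline = [0.25, 0.25, 0.25, 0.25]
current = [0.10, 0.20, 0.30, 0.40]
score = psi(baseline, current)
# Common rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 major drift
```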

@@ -1,5 +0,5 @@

"""Learn command - Learn schema from data files.
"""Learn command - Learn schema from data.
This module implements the `truthound learn` command for inferring
schema from data files.
This module implements the ``truthound learn`` command for inferring
schema from data files and database tables.
"""

@@ -10,7 +10,15 @@

from pathlib import Path
from typing import Annotated
from typing import Annotated, Optional
import typer
from truthound.cli_modules.common.errors import error_boundary, require_file
from truthound.cli_modules.common.datasource import (
ConnectionOpt,
QueryOpt,
SourceConfigOpt,
SourceNameOpt,
TableOpt,
resolve_datasource,
)
from truthound.cli_modules.common.errors import error_boundary

@@ -21,5 +29,14 @@

file: Annotated[
Path,
typer.Argument(help="Path to the data file to learn from"),
],
Optional[Path],
typer.Argument(
help="Path to the data file to learn from",
),
] = None,
# -- DataSource Options --
connection: ConnectionOpt = None,
table: TableOpt = None,
query: QueryOpt = None,
source_config: SourceConfigOpt = None,
source_name: SourceNameOpt = None,
# -- Schema Options --
output: Annotated[

@@ -33,6 +50,17 @@ Path,

] = False,
categorical_threshold: Annotated[
int,
typer.Option(
"--categorical-threshold",
help=(
"Maximum unique values to treat a column as categorical "
"during schema inference (default: 20)"
),
min=1,
),
] = 20,
) -> None:
"""Learn schema from a data file.
"""Learn schema from a data file or database table.
This command analyzes the data file and generates a schema definition
This command analyzes the data and generates a schema definition
that captures column types, constraints, and patterns.

@@ -44,10 +72,31 @@

truthound learn data.csv --no-constraints
truthound learn data.csv --categorical-threshold 50
truthound learn --connection "postgresql://user:pass@host/db" --table users
truthound learn --source-config db.yaml -o db_schema.yaml
"""
from truthound.schema import learn
# Validate file exists
require_file(file)
# Resolve data source
data_path, source = resolve_datasource(
file=file,
connection=connection,
table=table,
query=query,
source_config=source_config,
source_name=source_name,
)
try:
schema = learn(str(file), infer_constraints=not no_constraints)
if source is not None:
schema = learn(
source=source,
infer_constraints=not no_constraints,
categorical_threshold=categorical_threshold,
)
else:
schema = learn(
data_path,
infer_constraints=not no_constraints,
categorical_threshold=categorical_threshold,
)
schema.save(output)

@@ -54,0 +103,0 @@

"""Mask command - Mask sensitive data.
This module implements the `truthound mask` command for masking
sensitive data in files.
This module implements the ``truthound mask`` command for masking
sensitive data in files and database tables.
"""

@@ -14,3 +14,11 @@

from truthound.cli_modules.common.errors import error_boundary, require_file
from truthound.cli_modules.common.datasource import (
ConnectionOpt,
QueryOpt,
SourceConfigOpt,
SourceNameOpt,
TableOpt,
resolve_datasource,
)
from truthound.cli_modules.common.errors import error_boundary
from truthound.cli_modules.common.options import parse_list_callback

@@ -22,9 +30,18 @@

file: Annotated[
Path,
typer.Argument(help="Path to the data file"),
],
Optional[Path],
typer.Argument(
help="Path to the data file (CSV, JSON, Parquet, NDJSON)",
),
] = None,
# -- DataSource Options --
connection: ConnectionOpt = None,
table: TableOpt = None,
query: QueryOpt = None,
source_config: SourceConfigOpt = None,
source_name: SourceNameOpt = None,
# -- Mask Options --
output: Annotated[
Path,
typer.Option("--output", "-o", help="Output file path"),
],
] = ...,
columns: Annotated[

@@ -46,5 +63,5 @@ Optional[list[str]],

) -> None:
"""Mask sensitive data in a file.
"""Mask sensitive data in a file or database table.
This command creates a copy of the data file with sensitive columns
This command creates a copy of the data with sensitive columns
masked using the specified strategy.

@@ -61,3 +78,4 @@

truthound mask data.csv -o masked.csv --strategy hash
truthound mask data.csv -o masked.csv --columns email --strict
truthound mask --connection "postgresql://user:pass@host/db" --table users -o masked.csv
truthound mask --source-config db.yaml -o masked.parquet
"""

@@ -68,4 +86,11 @@ import warnings

# Validate file exists
require_file(file)
# Resolve data source
data_path, source = resolve_datasource(
file=file,
connection=connection,
table=table,
query=query,
source_config=source_config,
source_name=source_name,
)

@@ -79,3 +104,6 @@ # Parse columns if provided

warnings.simplefilter("always", MaskingWarning)
masked_df = mask(str(file), columns=column_list, strategy=strategy, strict=strict)
if source is not None:
masked_df = mask(source=source, columns=column_list, strategy=strategy, strict=strict)
else:
masked_df = mask(data_path, columns=column_list, strategy=strategy, strict=strict)

@@ -82,0 +110,0 @@ # Display any warnings
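The `hash` strategy referenced in the mask examples can be sketched with the standard library. Digest truncation and null handling here are assumptions for illustration, not Truthound's implementation:

```python
import hashlib

def mask_hash(value, salt=""):
    """Replace a sensitive value with a stable, irreversible digest."""
    if value is None:
        return None
    return hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()[:16]

rows = [{"email": "a@example.com"}, {"email": None}]
masked = [{**r, "email": mask_hash(r["email"])} for r in rows]
```

Hashing keeps equal inputs equal across rows, so joins on the masked column still work, unlike `redact`.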

"""Profile command - Generate data profiles.
This module implements the `truthound profile` command for generating
statistical profiles of data files.
This module implements the ``truthound profile`` command for generating
statistical profiles of data files and database tables.
"""

@@ -14,3 +14,11 @@

from truthound.cli_modules.common.errors import error_boundary, require_file
from truthound.cli_modules.common.datasource import (
ConnectionOpt,
QueryOpt,
SourceConfigOpt,
SourceNameOpt,
TableOpt,
resolve_datasource,
)
from truthound.cli_modules.common.errors import error_boundary

@@ -21,5 +29,14 @@

file: Annotated[
Path,
typer.Argument(help="Path to the data file"),
],
Optional[Path],
typer.Argument(
help="Path to the data file (CSV, JSON, Parquet, NDJSON)",
),
] = None,
# -- DataSource Options --
connection: ConnectionOpt = None,
table: TableOpt = None,
query: QueryOpt = None,
source_config: SourceConfigOpt = None,
source_name: SourceNameOpt = None,
# -- Output Options --
format: Annotated[

@@ -36,3 +53,3 @@ str,

This command analyzes the data file and generates statistics including:
This command analyzes the data and generates statistics including:
- Row and column counts

@@ -48,10 +65,22 @@ - Null ratios per column

truthound profile data.csv -o profile.json
truthound profile --connection "postgresql://user:pass@host/db" --table users
truthound profile --source-config db.yaml --format json
"""
from truthound.api import profile
# Validate file exists
require_file(file)
# Resolve data source
data_path, source = resolve_datasource(
file=file,
connection=connection,
table=table,
query=query,
source_config=source_config,
source_name=source_name,
)
try:
profile_report = profile(str(file))
if source is not None:
profile_report = profile(source=source)
else:
profile_report = profile(data_path)
except Exception as e:

@@ -58,0 +87,0 @@ typer.echo(f"Error: {e}", err=True)

"""Scan command - Scan for PII.
This module implements the `truthound scan` command for detecting
personally identifiable information in data files.
This module implements the ``truthound scan`` command for detecting
personally identifiable information in data files and database tables.
"""

@@ -14,3 +14,11 @@

from truthound.cli_modules.common.errors import error_boundary, require_file
from truthound.cli_modules.common.datasource import (
ConnectionOpt,
QueryOpt,
SourceConfigOpt,
SourceNameOpt,
TableOpt,
resolve_datasource,
)
from truthound.cli_modules.common.errors import error_boundary

@@ -21,5 +29,14 @@

file: Annotated[
Path,
typer.Argument(help="Path to the data file"),
],
Optional[Path],
typer.Argument(
help="Path to the data file (CSV, JSON, Parquet, NDJSON)",
),
] = None,
# -- DataSource Options --
connection: ConnectionOpt = None,
table: TableOpt = None,
query: QueryOpt = None,
source_config: SourceConfigOpt = None,
source_name: SourceNameOpt = None,
# -- Output Options --
format: Annotated[

@@ -36,3 +53,3 @@ str,

This command analyzes data files to detect columns that may contain
This command analyzes data to detect columns that may contain
PII such as names, emails, phone numbers, SSNs, etc.

@@ -44,11 +61,22 @@

truthound scan data.csv -o pii_report.json
truthound scan data.csv --format html -o pii_report.html
truthound scan --connection "postgresql://user:pass@host/db" --table users
truthound scan --source-config db.yaml --format json
"""
from truthound.api import scan
# Validate file exists
require_file(file)
# Resolve data source
data_path, source = resolve_datasource(
file=file,
connection=connection,
table=table,
query=query,
source_config=source_config,
source_name=source_name,
)
try:
pii_report = scan(str(file))
if source is not None:
pii_report = scan(source=source)
else:
pii_report = scan(data_path)
except Exception as e:

@@ -73,4 +101,5 @@ typer.echo(f"Error: {e}", err=True)

report_label = source_name or (source.name if source else str(file))
html = generate_pii_html_report(
pii_report, title=f"PII Scan Report: {file.name}"
pii_report, title=f"PII Scan Report: {report_label}"
)

@@ -77,0 +106,0 @@ output.write_text(html, encoding="utf-8")

@@ -190,3 +190,13 @@ """Factory functions for creating data sources.

if isinstance(data, str):
if data.startswith(("postgresql://", "postgres://")):
sql_prefixes = (
"postgresql://", "postgres://", "mysql://",
"sqlite:", "duckdb:", "mssql://", "sqlserver://",
)
sql_suffixes = (".db", ".duckdb")
is_sql = (
data.startswith(sql_prefixes)
or data.endswith(sql_suffixes)
or "redshift.amazonaws.com" in data
)
if is_sql:
table = kwargs.pop("table", None)

@@ -197,18 +207,4 @@ if not table:

)
from truthound.datasources.sql import PostgreSQLDataSource
return PostgreSQLDataSource.from_connection_string(
data, table=table, **kwargs
)
return get_sql_datasource(data, table=table, **kwargs)
if data.startswith("mysql://"):
table = kwargs.pop("table", None)
if not table:
raise DataSourceError(
"SQL connection string requires 'table' parameter"
)
from truthound.datasources.sql import MySQLDataSource
return MySQLDataSource.from_connection_string(
data, table=table, **kwargs
)
# File doesn't exist

@@ -262,2 +258,11 @@ if not path.exists():

# SQLite: URI format (sqlite:///path) or file path (.db)
if connection_string.startswith("sqlite:"):
# sqlite:///path/to/db or sqlite:///:memory:
db_path = connection_string.replace("sqlite:///", "").replace("sqlite://", "")
if not db_path:
db_path = ":memory:"
from truthound.datasources.sql import SQLiteDataSource
return SQLiteDataSource(table=table, database=db_path, **kwargs)
if connection_string.endswith(".db") or connection_string == ":memory:":

@@ -267,2 +272,24 @@ from truthound.datasources.sql import SQLiteDataSource

# DuckDB: URI format (duckdb:///path) or file suffix (.duckdb)
if connection_string.startswith("duckdb:") or connection_string.endswith(".duckdb"):
try:
from truthound.datasources.sql import DuckDBDataSource
except ImportError:
raise DataSourceError(
"DuckDB support requires duckdb. "
"Install with: pip install duckdb"
)
if DuckDBDataSource is None:
raise DataSourceError(
"DuckDB support requires duckdb. "
"Install with: pip install duckdb"
)
if connection_string.startswith("duckdb:"):
db_path = connection_string.replace("duckdb:///", "").replace("duckdb://", "")
if not db_path:
db_path = ":memory:"
else:
db_path = connection_string
return DuckDBDataSource(table=table, database=db_path, **kwargs)
# Oracle

@@ -312,3 +339,4 @@ if connection_string.startswith("oracle://") or "oracle" in connection_string.lower():

f"Unsupported SQL connection string format: {connection_string}. "
"Supported: postgresql://, mysql://, sqlite:///path, duckdb:///path, "
"mssql://, sqlserver://, *.db, *.duckdb. "
"For BigQuery, Snowflake, Redshift, Databricks, use their specific classes."

@@ -351,4 +379,8 @@ )

return "mysql"
if data.startswith("sqlite:") or data.endswith(".db") or data == ":memory:":
return "sqlite"
if data.startswith("duckdb:") or data.endswith(".duckdb"):
return "duckdb"
if data.startswith(("mssql://", "sqlserver://")):
return "sqlserver"
return "unknown"
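The detection order above matters: URI schemes are checked before file suffixes, and anything unrecognized falls through to `"unknown"`. A self-contained sketch of the same dispatch (function name is illustrative, not truthound's internal symbol):

```python
def detect_sql_type(data: str) -> str:
    """Classify a connection string or path by scheme/suffix, mirroring the dispatch above."""
    if data.startswith(("postgresql://", "postgres://")):
        return "postgresql"
    if data.startswith("mysql://"):
        return "mysql"
    if data.startswith("sqlite:") or data.endswith(".db") or data == ":memory:":
        return "sqlite"
    if data.startswith("duckdb:") or data.endswith(".duckdb"):
        return "duckdb"
    if data.startswith(("mssql://", "sqlserver://")):
        return "sqlserver"
    return "unknown"

print(detect_sql_type("sqlite:///app.db"))   # sqlite (scheme wins before .db suffix check)
print(detect_sql_type("warehouse.duckdb"))   # duckdb
```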

@@ -440,1 +472,78 @@

return DictDataSource(data)
def get_datasource_from_config(config: dict[str, Any]) -> DataSourceProtocol:
"""Create a DataSource from a configuration dictionary.

Convenience function for creating data sources from parsed
configuration files (JSON/YAML). Delegates to ``get_sql_datasource()``
for connection-string-based configs or constructs backend-specific
classes from individual parameters.

Config styles supported:

Connection string::

    {"connection": "postgresql://user:pass@host/db", "table": "users"}

Individual parameters::

    {"type": "postgresql", "host": "localhost", "database": "mydb",
    "user": "postgres", "password": "...", "table": "users"}

Args:
    config: Configuration dictionary with connection details.

Returns:
    Configured DataSource instance.

Raises:
    DataSourceError: If the config is invalid or backend unavailable.
"""
connection = config.get("connection")
table = config.get("table")
query = config.get("query")
source_type = config.get("type")
# Style 1: Connection string
if connection:
if not table and not query:
raise DataSourceError(
"Config with 'connection' requires 'table' or 'query'."
)
return get_sql_datasource(
connection, table=table or "__query__", query=query
)
# Style 2: Individual parameters
if not source_type:
raise DataSourceError(
"Config must have either 'connection' or 'type' field."
)
if not table and not query:
raise DataSourceError(
f"Config for type '{source_type}' requires 'table' or 'query'."
)
from truthound.datasources.sql import get_available_sources
available = get_available_sources()
source_cls = available.get(source_type)
if source_cls is None:
available_names = [k for k, v in available.items() if v is not None]
raise DataSourceError(
f"Data source type '{source_type}' is not available. "
f"Available: {', '.join(available_names)}."
)
# Build kwargs (exclude meta keys)
meta_keys = {"type", "table", "query", "name"}
kwargs: dict[str, Any] = {"table": table} if table else {}
if query:
kwargs["query"] = query
for key, value in config.items():
if key not in meta_keys:
kwargs[key] = value
return source_cls(**kwargs)
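The kwargs-building step above forwards `table`/`query` plus every non-meta key to the backend class. A minimal standalone sketch of that filtering (not importing truthound; `build_backend_kwargs` is an illustrative name):

```python
from typing import Any

# Meta keys describe the source itself and are not backend constructor arguments.
META_KEYS = {"type", "table", "query", "name"}

def build_backend_kwargs(config: dict[str, Any]) -> dict[str, Any]:
    """Forward table/query plus all non-meta keys, as the factory above does."""
    kwargs: dict[str, Any] = {}
    if config.get("table"):
        kwargs["table"] = config["table"]
    if config.get("query"):
        kwargs["query"] = config["query"]
    for key, value in config.items():
        if key not in META_KEYS:
            kwargs[key] = value
    return kwargs

cfg = {"type": "postgresql", "host": "localhost", "database": "mydb",
       "user": "postgres", "table": "users"}
print(build_backend_kwargs(cfg))
# {'table': 'users', 'host': 'localhost', 'database': 'mydb', 'user': 'postgres'}
```

Because `type` is excluded, the same config dict can drive both backend selection and constructor arguments without collisions.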

@@ -183,2 +183,49 @@ """Tests for check command --exclude-columns and --validator-config options."""

class TestCheckDatasourceOptions:
"""Tests for --connection, --table, and --source-config on check."""
def test_check_with_connection_string(self, runner, app):
"""--connection + --table passes source to API."""
with (
patch("truthound.datasources.factory.get_sql_datasource") as mock_sql,
patch("truthound.api.check") as mock_check,
):
import polars as pl
mock_source = MagicMock()
mock_source.name = "users"
mock_source.to_polars_lazyframe.return_value = pl.LazyFrame({"id": [1]})
mock_sql.return_value = mock_source
mock_report = MagicMock()
mock_report.has_issues = False
mock_report.exception_summary = None
mock_check.return_value = mock_report
result = runner.invoke(app, [
"--connection", "postgresql://user:pass@host/db",
"--table", "users",
])
assert result.exit_code == 0
# check() should be called with source= keyword
call_kwargs = mock_check.call_args
assert call_kwargs.kwargs.get("source") is not None
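The assertion above relies on `Mock.call_args` exposing `.args` and `.kwargs` attributes (standard in `unittest.mock` since Python 3.8). A minimal standalone illustration:

```python
from unittest.mock import MagicMock

m = MagicMock()
m("data.csv", source="db_source", table="users")

# call_args records the most recent call; .kwargs is a plain dict view.
print(m.call_args.args)    # ('data.csv',)
print(m.call_args.kwargs)  # {'source': 'db_source', 'table': 'users'}
```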
def test_check_file_and_connection_mutually_exclusive(self, runner, app, sample_csv):
"""Both file and --connection raises error."""
result = runner.invoke(app, [
str(sample_csv),
"--connection", "postgresql://host/db",
"--table", "t",
])
assert result.exit_code != 0
def test_check_connection_without_table_error(self, runner, app):
"""--connection without --table raises error."""
result = runner.invoke(app, [
"--connection", "postgresql://host/db",
])
assert result.exit_code != 0
class TestCombinedOptions:

@@ -185,0 +232,0 @@ """Tests for combined --exclude-columns and --validator-config."""

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
[project]
name = "truthound"
version = "1.3.2"
description = "Zero-Configuration Data Quality Framework Powered by Polars"
readme = "README.md"
license = "Apache-2.0"
requires-python = ">=3.11"
authors = [
{ name = "seadonggyun4", email = "seadonggyun4@gmail.com" }
]
keywords = [
"data-quality",
"data-validation",
"polars",
"pii-detection",
"data-masking",
]
classifiers = [
"Development Status :: 3 - Alpha",
"Intended Audience :: Developers",
"Intended Audience :: Science/Research",
"License :: OSI Approved :: Apache Software License",
"Operating System :: OS Independent",
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3.11",
"Programming Language :: Python :: 3.12",
"Topic :: Scientific/Engineering",
"Topic :: Software Development :: Libraries :: Python Modules",
"Typing :: Typed",
]
dependencies = [
"polars>=1.0.0",
"pyyaml>=6.0.0",
"rich>=13.0.0",
"typer>=0.12.0",
]
[project.optional-dependencies]
# Report generation
reports = [
"jinja2>=3.0.0",
]
# Statistical drift detection
drift = [
"scipy>=1.10.0",
]
# Anomaly detection with ML
anomaly = [
"scipy>=1.10.0",
"scikit-learn>=1.3.0",
]
# Cloud storage backends
s3 = [
"boto3>=1.26.0",
]
gcs = [
"google-cloud-storage>=2.0.0",
]
azure = [
"azure-storage-blob>=12.0.0",
]
# Database storage backend
database = [
"sqlalchemy>=2.0.0",
]
# All storage backends
stores = [
"boto3>=1.26.0",
"google-cloud-storage>=2.0.0",
"azure-storage-blob>=12.0.0",
"sqlalchemy>=2.0.0",
]
# DuckDB database
duckdb = [
"duckdb>=1.0.0",
]
# NoSQL databases
mongodb = [
"motor>=3.0.0",
]
elasticsearch = [
"elasticsearch[async]>=8.0.0",
]
nosql = [
"motor>=3.0.0",
"elasticsearch[async]>=8.0.0",
]
# Streaming platforms
kafka = [
"aiokafka>=0.9.0",
]
streaming = [
"aiokafka>=0.9.0",
]
# All async datasources
async-datasources = [
"motor>=3.0.0",
"elasticsearch[async]>=8.0.0",
"aiokafka>=0.9.0",
]
# Interactive dashboard (Phase 8)
dashboard = [
"reflex>=0.4.0",
]
# PDF export support
pdf = [
"weasyprint>=60.0",
]
# Performance optimization
perf = [
"xxhash>=3.4.0",
]
# Full installation with all optional dependencies
all = [
"jinja2>=3.0.0",
"pandas>=2.0.0",
"scipy>=1.10.0",
"scikit-learn>=1.3.0",
"boto3>=1.26.0",
"google-cloud-storage>=2.0.0",
"azure-storage-blob>=12.0.0",
"sqlalchemy>=2.0.0",
"duckdb>=1.0.0",
"reflex>=0.4.0",
"weasyprint>=60.0",
"motor>=3.0.0",
"elasticsearch[async]>=8.0.0",
"aiokafka>=0.9.0",
"xxhash>=3.4.0",
]
# Development dependencies
dev = [
"pytest>=8.0.0",
"pytest-cov>=4.0.0",
"pytest-asyncio>=0.23.0",
"ruff>=0.4.0",
"mypy>=1.10.0",
"pandas>=2.0.0",
"scipy>=1.10.0",
"scikit-learn>=1.3.0",
]
[project.scripts]
truthound = "truthound.cli:app"
[project.urls]
Homepage = "https://github.com/seadonggyun4/Truthound"
Repository = "https://github.com/seadonggyun4/Truthound"
Issues = "https://github.com/seadonggyun4/Truthound/issues"
[tool.hatch.build.targets.wheel]
packages = ["src/truthound"]
[tool.hatch.envs.default]
dependencies = [
"pytest>=8.0.0",
"pytest-cov>=4.0.0",
"ruff>=0.4.0",
"mypy>=1.10.0",
"pandas>=2.0.0",
]
[tool.hatch.envs.default.scripts]
test = "pytest {args:tests}"
test-cov = "pytest --cov=truthound --cov-report=term-missing {args:tests}"
lint = "ruff check src tests"
format = "ruff format src tests"
typecheck = "mypy src"
[tool.ruff]
target-version = "py311"
line-length = 100
src = ["src", "tests"]
[tool.ruff.lint]
select = [
"E", # pycodestyle errors
"W", # pycodestyle warnings
"F", # Pyflakes
"I", # isort
"UP", # pyupgrade
"B", # flake8-bugbear
"SIM", # flake8-simplify
"TCH", # flake8-type-checking
]
ignore = [
"E501", # line too long (handled by formatter)
]
[tool.ruff.lint.isort]
known-first-party = ["truthound"]
[tool.mypy]
python_version = "3.11"
strict = true
warn_return_any = true
warn_unused_ignores = true
[tool.pytest.ini_options]
testpaths = ["tests"]
pythonpath = ["src"]
markers = [
"slow: marks tests as slow (deselect with '-m \"not slow\"')",
"e2e: marks tests as end-to-end tests",
"scale_100m: marks tests as 100M+ scale tests (run with '-m scale_100m')",
]