# truthound read

Read and preview data from files or database connections. Supports row/column selection, multiple output formats, and schema inspection.

## Synopsis

```bash
truthound read [FILE] [OPTIONS]
```

## Arguments

| Argument | Required | Description |
|----------|----------|-------------|
| `file` | No | Path to the data file (CSV, JSON, Parquet, NDJSON) |

## Data Source Options

| Option | Short | Default | Description |
|--------|-------|---------|-------------|
| `--connection` | `--conn` | None | Database connection string |
| `--table` | | None | Database table name |
| `--query` | | None | SQL query (alternative to `--table`) |
| `--source-config` | `--sc` | None | Path to data source config file (JSON/YAML) |
| `--source-name` | | None | Custom label for the data source |

## Selection Options

| Option | Short | Default | Description |
|--------|-------|---------|-------------|
| `--head` | `-n` | None | Show only the first N rows |
| `--sample` | `-s` | None | Random sample of N rows |
| `--columns` | `-c` | None | Columns to include (comma-separated) |

## Output Options

| Option | Short | Default | Description |
|--------|-------|---------|-------------|
| `--format` | `-f` | `table` | Output format (table, csv, json, parquet, ndjson) |
| `--output` | `-o` | None | Output file path |

## Inspection Options

| Option | Short | Default | Description |
|--------|-------|---------|-------------|
| `--schema-only` | | `false` | Show only column names and types |
| `--count-only` | | `false` | Show only the row count |

## Examples

### Basic Reading

```bash
truthound read data.csv
truthound read data.parquet --head 20
truthound read data.csv --columns id,name,age
```

### Database Reading

```bash
truthound read --connection "postgresql://user:pass@host/db" --table users
truthound read --connection "sqlite:///data.db" --table orders --head 10
truthound read --source-config db.yaml --sample 1000
```

### Schema Inspection

```bash
truthound read data.csv --schema-only
truthound read --connection "postgresql://host/db" --table users --schema-only
```

### Format Conversion

```bash
truthound read data.csv --format json -o output.json
truthound read data.csv --format parquet -o output.parquet
truthound read data.csv --format csv --head 100
```

### Row Count

```bash
truthound read data.csv --count-only
```

## Related Commands

- [`check`](check.md) - Validate data quality
- [`profile`](profile.md) - Generate data profile
- [`learn`](learn.md) - Learn schema from data

## See Also

- [Python API: th.read()](../../python-api/core-functions.md#thread)
- [Data Source Options](../../guides/datasources/cli-datasource-guide.md)
# CLI Data Source Guide

All Truthound CLI commands support reading data from databases and external sources in addition to local files. This guide covers the shared data source options available across all core commands.

## Overview

Truthound CLI commands accept data from three input modes:

1. **File mode** (default): Pass a file path as a positional argument
2. **Connection string mode**: Use `--connection` and `--table` (or `--query`) to connect to a database
3. **Source config mode**: Use `--source-config` to load connection details from a JSON or YAML file
These modes are mutually exclusive: specifying more than one (for example, a file argument together with `--connection`) raises an error.
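The exclusivity rule can be sketched in plain Python. `resolve_input` below is an illustrative helper, not part of the Truthound API:

```python
def resolve_input(file=None, connection=None, source_config=None):
    """Return which input mode is active; reject conflicting inputs."""
    modes = [
        name
        for name, value in [
            ("file argument", file),
            ("--connection", connection),
            ("--source-config", source_config),
        ]
        if value is not None
    ]
    if len(modes) > 1:
        raise ValueError(f"Conflicting data inputs: {' and '.join(modes)}")
    if not modes:
        raise ValueError("No data input specified")
    return modes[0]
```

Passing no input at all is also an error, so a command always has exactly one resolved source.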
## Data Source Options

The following options are available on all core commands (`check`, `scan`, `mask`, `profile`, `learn`, `compare`, `read`):

| Option | Short | Description |
|--------|-------|-------------|
| `--connection` | `--conn` | Database connection string (see formats below) |
| `--table` | | Database table name to read |
| `--query` | | SQL query to execute (alternative to `--table`) |
| `--source-config` | `--sc` | Path to a data source config file (JSON or YAML) |
| `--source-name` | | Custom label for the data source (used in reports) |

## Connection String Formats

### PostgreSQL

```
postgresql://user:password@host:5432/dbname
```

Install the PostgreSQL backend:

```bash
pip install truthound[postgresql]
```

### MySQL

```
mysql://user:password@host:3306/dbname
```

Install the MySQL backend:

```bash
pip install truthound[mysql]
```

### SQLite

```
sqlite:///path/to/database.db
sqlite:///./relative/path.db
```

SQLite is included by default; no extra install is needed.

### DuckDB

```
duckdb:///path/to/database.duckdb
duckdb:///:memory:
```

Install the DuckDB backend:

```bash
pip install truthound[duckdb]
```

### Microsoft SQL Server

```
mssql://user:password@host:1433/dbname
```

Install the SQL Server backend:

```bash
pip install truthound[mssql]
```

## Source Config File Format

For repeatable or complex connection setups, use a source config file with `--source-config`.

### JSON Example

```json
{
  "type": "postgresql",
  "connection": "postgresql://user:password@host:5432/dbname",
  "table": "users",
  "source_name": "production-users"
}
```
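When generating config files from scripts, you can validate the documented required fields before handing the file to the CLI. A standard-library sketch; `load_source_config` is a hypothetical helper, not the Truthound API:

```python
import json

# Per the docs, a single-source config needs 'connection' or 'type',
# plus 'table' or 'query'.
REQUIRED_ONE_OF = (("connection", "type"), ("table", "query"))


def load_source_config(text: str) -> dict:
    """Parse a JSON source config and check the documented required fields."""
    config = json.loads(text)
    if not isinstance(config, dict):
        raise ValueError("source config must be a JSON object")
    for group in REQUIRED_ONE_OF:
        if not any(key in config for key in group):
            raise ValueError(f"config needs one of: {', '.join(group)}")
    return config


cfg = load_source_config("""
{
  "type": "postgresql",
  "connection": "postgresql://user:password@host:5432/dbname",
  "table": "users",
  "source_name": "production-users"
}
""")
```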
### YAML Example

```yaml
type: postgresql
connection: postgresql://user:password@host:5432/dbname
table: users
source_name: production-users
```

### Using a SQL Query

```yaml
type: postgresql
connection: postgresql://user:password@host:5432/dbname
query: "SELECT id, name, email FROM users WHERE active = true"
source_name: active-users
```

## Dual-Source Config (for `compare`)

The `compare` command accepts two data sources. You can provide a source config file that defines both `baseline` and `current`:

```yaml
baseline:
  type: postgresql
  connection: postgresql://user:pass@host/db
  table: users_baseline
current:
  type: postgresql
  connection: postgresql://user:pass@host/db
  table: users_current
```

Usage:

```bash
truthound compare --source-config compare_sources.yaml --method psi
```

Alternatively, you can specify individual files or connections for each source on the command line.

## Per-Backend Install Hints

Truthound uses optional dependency groups for database backends. Install only what you need:

| Backend | Install Command |
|---------|----------------|
| PostgreSQL | `pip install truthound[postgresql]` |
| MySQL | `pip install truthound[mysql]` |
| DuckDB | `pip install truthound[duckdb]` |
| SQL Server | `pip install truthound[mssql]` |
| BigQuery | `pip install truthound[bigquery]` |
| Snowflake | `pip install truthound[snowflake]` |
| All databases | `pip install truthound[databases]` |

SQLite support is included in the base install.
## Security Considerations

**Do not put passwords directly in CLI history.** Connection strings with embedded credentials are visible in shell history and process listings.

Recommended practices:

1. **Use environment variables:**

   ```bash
   export DB_CONN="postgresql://user:password@host/db"
   truthound check --connection "$DB_CONN" --table users
   ```

2. **Use source config files** with restricted file permissions:

   ```bash
   chmod 600 db_config.yaml
   truthound check --source-config db_config.yaml
   ```

3. **Use `.pgpass` or equivalent** credential files supported by your database client.

4. **Avoid inline passwords** in CI/CD pipelines. Use secrets management (GitHub Secrets, Vault, etc.) and inject via environment variables.

## Examples for Each Command

### check

```bash
# Validate a PostgreSQL table
truthound check --connection "postgresql://user:pass@host/db" --table orders

# Validate with source config
truthound check --source-config prod_db.yaml --strict
```

### scan

```bash
# Scan a database table for PII
truthound scan --connection "postgresql://user:pass@host/db" --table customers
```

### mask

```bash
# Mask PII in a database table and write to a file
truthound mask --connection "sqlite:///data.db" --table users -o masked_users.csv
```

### profile

```bash
# Profile a database table
truthound profile --connection "postgresql://user:pass@host/db" --table transactions
```

### learn

```bash
# Learn schema from a database table
truthound learn --connection "postgresql://user:pass@host/db" --table products -o schema.yaml
```

### compare

```bash
# Compare two database tables
truthound compare --source-config compare_sources.yaml --method psi --strict
```

### read

```bash
# Preview a database table
truthound read --connection "postgresql://user:pass@host/db" --table users --head 20

# Run a SQL query and export as CSV
truthound read --connection "sqlite:///data.db" --query "SELECT * FROM orders WHERE total > 100" --format csv -o high_orders.csv
```

## See Also

- [Data Sources Overview](index.md)
- [Database Connections](databases.md)
- [CLI Core Commands](../../cli/core/index.md)
"""Shared DataSource resolution for CLI commands.

This module provides a unified abstraction layer that resolves CLI
options (file path, connection string, or config file) into either
a file path string or a BaseDataSource instance. All core CLI commands
use this layer for consistent data source handling.

Architecture:
    CLI options → resolve_datasource() → (file_path | None, source | None)
                          ↓                          ↓
                  api.func(data=...)        api.func(source=...)
"""

from __future__ import annotations

import json
import logging
from pathlib import Path
from typing import TYPE_CHECKING, Annotated, Any, Optional

import typer

from truthound.cli_modules.common.errors import (
    CLIError,
    DataSourceError,
    ErrorCode,
    FileNotFoundError,
    require_file,
)

if TYPE_CHECKING:
    from truthound.datasources.base import BaseDataSource

logger = logging.getLogger(__name__)

# =============================================================================
# Reusable Annotated CLI Options
# =============================================================================
ConnectionOpt = Annotated[
    Optional[str],
    typer.Option(
        "--connection",
        "--conn",
        help=(
            "Database connection string. "
            "Examples: postgresql://user:pass@host:5432/db, "
            "mysql://user:pass@host/db, sqlite:///path/to.db"
        ),
    ),
]

TableOpt = Annotated[
    Optional[str],
    typer.Option(
        "--table",
        help="Database table name (required with --connection for SQL sources)",
    ),
]

QueryOpt = Annotated[
    Optional[str],
    typer.Option(
        "--query",
        help="SQL query to validate (alternative to --table)",
    ),
]

SourceConfigOpt = Annotated[
    Optional[Path],
    typer.Option(
        "--source-config",
        "--sc",
        help=(
            "Path to data source configuration file (JSON/YAML). "
            "See docs for config file format."
        ),
    ),
]

SourceNameOpt = Annotated[
    Optional[str],
    typer.Option(
        "--source-name",
        help="Custom name for the data source (used in report labels)",
    ),
]
# =============================================================================
# DataSource Resolution
# =============================================================================


def resolve_datasource(
    file: Path | None = None,
    connection: str | None = None,
    table: str | None = None,
    query: str | None = None,
    source_config: Path | None = None,
    source_name: str | None = None,
) -> tuple[str | None, "BaseDataSource | None"]:
    """Resolve CLI options into a file path or BaseDataSource instance.

    This is the central resolution function used by all CLI commands.
    It enforces mutual exclusivity between input modes and validates
    required parameters for each mode.

    Args:
        file: Path to a data file (CSV, JSON, Parquet, etc.)
        connection: Database connection string
        table: Database table name (for SQL sources)
        query: SQL query string (alternative to table)
        source_config: Path to a JSON/YAML data source config file
        source_name: Custom label for the data source

    Returns:
        A tuple of (file_path, source) where exactly one is non-None.

    Raises:
        DataSourceError: If inputs are invalid or conflicting.
        FileNotFoundError: If the specified file does not exist.
    """
    _validate_input_exclusivity(file, connection, source_config)

    # Mode 1: Source config file
    if source_config is not None:
        require_file(source_config, "Source config file")
        config = parse_source_config(source_config)
        source = create_datasource_from_config(config)
        if source_name:
            _set_source_name(source, source_name)
        return None, source

    # Mode 2: Connection string
    if connection is not None:
        source = _create_from_connection(connection, table, query, source_name)
        return None, source

    # Mode 3: File path (legacy, default)
    if file is not None:
        require_file(file)
        return str(file), None

    # No input provided
    raise DataSourceError(
        "No data input specified.",
        hint=(
            "Provide one of:\n"
            "  - A file path: truthound <command> data.csv\n"
            "  - A connection: truthound <command> --connection 'postgresql://...' --table users\n"
            "  - A config file: truthound <command> --source-config db.yaml"
        ),
    )
def resolve_compare_sources(
    baseline: Path | None = None,
    current: Path | None = None,
    source_config: Path | None = None,
) -> tuple[
    tuple[str | None, "BaseDataSource | None"],
    tuple[str | None, "BaseDataSource | None"],
]:
    """Resolve inputs for the compare command (dual-source).

    Args:
        baseline: Baseline file path
        current: Current file path
        source_config: Config file with baseline/current sections

    Returns:
        Tuple of (baseline_resolution, current_resolution),
        each a (file_path | None, source | None) pair.

    Raises:
        DataSourceError: If inputs are invalid or conflicting.
    """
    if source_config is not None:
        if baseline is not None or current is not None:
            raise DataSourceError(
                "Cannot specify both file paths and --source-config for compare.",
                hint="Use either positional file args OR --source-config, not both.",
            )
        require_file(source_config, "Source config file")
        config = parse_source_config(source_config)
        baseline_cfg = config.get("baseline")
        current_cfg = config.get("current")
        if not baseline_cfg or not current_cfg:
            raise DataSourceError(
                "Compare source config must have 'baseline' and 'current' sections.",
                hint=(
                    "Example config:\n"
                    "  baseline:\n"
                    "    connection: postgresql://...\n"
                    "    table: train_data\n"
                    "  current:\n"
                    "    connection: postgresql://...\n"
                    "    table: prod_data"
                ),
            )
        baseline_source = create_datasource_from_config(baseline_cfg)
        current_source = create_datasource_from_config(current_cfg)
        return (None, baseline_source), (None, current_source)

    # File-based path
    if baseline is None or current is None:
        raise DataSourceError(
            "Both baseline and current data must be specified.",
            hint=(
                "Provide two file paths:\n"
                "  truthound compare baseline.csv current.csv\n"
                "Or use --source-config with baseline/current sections."
            ),
        )
    require_file(baseline, "Baseline file")
    require_file(current, "Current file")
    return (str(baseline), None), (str(current), None)
# =============================================================================
# Config File Parsing
# =============================================================================


def parse_source_config(config_path: Path) -> dict[str, Any]:
    """Parse a data source configuration file (JSON or YAML).

    Supported formats:
        - JSON (.json)
        - YAML (.yaml, .yml)

    Config schema for single source:
        type: postgresql
        connection: "postgresql://user:pass@host:5432/db"
        table: users

    Config schema for compare (dual source):
        baseline:
            connection: "postgresql://..."
            table: train_data
        current:
            connection: "postgresql://..."
            table: prod_data

    Args:
        config_path: Path to the configuration file.

    Returns:
        Parsed configuration dictionary.

    Raises:
        DataSourceError: If the file cannot be parsed.
    """
    content = config_path.read_text(encoding="utf-8")
    suffix = config_path.suffix.lower()

    if suffix == ".json":
        try:
            config = json.loads(content)
        except json.JSONDecodeError as e:
            raise DataSourceError(
                f"Invalid JSON in source config: {e}",
                hint=f"Check the syntax of {config_path}",
            ) from e
    elif suffix in (".yaml", ".yml"):
        try:
            import yaml

            config = yaml.safe_load(content)
        except ImportError as e:
            raise DataSourceError(
                "YAML config requires PyYAML.",
                hint="Install with: pip install pyyaml",
            ) from e
        except Exception as e:
            raise DataSourceError(
                f"Invalid YAML in source config: {e}",
                hint=f"Check the syntax of {config_path}",
            ) from e
    else:
        raise DataSourceError(
            f"Unsupported config file format: {suffix}",
            hint="Use .json, .yaml, or .yml",
        )

    if not isinstance(config, dict):
        raise DataSourceError(
            "Source config must be a JSON/YAML object (dictionary).",
            hint=f"Check {config_path}",
        )

    return config
def create_datasource_from_config(config: dict[str, Any]) -> "BaseDataSource":
    """Create a BaseDataSource from a parsed configuration dictionary.

    Supports two config styles:

    1. Connection string style:
        {"connection": "postgresql://...", "table": "users"}

    2. Individual parameters style:
        {"type": "postgresql", "host": "localhost", "database": "mydb",
         "user": "postgres", "password": "...", "table": "users"}

    Args:
        config: Configuration dictionary.

    Returns:
        Configured BaseDataSource instance.

    Raises:
        DataSourceError: If the config is invalid or the backend is unavailable.
    """
    from truthound.datasources.factory import get_sql_datasource
    from truthound.datasources.sql import get_available_sources

    connection = config.get("connection")
    table = config.get("table")
    query = config.get("query")
    source_type = config.get("type")

    # Style 1: Connection string
    if connection:
        if not table and not query:
            raise DataSourceError(
                "Config with 'connection' requires 'table' or 'query'.",
                hint="Add a 'table' or 'query' field to your config file.",
            )
        try:
            return get_sql_datasource(
                connection, table=table or "__query__", query=query
            )
        except Exception as e:
            raise DataSourceError(
                f"Failed to create data source from connection string: {e}",
                source_type=source_type,
            ) from e

    # Style 2: Individual parameters with type
    if not source_type:
        raise DataSourceError(
            "Config must have either 'connection' or 'type' field.",
            hint=(
                "Example:\n"
                "  connection: postgresql://user:pass@host:5432/db\n"
                "  table: users\n"
                "Or:\n"
                "  type: postgresql\n"
                "  host: localhost\n"
                "  database: mydb\n"
                "  table: users"
            ),
        )

    if not table and not query:
        raise DataSourceError(
            f"Config for type '{source_type}' requires 'table' or 'query'.",
        )

    available = get_available_sources()
    source_cls = available.get(source_type)
    if source_cls is None:
        available_names = [k for k, v in available.items() if v is not None]
        raise DataSourceError(
            f"Data source type '{source_type}' is not available.",
            source_type=source_type,
            hint=(
                f"Available types: {', '.join(available_names)}. "
                f"You may need to install the required driver."
            ),
        )

    # Build constructor kwargs from config, excluding meta keys that are not
    # constructor parameters ('connection' and 'source_name' are handled by
    # other code paths and would otherwise raise a TypeError below).
    meta_keys = {"type", "table", "query", "name", "connection", "source_name"}
    kwargs: dict[str, Any] = {}
    if table:
        kwargs["table"] = table
    if query:
        kwargs["query"] = query
    for key, value in config.items():
        if key not in meta_keys:
            kwargs[key] = value

    try:
        return source_cls(**kwargs)
    except TypeError as e:
        raise DataSourceError(
            f"Invalid config for '{source_type}': {e}",
            source_type=source_type,
            hint=f"Check the supported parameters for {source_type} data source.",
        ) from e
    except Exception as e:
        raise DataSourceError(
            f"Failed to create '{source_type}' data source: {e}",
            source_type=source_type,
        ) from e
# =============================================================================
# Internal Helpers
# =============================================================================


def _validate_input_exclusivity(
    file: Path | None,
    connection: str | None,
    source_config: Path | None,
) -> None:
    """Validate that at most one data input mode is specified."""
    modes = []
    if file is not None:
        modes.append("file argument")
    if connection is not None:
        modes.append("--connection")
    if source_config is not None:
        modes.append("--source-config")
    if len(modes) > 1:
        raise DataSourceError(
            f"Conflicting data inputs: {' and '.join(modes)}.",
            hint="Specify only one: a file path, --connection, or --source-config.",
        )


def _create_from_connection(
    connection: str,
    table: str | None,
    query: str | None,
    source_name: str | None,
) -> "BaseDataSource":
    """Create a BaseDataSource from a connection string."""
    from truthound.datasources.factory import get_sql_datasource

    if not table and not query:
        raise DataSourceError(
            "--table or --query is required with --connection.",
            hint=(
                "Example:\n"
                "  --connection 'postgresql://user:pass@host/db' --table users\n"
                "  --connection 'sqlite:///data.db' --query 'SELECT * FROM orders'"
            ),
        )

    try:
        target = table or "__query__"
        source = get_sql_datasource(connection, table=target, query=query)
    except ImportError as e:
        _raise_driver_hint(connection, e)
    except Exception as e:
        raise DataSourceError(
            f"Failed to connect: {e}",
            hint="Check the connection string format and database availability.",
        )

    if source_name:
        _set_source_name(source, source_name)
    return source


def _set_source_name(source: "BaseDataSource", name: str) -> None:
    """Attempt to set a custom name on a data source."""
    if hasattr(source, "config") and hasattr(source.config, "name"):
        try:
            source.config.name = name
        except (AttributeError, TypeError):
            pass


def _raise_driver_hint(connection: str, error: ImportError) -> None:
    """Raise a DataSourceError with install hints based on connection string."""
    conn_lower = connection.lower()
    hints = {
        "postgresql": ("psycopg2-binary", "pip install truthound[postgresql]"),
        "postgres": ("psycopg2-binary", "pip install truthound[postgresql]"),
        "mysql": ("pymysql", "pip install truthound[mysql]"),
        "oracle": ("oracledb", "pip install oracledb"),
        "mssql": ("pyodbc", "pip install pyodbc"),
        "sqlserver": ("pyodbc", "pip install pyodbc"),
        "bigquery": ("google-cloud-bigquery", "pip install truthound[bigquery]"),
        "snowflake": ("snowflake-connector-python", "pip install truthound[snowflake]"),
        "redshift": ("redshift-connector", "pip install truthound[redshift]"),
        "databricks": ("databricks-sql-connector", "pip install truthound[databricks]"),
        "duckdb": ("duckdb", "pip install duckdb"),
    }
    for prefix, (pkg, install_cmd) in hints.items():
        if prefix in conn_lower:
            raise DataSourceError(
                f"Missing driver for {prefix}: {error}",
                source_type=prefix,
                hint=f"Install with: {install_cmd}",
            )
    raise DataSourceError(
        f"Missing driver: {error}",
        hint="Check that the required database driver is installed.",
    )
"""Read command - Read and preview data from various sources.

This module implements the ``truthound read`` command for loading,
inspecting, and exporting data from files and database connections.
"""

from __future__ import annotations

from pathlib import Path
from typing import Annotated, Optional

import typer

from truthound.cli_modules.common.datasource import (
    ConnectionOpt,
    QueryOpt,
    SourceConfigOpt,
    SourceNameOpt,
    TableOpt,
    resolve_datasource,
)
from truthound.cli_modules.common.errors import error_boundary
from truthound.cli_modules.common.options import parse_list_callback
@error_boundary
def read_cmd(
    file: Annotated[
        Optional[Path],
        typer.Argument(
            help="Path to the data file (CSV, JSON, Parquet, NDJSON)",
        ),
    ] = None,
    # -- DataSource Options --
    connection: ConnectionOpt = None,
    table: TableOpt = None,
    query: QueryOpt = None,
    source_config: SourceConfigOpt = None,
    source_name: SourceNameOpt = None,
    # -- Row Selection --
    sample: Annotated[
        Optional[int],
        typer.Option(
            "--sample",
            "-s",
            help="Return a random sample of N rows",
            min=1,
        ),
    ] = None,
    head: Annotated[
        Optional[int],
        typer.Option(
            "--head",
            "-n",
            help="Show only the first N rows",
            min=1,
        ),
    ] = None,
    # -- Column Selection --
    columns: Annotated[
        Optional[list[str]],
        typer.Option(
            "--columns",
            "-c",
            help="Columns to include (comma-separated)",
        ),
    ] = None,
    # -- Output Options --
    format: Annotated[
        str,
        typer.Option(
            "--format",
            "-f",
            help="Output format (table, csv, json, parquet, ndjson)",
        ),
    ] = "table",
    output: Annotated[
        Optional[Path],
        typer.Option("--output", "-o", help="Output file path"),
    ] = None,
    # -- Inspection Modes --
    schema_only: Annotated[
        bool,
        typer.Option(
            "--schema-only",
            help="Show only column names and types (no data loaded)",
        ),
    ] = False,
    count_only: Annotated[
        bool,
        typer.Option(
            "--count-only",
            help="Show only the row count",
        ),
    ] = False,
) -> None:
    """Read and preview data from files or databases.

    Load data from various sources and display a preview, export to
    another format, or inspect the schema. Supports files (CSV, Parquet,
    JSON) and SQL databases via --connection.

    Examples:
        truthound read data.csv
        truthound read data.parquet --head 20
        truthound read data.csv --format json -o output.json
        truthound read data.csv --columns id,name,age
        truthound read --connection "postgresql://user:pass@host/db" --table users
        truthound read --connection "sqlite:///data.db" --table orders --head 10
        truthound read --source-config db.yaml --sample 1000
        truthound read data.csv --schema-only
        truthound read data.csv --count-only
    """
    import polars as pl

    # Resolve data source
    data_path, source = resolve_datasource(
        file=file,
        connection=connection,
        table=table,
        query=query,
        source_config=source_config,
        source_name=source_name,
    )

    # Load data as LazyFrame
    if source is not None:
        lf = source.to_polars_lazyframe()
        label = source.name
    else:
        from truthound.adapters import to_lazyframe

        lf = to_lazyframe(data_path)
        label = data_path

    # Schema-only mode: no data collection needed
    if schema_only:
        schema = lf.collect_schema()
        typer.echo(f"Source: {label}")
        typer.echo(f"Columns: {len(schema)}\n")
        typer.echo(f"{'Column':<40} {'Type':<20}")
        typer.echo("-" * 60)
        for col_name, col_type in schema.items():
            typer.echo(f"{col_name:<40} {str(col_type):<20}")
        return

    # Count-only mode: minimal collection
    if count_only:
        row_count = lf.select(pl.len()).collect().item()
        typer.echo(f"Source: {label}")
        typer.echo(f"Rows: {row_count:,}")
        return

    # Collect data
    df = lf.collect()

    # Column selection
    column_list = parse_list_callback(columns) if columns else None
    if column_list:
        available = set(df.columns)
        missing = [c for c in column_list if c not in available]
        if missing:
            typer.echo(
                f"Warning: columns not found: {', '.join(missing)}", err=True
            )
        valid_cols = [c for c in column_list if c in available]
        if valid_cols:
            df = df.select(valid_cols)

    # Row selection
    if sample is not None and len(df) > sample:
        df = df.sample(n=sample, seed=42)
    if head is not None:
        df = df.head(head)

    # Output
    if format == "parquet" and output is None:
        typer.echo(
            "Error: --output is required for parquet format", err=True
        )
        raise typer.Exit(1)

    if output:
        _write_output(df, output, format)
        typer.echo(f"Data written to {output} ({len(df):,} rows)")
    else:
        _print_output(df, format, label)
| def _write_output(df: "pl.DataFrame", output: Path, fmt: str) -> None: | ||
| """Write DataFrame to a file in the specified format.""" | ||
| suffix = output.suffix.lower() | ||
| fmt_lower = fmt.lower() | ||
| if fmt_lower == "parquet" or suffix == ".parquet": | ||
| df.write_parquet(output) | ||
| elif fmt_lower == "csv" or suffix == ".csv": | ||
| df.write_csv(output) | ||
| elif fmt_lower == "json" or suffix == ".json": | ||
| df.write_json(output) | ||
| elif fmt_lower == "ndjson" or suffix == ".ndjson": | ||
| df.write_ndjson(output) | ||
| else: | ||
| # Neither format nor suffix recognized: fall back to CSV | ||
| df.write_csv(output) | ||
| def _print_output(df: "pl.DataFrame", fmt: str, label: str | None) -> None: | ||
| """Print DataFrame to stdout.""" | ||
| import polars as pl | ||
| fmt_lower = fmt.lower() | ||
| if fmt_lower == "json": | ||
| typer.echo(df.write_json()) | ||
| elif fmt_lower == "csv": | ||
| typer.echo(df.write_csv()) | ||
| elif fmt_lower == "ndjson": | ||
| typer.echo(df.write_ndjson()) | ||
| else: | ||
| # Table format: use Polars' built-in display | ||
| if label: | ||
| typer.echo(f"Source: {label}") | ||
| typer.echo(f"Shape: {df.shape[0]:,} rows x {df.shape[1]} columns\n") | ||
| with pl.Config(tbl_rows=50, tbl_cols=20, fmt_str_lengths=80): | ||
| typer.echo(str(df)) |
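Note the dispatch order in `_write_output`: each branch matches on the explicit format *or* the output suffix, checked in the order parquet, csv, json, ndjson. A side effect is that a recognized suffix can override a different explicit format (for example, `--format json` with a `.parquet` output path still writes Parquet). A minimal sketch of that resolution, using a hypothetical `resolve_write_format` helper (not part of truthound's API):

```python
def resolve_write_format(fmt: str, suffix: str) -> str:
    """Hypothetical helper mirroring _write_output's branch order:
    each candidate matches on the explicit format OR the file suffix."""
    fmt_lower = fmt.lower()
    for candidate in ("parquet", "csv", "json", "ndjson"):
        if fmt_lower == candidate or suffix == f".{candidate}":
            return candidate
    return "csv"  # fallback branch: write CSV


print(resolve_write_format("json", ".parquet"))  # parquet: suffix matches first
print(resolve_write_format("table", ".ndjson"))  # ndjson
print(resolve_write_format("table", ".dat"))     # csv (fallback)
```

The ordered or-matching keeps the real function short, at the cost of the suffix-beats-format quirk shown in the first call.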
| """Tests for DataSource support across CLI commands. | ||
| Verifies that check, scan, mask, profile, learn, and compare commands | ||
| correctly accept and pass through database connection options. | ||
| """ | ||
| from __future__ import annotations | ||
| import pytest | ||
| from unittest.mock import MagicMock, patch | ||
| import polars as pl | ||
| import typer | ||
| from typer.testing import CliRunner | ||
| from truthound.cli_modules.core.check import check_cmd | ||
| from truthound.cli_modules.core.scan import scan_cmd | ||
| from truthound.cli_modules.core.mask import mask_cmd | ||
| from truthound.cli_modules.core.profile import profile_cmd | ||
| from truthound.cli_modules.core.learn import learn_cmd | ||
| from truthound.cli_modules.core.compare import compare_cmd | ||
| @pytest.fixture | ||
| def runner(): | ||
| return CliRunner() | ||
| def _make_app(cmd, name): | ||
| app = typer.Typer() | ||
| app.command(name=name)(cmd) | ||
| return app | ||
| @pytest.fixture | ||
| def sample_csv(tmp_path): | ||
| csv = tmp_path / "data.csv" | ||
| csv.write_text("id,name,age\n1,Alice,25\n2,Bob,30\n") | ||
| return csv | ||
| def _mock_sql_source(table_name="users"): | ||
| """Create a mock SQL data source returning a small DataFrame.""" | ||
| source = MagicMock() | ||
| source.name = table_name | ||
| lf = pl.LazyFrame({"id": [1, 2], "name": ["Alice", "Bob"], "age": [25, 30]}) | ||
| source.to_polars_lazyframe.return_value = lf | ||
| return source | ||
| # ============================================================================= | ||
| # Check with DataSource | ||
| # ============================================================================= | ||
| class TestCheckWithDatasource: | ||
| """Test check command accepts datasource options.""" | ||
| def test_check_with_connection(self, runner, sample_csv): | ||
| """--connection passes source= to check API.""" | ||
| app = _make_app(check_cmd, "check") | ||
| with ( | ||
| patch("truthound.datasources.factory.get_sql_datasource") as mock_sql, | ||
| patch("truthound.api.check") as mock_check, | ||
| ): | ||
| mock_sql.return_value = _mock_sql_source() | ||
| mock_report = MagicMock() | ||
| mock_report.has_issues = False | ||
| mock_report.exception_summary = None | ||
| mock_check.return_value = mock_report | ||
| result = runner.invoke(app, [ | ||
| "--connection", "postgresql://user:pass@host/db", | ||
| "--table", "users", | ||
| ]) | ||
| assert result.exit_code == 0 | ||
| # Verify source= was passed (not data path) | ||
| assert mock_check.call_args.kwargs.get("source") is not None | ||
| def test_check_file_and_connection_mutually_exclusive(self, runner, sample_csv): | ||
| """file + --connection raises error.""" | ||
| app = _make_app(check_cmd, "check") | ||
| result = runner.invoke(app, [ | ||
| str(sample_csv), | ||
| "--connection", "postgresql://host/db", | ||
| "--table", "t", | ||
| ]) | ||
| assert result.exit_code != 0 | ||
| # ============================================================================= | ||
| # Scan with DataSource | ||
| # ============================================================================= | ||
| class TestScanWithDatasource: | ||
| """Test scan command accepts datasource options.""" | ||
| def test_scan_with_connection(self, runner): | ||
| """--connection passes source= to scan API.""" | ||
| app = _make_app(scan_cmd, "scan") | ||
| with ( | ||
| patch("truthound.datasources.factory.get_sql_datasource") as mock_sql, | ||
| patch("truthound.api.scan") as mock_scan, | ||
| ): | ||
| mock_sql.return_value = _mock_sql_source() | ||
| mock_report = MagicMock() | ||
| mock_scan.return_value = mock_report | ||
| mock_report.print = MagicMock() | ||
| result = runner.invoke(app, [ | ||
| "--connection", "postgresql://user:pass@host/db", | ||
| "--table", "users", | ||
| ]) | ||
| assert result.exit_code == 0 | ||
| mock_scan.assert_called_once() | ||
| # scan(source=source) | ||
| assert mock_scan.call_args.kwargs.get("source") is not None | ||
| def test_scan_no_input_error(self, runner): | ||
| """No input produces an error.""" | ||
| app = _make_app(scan_cmd, "scan") | ||
| result = runner.invoke(app, []) | ||
| assert result.exit_code != 0 | ||
| # ============================================================================= | ||
| # Mask with DataSource | ||
| # ============================================================================= | ||
| class TestMaskWithDatasource: | ||
| """Test mask command accepts datasource options.""" | ||
| def test_mask_with_connection(self, runner, tmp_path): | ||
| """--connection passes source= to mask API.""" | ||
| app = _make_app(mask_cmd, "mask") | ||
| out = tmp_path / "masked.csv" | ||
| with ( | ||
| patch("truthound.datasources.factory.get_sql_datasource") as mock_sql, | ||
| patch("truthound.api.mask") as mock_mask, | ||
| ): | ||
| mock_sql.return_value = _mock_sql_source() | ||
| mock_df = pl.DataFrame({"id": [1, 2], "name": ["***", "***"]}) | ||
| mock_mask.return_value = mock_df | ||
| result = runner.invoke(app, [ | ||
| "--connection", "postgresql://user:pass@host/db", | ||
| "--table", "users", | ||
| "--output", str(out), | ||
| ]) | ||
| assert result.exit_code == 0 | ||
| assert out.exists() | ||
| mock_mask.assert_called_once() | ||
| # ============================================================================= | ||
| # Profile with DataSource | ||
| # ============================================================================= | ||
| class TestProfileWithDatasource: | ||
| """Test profile command accepts datasource options.""" | ||
| def test_profile_with_connection(self, runner): | ||
| """--connection passes source= to profile API.""" | ||
| app = _make_app(profile_cmd, "profile") | ||
| with ( | ||
| patch("truthound.datasources.factory.get_sql_datasource") as mock_sql, | ||
| patch("truthound.api.profile") as mock_profile, | ||
| ): | ||
| mock_sql.return_value = _mock_sql_source() | ||
| mock_report = MagicMock() | ||
| mock_profile.return_value = mock_report | ||
| mock_report.print = MagicMock() | ||
| result = runner.invoke(app, [ | ||
| "--connection", "postgresql://user:pass@host/db", | ||
| "--table", "users", | ||
| ]) | ||
| assert result.exit_code == 0 | ||
| mock_profile.assert_called_once() | ||
| assert mock_profile.call_args.kwargs.get("source") is not None | ||
| # ============================================================================= | ||
| # Learn with DataSource | ||
| # ============================================================================= | ||
| class TestLearnWithDatasource: | ||
| """Test learn command accepts datasource options.""" | ||
| def test_learn_with_connection(self, runner, tmp_path): | ||
| """--connection passes source= to learn API.""" | ||
| app = _make_app(learn_cmd, "learn") | ||
| out = tmp_path / "schema.yaml" | ||
| with ( | ||
| patch("truthound.datasources.factory.get_sql_datasource") as mock_sql, | ||
| patch("truthound.schema.learn") as mock_learn, | ||
| ): | ||
| mock_sql.return_value = _mock_sql_source() | ||
| mock_schema = MagicMock() | ||
| mock_schema.columns = ["id", "name", "age"] | ||
| mock_schema.row_count = 2 | ||
| mock_learn.return_value = mock_schema | ||
| result = runner.invoke(app, [ | ||
| "--connection", "postgresql://user:pass@host/db", | ||
| "--table", "users", | ||
| "--output", str(out), | ||
| ]) | ||
| assert result.exit_code == 0 | ||
| mock_learn.assert_called_once() | ||
| assert mock_learn.call_args.kwargs.get("source") is not None | ||
| # ============================================================================= | ||
| # Compare with DataSource Config | ||
| # ============================================================================= | ||
| class TestCompareWithDatasource: | ||
| """Test compare command accepts --source-config for dual sources.""" | ||
| def test_compare_with_source_config(self, runner, tmp_path): | ||
| """--source-config with baseline/current sections works.""" | ||
| app = _make_app(compare_cmd, "compare") | ||
| cfg = tmp_path / "drift.yaml" | ||
| cfg.write_text( | ||
| "baseline:\n" | ||
| " connection: 'postgresql://host/db'\n" | ||
| " table: train\n" | ||
| "current:\n" | ||
| " connection: 'postgresql://host/db'\n" | ||
| " table: prod\n" | ||
| ) | ||
| with ( | ||
| patch("truthound.datasources.factory.get_sql_datasource") as mock_sql, | ||
| patch("truthound.drift.compare") as mock_compare, | ||
| ): | ||
| source_b = MagicMock() | ||
| source_c = MagicMock() | ||
| lf_b = pl.LazyFrame({"x": [1, 2, 3]}) | ||
| lf_c = pl.LazyFrame({"x": [4, 5, 6]}) | ||
| source_b.to_polars_lazyframe.return_value = lf_b | ||
| source_c.to_polars_lazyframe.return_value = lf_c | ||
| mock_sql.side_effect = [source_b, source_c] | ||
| mock_report = MagicMock() | ||
| mock_report.has_drift = False | ||
| mock_compare.return_value = mock_report | ||
| mock_report.print = MagicMock() | ||
| result = runner.invoke(app, [ | ||
| "--source-config", str(cfg), | ||
| ]) | ||
| assert result.exit_code == 0 | ||
| mock_compare.assert_called_once() | ||
| def test_compare_files_still_works(self, runner, tmp_path): | ||
| """Positional file arguments still work.""" | ||
| app = _make_app(compare_cmd, "compare") | ||
| f1 = tmp_path / "base.csv" | ||
| f2 = tmp_path / "curr.csv" | ||
| f1.write_text("x\n1\n2\n3\n") | ||
| f2.write_text("x\n4\n5\n6\n") | ||
| with patch("truthound.drift.compare") as mock_compare: | ||
| mock_report = MagicMock() | ||
| mock_report.has_drift = False | ||
| mock_compare.return_value = mock_report | ||
| mock_report.print = MagicMock() | ||
| result = runner.invoke(app, [str(f1), str(f2)]) | ||
| assert result.exit_code == 0 | ||
| mock_compare.assert_called_once() |
| """Tests for the shared DataSource resolution layer. | ||
| Tests cover: resolve_datasource(), resolve_compare_sources(), | ||
| parse_source_config(), create_datasource_from_config(), and | ||
| input validation logic. | ||
| """ | ||
| from __future__ import annotations | ||
| import json | ||
| import pytest | ||
| from unittest.mock import MagicMock, patch | ||
| from truthound.cli_modules.common.datasource import ( | ||
| create_datasource_from_config, | ||
| parse_source_config, | ||
| resolve_compare_sources, | ||
| resolve_datasource, | ||
| ) | ||
| from truthound.cli_modules.common.errors import DataSourceError | ||
| # ============================================================================= | ||
| # Fixtures | ||
| # ============================================================================= | ||
| @pytest.fixture | ||
| def sample_csv(tmp_path): | ||
| """Create a sample CSV file.""" | ||
| csv = tmp_path / "data.csv" | ||
| csv.write_text("id,name\n1,Alice\n2,Bob\n") | ||
| return csv | ||
| @pytest.fixture | ||
| def source_config_json(tmp_path): | ||
| """Create a JSON source config file.""" | ||
| cfg = tmp_path / "source.json" | ||
| cfg.write_text(json.dumps({ | ||
| "connection": "postgresql://user:pass@host:5432/db", | ||
| "table": "users", | ||
| })) | ||
| return cfg | ||
| @pytest.fixture | ||
| def source_config_yaml(tmp_path): | ||
| """Create a YAML source config file.""" | ||
| cfg = tmp_path / "source.yaml" | ||
| cfg.write_text( | ||
| "connection: 'postgresql://user:pass@host:5432/db'\n" | ||
| "table: users\n" | ||
| ) | ||
| return cfg | ||
| @pytest.fixture | ||
| def compare_config_yaml(tmp_path): | ||
| """Create a YAML compare config file with baseline/current sections.""" | ||
| cfg = tmp_path / "compare.yaml" | ||
| cfg.write_text( | ||
| "baseline:\n" | ||
| " connection: 'postgresql://user:pass@host/db'\n" | ||
| " table: train_data\n" | ||
| "current:\n" | ||
| " connection: 'postgresql://user:pass@host/db'\n" | ||
| " table: prod_data\n" | ||
| ) | ||
| return cfg | ||
| # ============================================================================= | ||
| # resolve_datasource | ||
| # ============================================================================= | ||
| class TestResolveDatasource: | ||
| """Tests for resolve_datasource().""" | ||
| def test_file_only_returns_path(self, sample_csv): | ||
| """File-only input returns (str_path, None).""" | ||
| data_path, source = resolve_datasource(file=sample_csv) | ||
| assert data_path == str(sample_csv) | ||
| assert source is None | ||
| def test_no_input_raises_error(self): | ||
| """No input raises DataSourceError.""" | ||
| with pytest.raises(DataSourceError, match="No data input specified"): | ||
| resolve_datasource() | ||
| def test_file_not_found_raises_error(self, tmp_path): | ||
| """Non-existent file raises error.""" | ||
| fake = tmp_path / "nonexistent.csv" | ||
| # May surface as FileNotFoundError or DataSourceError | ||
| with pytest.raises(Exception): | ||
| resolve_datasource(file=fake) | ||
| def test_file_and_connection_mutually_exclusive(self, sample_csv): | ||
| """Providing both file and connection raises error.""" | ||
| with pytest.raises(DataSourceError, match="Conflicting"): | ||
| resolve_datasource(file=sample_csv, connection="postgresql://host/db") | ||
| def test_file_and_source_config_mutually_exclusive(self, sample_csv, source_config_json): | ||
| """Providing both file and source_config raises error.""" | ||
| with pytest.raises(DataSourceError, match="Conflicting"): | ||
| resolve_datasource(file=sample_csv, source_config=source_config_json) | ||
| def test_connection_and_source_config_mutually_exclusive(self, source_config_json): | ||
| """Providing both connection and source_config raises error.""" | ||
| with pytest.raises(DataSourceError, match="Conflicting"): | ||
| resolve_datasource( | ||
| connection="postgresql://host/db", | ||
| source_config=source_config_json, | ||
| ) | ||
| def test_connection_without_table_raises_error(self): | ||
| """Connection without table or query raises error.""" | ||
| with pytest.raises(DataSourceError, match="--table or --query"): | ||
| resolve_datasource(connection="postgresql://user:pass@host/db") | ||
| @patch("truthound.datasources.factory.get_sql_datasource") | ||
| def test_connection_with_table_returns_source(self, mock_get_sql): | ||
| """Connection + table returns (None, source).""" | ||
| mock_source = MagicMock() | ||
| mock_get_sql.return_value = mock_source | ||
| data_path, source = resolve_datasource( | ||
| connection="postgresql://user:pass@host/db", | ||
| table="users", | ||
| ) | ||
| assert data_path is None | ||
| assert source is mock_source | ||
| mock_get_sql.assert_called_once_with( | ||
| "postgresql://user:pass@host/db", table="users", query=None | ||
| ) | ||
| @patch("truthound.datasources.factory.get_sql_datasource") | ||
| def test_connection_with_query_returns_source(self, mock_get_sql): | ||
| """Connection + query returns (None, source).""" | ||
| mock_source = MagicMock() | ||
| mock_get_sql.return_value = mock_source | ||
| data_path, source = resolve_datasource( | ||
| connection="postgresql://user:pass@host/db", | ||
| query="SELECT * FROM orders WHERE date > '2024-01-01'", | ||
| ) | ||
| assert data_path is None | ||
| assert source is mock_source | ||
| @patch("truthound.cli_modules.common.datasource.create_datasource_from_config") | ||
| @patch("truthound.cli_modules.common.datasource.parse_source_config") | ||
| def test_source_config_returns_source(self, mock_parse, mock_create, source_config_json): | ||
| """Source config file returns (None, source).""" | ||
| mock_config = {"connection": "postgresql://...", "table": "users"} | ||
| mock_parse.return_value = mock_config | ||
| mock_source = MagicMock() | ||
| mock_create.return_value = mock_source | ||
| data_path, source = resolve_datasource(source_config=source_config_json) | ||
| assert data_path is None | ||
| assert source is mock_source | ||
| mock_parse.assert_called_once_with(source_config_json) | ||
| mock_create.assert_called_once_with(mock_config) | ||
| @patch("truthound.datasources.factory.get_sql_datasource") | ||
| def test_source_name_applied(self, mock_get_sql): | ||
| """--source-name is applied to the data source.""" | ||
| mock_source = MagicMock() | ||
| mock_source.config = MagicMock() | ||
| mock_get_sql.return_value = mock_source | ||
| resolve_datasource( | ||
| connection="postgresql://host/db", | ||
| table="users", | ||
| source_name="my-label", | ||
| ) | ||
| # source_name should have been set | ||
| assert mock_source.config.name == "my-label" | ||
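The tests above pin down the validation contract of `resolve_datasource()`: exactly one of file / connection / source-config must be given, and a connection additionally needs `--table` or `--query`. A minimal sketch of that contract, using a simplified hypothetical `validate_inputs` (the real function also constructs and returns the data source):

```python
class DataSourceError(Exception):
    pass


def validate_inputs(file=None, connection=None, source_config=None,
                    table=None, query=None) -> str:
    """Sketch of the validation contract; returns which input was chosen."""
    provided = [name for name, value in (
        ("file", file),
        ("connection", connection),
        ("source-config", source_config),
    ) if value is not None]
    if len(provided) > 1:
        raise DataSourceError("Conflicting inputs: " + ", ".join(provided))
    if not provided:
        raise DataSourceError("No data input specified")
    if connection is not None and table is None and query is None:
        raise DataSourceError("--table or --query is required with --connection")
    return provided[0]


print(validate_inputs(connection="postgresql://host/db", table="users"))  # connection
```

Counting the provided inputs first keeps the conflict check symmetric, so every pairwise combination is rejected with the same error shape the tests match on.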
| # ============================================================================= | ||
| # resolve_compare_sources | ||
| # ============================================================================= | ||
| class TestResolveCompareSources: | ||
| """Tests for resolve_compare_sources().""" | ||
| def test_two_files_returns_paths(self, tmp_path): | ||
| """Two file paths return ((path1, None), (path2, None)).""" | ||
| f1 = tmp_path / "base.csv" | ||
| f2 = tmp_path / "curr.csv" | ||
| f1.write_text("a\n1\n") | ||
| f2.write_text("a\n2\n") | ||
| (bp, bs), (cp, cs) = resolve_compare_sources(baseline=f1, current=f2) | ||
| assert bp == str(f1) and bs is None | ||
| assert cp == str(f2) and cs is None | ||
| def test_missing_one_file_raises_error(self, tmp_path): | ||
| """Only one file provided raises error.""" | ||
| f1 = tmp_path / "base.csv" | ||
| f1.write_text("a\n1\n") | ||
| with pytest.raises(DataSourceError, match="Both baseline and current"): | ||
| resolve_compare_sources(baseline=f1) | ||
| def test_no_files_no_config_raises_error(self): | ||
| """No arguments raises error.""" | ||
| with pytest.raises(DataSourceError, match="Both baseline and current"): | ||
| resolve_compare_sources() | ||
| def test_files_and_config_raises_error(self, tmp_path, compare_config_yaml): | ||
| """Files + config raises error.""" | ||
| f1 = tmp_path / "base.csv" | ||
| f1.write_text("a\n1\n") | ||
| with pytest.raises(DataSourceError, match="Cannot specify both"): | ||
| resolve_compare_sources(baseline=f1, source_config=compare_config_yaml) | ||
| @patch("truthound.cli_modules.common.datasource.create_datasource_from_config") | ||
| @patch("truthound.cli_modules.common.datasource.parse_source_config") | ||
| def test_config_returns_dual_sources(self, mock_parse, mock_create, compare_config_yaml): | ||
| """Config file with baseline/current returns two sources.""" | ||
| mock_parse.return_value = { | ||
| "baseline": {"connection": "pg://...", "table": "train"}, | ||
| "current": {"connection": "pg://...", "table": "prod"}, | ||
| } | ||
| mock_source_b = MagicMock() | ||
| mock_source_c = MagicMock() | ||
| mock_create.side_effect = [mock_source_b, mock_source_c] | ||
| (bp, bs), (cp, cs) = resolve_compare_sources(source_config=compare_config_yaml) | ||
| assert bp is None and bs is mock_source_b | ||
| assert cp is None and cs is mock_source_c | ||
| @patch("truthound.cli_modules.common.datasource.parse_source_config") | ||
| def test_config_missing_baseline_raises_error(self, mock_parse, compare_config_yaml): | ||
| """Config missing baseline section raises error.""" | ||
| mock_parse.return_value = {"current": {"connection": "pg://...", "table": "t"}} | ||
| with pytest.raises(DataSourceError, match="baseline.*current"): | ||
| resolve_compare_sources(source_config=compare_config_yaml) | ||
| # ============================================================================= | ||
| # parse_source_config | ||
| # ============================================================================= | ||
| class TestParseSourceConfig: | ||
| """Tests for parse_source_config().""" | ||
| def test_parse_json(self, tmp_path): | ||
| """JSON config file is parsed correctly.""" | ||
| cfg = tmp_path / "cfg.json" | ||
| cfg.write_text(json.dumps({"connection": "pg://host/db", "table": "t"})) | ||
| result = parse_source_config(cfg) | ||
| assert result["connection"] == "pg://host/db" | ||
| assert result["table"] == "t" | ||
| def test_parse_yaml(self, tmp_path): | ||
| """YAML config file is parsed correctly.""" | ||
| cfg = tmp_path / "cfg.yaml" | ||
| cfg.write_text("connection: 'pg://host/db'\ntable: t\n") | ||
| result = parse_source_config(cfg) | ||
| assert result["connection"] == "pg://host/db" | ||
| assert result["table"] == "t" | ||
| def test_parse_yml(self, tmp_path): | ||
| """YML extension also works.""" | ||
| cfg = tmp_path / "cfg.yml" | ||
| cfg.write_text("connection: 'pg://host/db'\ntable: t\n") | ||
| result = parse_source_config(cfg) | ||
| assert result["table"] == "t" | ||
| def test_invalid_json_raises_error(self, tmp_path): | ||
| """Malformed JSON raises DataSourceError.""" | ||
| cfg = tmp_path / "bad.json" | ||
| cfg.write_text("{invalid json}") | ||
| with pytest.raises(DataSourceError, match="Invalid JSON"): | ||
| parse_source_config(cfg) | ||
| def test_non_dict_raises_error(self, tmp_path): | ||
| """Non-dict JSON content raises DataSourceError.""" | ||
| cfg = tmp_path / "arr.json" | ||
| cfg.write_text('["a", "b"]') | ||
| with pytest.raises(DataSourceError, match="must be a JSON/YAML object"): | ||
| parse_source_config(cfg) | ||
| def test_unsupported_extension_raises_error(self, tmp_path): | ||
| """Unsupported file extension raises DataSourceError.""" | ||
| cfg = tmp_path / "cfg.toml" | ||
| cfg.write_text("[table]\nname = 'x'") | ||
| with pytest.raises(DataSourceError, match="Unsupported config file format"): | ||
| parse_source_config(cfg) | ||
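These cases together describe `parse_source_config()`'s behavior: dispatch on extension (`.json`, `.yaml`, `.yml`), reject anything else, and require a mapping at the top level. A minimal hypothetical reimplementation of that contract (not truthound's actual code; the YAML branch assumes PyYAML):

```python
import json
from pathlib import Path


class DataSourceError(Exception):
    pass


def parse_source_config_sketch(path: Path) -> dict:
    """Sketch of the parsing contract the tests above describe."""
    suffix = path.suffix.lower()
    if suffix not in (".json", ".yaml", ".yml"):
        raise DataSourceError(f"Unsupported config file format: {suffix}")
    text = path.read_text()
    if suffix == ".json":
        try:
            data = json.loads(text)
        except json.JSONDecodeError as exc:
            raise DataSourceError(f"Invalid JSON in {path}: {exc}") from exc
    else:
        import yaml  # assumed dependency for YAML support
        data = yaml.safe_load(text)
    if not isinstance(data, dict):
        raise DataSourceError("Config must be a JSON/YAML object")
    return data
```

Checking the suffix before reading the file means an unsupported extension fails fast with a config error rather than an I/O error.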
| # ============================================================================= | ||
| # create_datasource_from_config | ||
| # ============================================================================= | ||
| class TestCreateDatasourceFromConfig: | ||
| """Tests for create_datasource_from_config().""" | ||
| @patch("truthound.datasources.factory.get_sql_datasource") | ||
| def test_connection_string_style(self, mock_get_sql): | ||
| """Config with 'connection' delegates to get_sql_datasource.""" | ||
| mock_source = MagicMock() | ||
| mock_get_sql.return_value = mock_source | ||
| result = create_datasource_from_config({ | ||
| "connection": "postgresql://host/db", | ||
| "table": "users", | ||
| }) | ||
| assert result is mock_source | ||
| def test_connection_without_table_raises_error(self): | ||
| """Config with 'connection' but no 'table' raises error.""" | ||
| with pytest.raises(DataSourceError, match="requires 'table' or 'query'"): | ||
| create_datasource_from_config({"connection": "postgresql://host/db"}) | ||
| def test_no_connection_no_type_raises_error(self): | ||
| """Config without 'connection' or 'type' raises error.""" | ||
| with pytest.raises(DataSourceError, match="must have either"): | ||
| create_datasource_from_config({"table": "users"}) | ||
| @patch("truthound.datasources.sql.get_available_sources") | ||
| def test_type_not_available_raises_error(self, mock_available): | ||
| """Unavailable type raises DataSourceError with available list.""" | ||
| mock_available.return_value = {"postgresql": MagicMock(), "mysql": MagicMock()} | ||
| with pytest.raises(DataSourceError, match="not available"): | ||
| create_datasource_from_config({ | ||
| "type": "oracle", | ||
| "table": "users", | ||
| "host": "localhost", | ||
| }) | ||
| @patch("truthound.datasources.sql.get_available_sources") | ||
| def test_type_style_creates_source(self, mock_available): | ||
| """Config with 'type' constructs from source class.""" | ||
| mock_cls = MagicMock() | ||
| mock_source = MagicMock() | ||
| mock_cls.return_value = mock_source | ||
| mock_available.return_value = {"postgresql": mock_cls} | ||
| result = create_datasource_from_config({ | ||
| "type": "postgresql", | ||
| "table": "users", | ||
| "host": "localhost", | ||
| "database": "mydb", | ||
| }) | ||
| assert result is mock_source | ||
| mock_cls.assert_called_once_with( | ||
| table="users", host="localhost", database="mydb" | ||
| ) |
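The final assertion is the interesting one: the source class is called with `table`, `host`, and `database` but *not* `type`, implying the `type`-style path pops the discriminator and forwards every remaining key as a constructor kwarg. A hedged sketch of that branch (hypothetical helper, not truthound's actual code):

```python
class DataSourceError(Exception):
    pass


def create_from_type_config(config: dict, available: dict):
    """Sketch of the 'type'-style branch: pop 'type', look up the source
    class, and pass every remaining key through as a constructor kwarg."""
    cfg = dict(config)  # copy so the caller's dict is not mutated
    source_type = cfg.pop("type")
    if source_type not in available:
        raise DataSourceError(
            f"Source type '{source_type}' not available; "
            f"available: {', '.join(sorted(available))}"
        )
    return available[source_type](**cfg)
```

Passing the remaining keys straight through keeps the config schema open-ended: each backend's constructor, not the resolver, decides which connection parameters it accepts.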
| """Tests for the ``truthound read`` CLI command.""" | ||
| from __future__ import annotations | ||
| import json | ||
| import pytest | ||
| from pathlib import Path | ||
| from unittest.mock import patch, MagicMock | ||
| import typer | ||
| from typer.testing import CliRunner | ||
| from truthound.cli_modules.core.read import read_cmd | ||
| @pytest.fixture | ||
| def runner(): | ||
| return CliRunner() | ||
| @pytest.fixture | ||
| def app(): | ||
| _app = typer.Typer() | ||
| _app.command(name="read")(read_cmd) | ||
| return _app | ||
| @pytest.fixture | ||
| def sample_csv(tmp_path): | ||
| csv = tmp_path / "data.csv" | ||
| csv.write_text( | ||
| "id,name,age,city\n" | ||
| "1,Alice,25,NYC\n" | ||
| "2,Bob,30,LA\n" | ||
| "3,Charlie,35,Chicago\n" | ||
| "4,Diana,40,Boston\n" | ||
| "5,Eve,28,Seattle\n" | ||
| ) | ||
| return csv | ||
| @pytest.fixture | ||
| def sample_json(tmp_path): | ||
| jf = tmp_path / "data.json" | ||
| data = [ | ||
| {"id": 1, "name": "Alice"}, | ||
| {"id": 2, "name": "Bob"}, | ||
| ] | ||
| jf.write_text(json.dumps(data)) | ||
| return jf | ||
| # ============================================================================= | ||
| # Basic Read | ||
| # ============================================================================= | ||
| class TestReadBasic: | ||
| """Basic file reading tests.""" | ||
| def test_read_csv(self, runner, app, sample_csv): | ||
| """Read CSV file outputs data.""" | ||
| result = runner.invoke(app, [str(sample_csv)]) | ||
| assert result.exit_code == 0 | ||
| assert "5 rows" in result.output or "Shape" in result.output | ||
| def test_read_no_input_error(self, runner, app): | ||
| """No input produces an error.""" | ||
| result = runner.invoke(app, []) | ||
| assert result.exit_code != 0 | ||
| def test_read_nonexistent_file_error(self, runner, app, tmp_path): | ||
| """Non-existent file produces an error.""" | ||
| fake = tmp_path / "missing.csv" | ||
| result = runner.invoke(app, [str(fake)]) | ||
| assert result.exit_code != 0 | ||
| # ============================================================================= | ||
| # Row/Column Selection | ||
| # ============================================================================= | ||
| class TestReadSelection: | ||
| """Row and column selection tests.""" | ||
| def test_head(self, runner, app, sample_csv): | ||
| """--head limits rows.""" | ||
| result = runner.invoke(app, [str(sample_csv), "--head", "2"]) | ||
| assert result.exit_code == 0 | ||
| assert "2 rows" in result.output or "Shape: 2" in result.output | ||
| def test_columns(self, runner, app, sample_csv): | ||
| """--columns selects specific columns.""" | ||
| result = runner.invoke(app, [str(sample_csv), "--columns", "id,name"]) | ||
| assert result.exit_code == 0 | ||
| assert "2 columns" in result.output or "x 2" in result.output | ||
| def test_columns_missing_warns(self, runner, app, sample_csv): | ||
| """Missing columns produce a warning.""" | ||
| result = runner.invoke(app, [str(sample_csv), "--columns", "id,nonexistent"]) | ||
| assert result.exit_code == 0 | ||
| assert "not found" in result.output | ||
| def test_head_and_columns(self, runner, app, sample_csv): | ||
| """--head and --columns together.""" | ||
| result = runner.invoke(app, [str(sample_csv), "--head", "3", "--columns", "name,age"]) | ||
| assert result.exit_code == 0 | ||
| def test_sample(self, runner, app, sample_csv): | ||
| """--sample returns subset.""" | ||
| result = runner.invoke(app, [str(sample_csv), "--sample", "2"]) | ||
| assert result.exit_code == 0 | ||
| assert "2 rows" in result.output or "Shape: 2" in result.output | ||
| # ============================================================================= | ||
| # Inspection Modes | ||
| # ============================================================================= | ||
| class TestReadInspection: | ||
| """Schema-only and count-only mode tests.""" | ||
| def test_schema_only(self, runner, app, sample_csv): | ||
| """--schema-only shows column names and types.""" | ||
| result = runner.invoke(app, [str(sample_csv), "--schema-only"]) | ||
| assert result.exit_code == 0 | ||
| assert "Column" in result.output | ||
| assert "Type" in result.output | ||
| assert "id" in result.output | ||
| assert "name" in result.output | ||
| def test_count_only(self, runner, app, sample_csv): | ||
| """--count-only shows just the row count.""" | ||
| result = runner.invoke(app, [str(sample_csv), "--count-only"]) | ||
| assert result.exit_code == 0 | ||
| assert "Rows:" in result.output | ||
| assert "5" in result.output | ||
| # ============================================================================= | ||
| # Output Formats | ||
| # ============================================================================= | ||
| class TestReadFormats: | ||
| """Output format tests.""" | ||
| def test_format_csv(self, runner, app, sample_csv): | ||
| """--format csv outputs CSV text.""" | ||
| result = runner.invoke(app, [str(sample_csv), "--format", "csv", "--head", "2"]) | ||
| assert result.exit_code == 0 | ||
| assert "id,name,age,city" in result.output | ||
| def test_format_json(self, runner, app, sample_csv): | ||
| """--format json outputs valid JSON.""" | ||
| result = runner.invoke(app, [str(sample_csv), "--format", "json", "--head", "2"]) | ||
| assert result.exit_code == 0 | ||
| data = json.loads(result.output) | ||
| # Polars write_json output is valid JSON (format may vary by version) | ||
| assert isinstance(data, (dict, list)) | ||
| def test_format_ndjson(self, runner, app, sample_csv): | ||
| """--format ndjson outputs newline-delimited JSON.""" | ||
| result = runner.invoke(app, [str(sample_csv), "--format", "ndjson", "--head", "2"]) | ||
| assert result.exit_code == 0 | ||
| lines = [line for line in result.output.strip().split("\n") if line.strip()] | ||
| assert len(lines) == 2 | ||
| def test_parquet_requires_output(self, runner, app, sample_csv): | ||
| """--format parquet without --output is an error.""" | ||
| result = runner.invoke(app, [str(sample_csv), "--format", "parquet"]) | ||
| assert result.exit_code == 1 | ||
| assert "required" in result.output.lower() | ||
| # ============================================================================= | ||
| # Output File | ||
| # ============================================================================= | ||
| class TestReadOutput: | ||
| """Output file tests.""" | ||
| def test_output_csv(self, runner, app, sample_csv, tmp_path): | ||
| """--output writes CSV file.""" | ||
| out = tmp_path / "out.csv" | ||
| result = runner.invoke(app, [str(sample_csv), "--output", str(out), "--head", "3"]) | ||
| assert result.exit_code == 0 | ||
| assert out.exists() | ||
| assert "written to" in result.output | ||
| content = out.read_text() | ||
| assert "id" in content | ||
| def test_output_json(self, runner, app, sample_csv, tmp_path): | ||
| """--output with json format writes JSON file.""" | ||
| out = tmp_path / "out.json" | ||
| result = runner.invoke(app, [ | ||
| str(sample_csv), "--output", str(out), "--format", "json", "--head", "2", | ||
| ]) | ||
| assert result.exit_code == 0 | ||
| assert out.exists() | ||
| def test_output_parquet(self, runner, app, sample_csv, tmp_path): | ||
| """--output with parquet format writes Parquet file.""" | ||
| out = tmp_path / "out.parquet" | ||
| result = runner.invoke(app, [ | ||
| str(sample_csv), "--output", str(out), "--format", "parquet", | ||
| ]) | ||
| assert result.exit_code == 0 | ||
| assert out.exists() | ||
| assert out.stat().st_size > 0 | ||
| # ============================================================================= | ||
| # DataSource Integration (mocked) | ||
| # ============================================================================= | ||
| class TestReadWithConnection: | ||
| """Test read command with mocked database connection.""" | ||
| @patch("truthound.datasources.factory.get_sql_datasource") | ||
| def test_read_with_connection(self, mock_get_sql, runner, app): | ||
| """--connection + --table uses DataSource.""" | ||
| import polars as pl | ||
| mock_source = MagicMock() | ||
| mock_source.name = "test_table" | ||
| mock_lf = pl.LazyFrame({"id": [1, 2], "name": ["a", "b"]}) | ||
| mock_source.to_polars_lazyframe.return_value = mock_lf | ||
| mock_get_sql.return_value = mock_source | ||
| result = runner.invoke(app, [ | ||
| "--connection", "postgresql://user:pass@host/db", | ||
| "--table", "users", | ||
| ]) | ||
| assert result.exit_code == 0 | ||
| assert "2 rows" in result.output or "Shape" in result.output | ||
| @patch("truthound.datasources.factory.get_sql_datasource") | ||
| def test_read_schema_only_with_connection(self, mock_get_sql, runner, app): | ||
| """--schema-only works with database source.""" | ||
| import polars as pl | ||
| mock_source = MagicMock() | ||
| mock_source.name = "test_table" | ||
| mock_lf = pl.LazyFrame({"id": [1], "name": ["a"]}) | ||
| mock_source.to_polars_lazyframe.return_value = mock_lf | ||
| mock_get_sql.return_value = mock_source | ||
| result = runner.invoke(app, [ | ||
| "--connection", "postgresql://user:pass@host/db", | ||
| "--table", "users", | ||
| "--schema-only", | ||
| ]) | ||
| assert result.exit_code == 0 | ||
| assert "id" in result.output | ||
| assert "name" in result.output |
@@ -8,3 +8,3 @@ # truthound check | ||
| ```bash | ||
| truthound check <file> [OPTIONS] | ||
| truthound check [FILE] [OPTIONS] | ||
| ``` | ||
@@ -16,4 +16,14 @@ | ||
| |----------|----------|-------------| | ||
| | `file` | Yes | Path to the data file (CSV, JSON, Parquet, NDJSON, JSONL) | | ||
| | `file` | No | Path to the data file (CSV, JSON, Parquet, NDJSON, JSONL) | | ||
| ## Data Source Options | ||
| | Option | Short | Default | Description | | ||
| |--------|-------|---------|-------------| | ||
| | `--connection` | `--conn` | None | Database connection string | | ||
| | `--table` | | None | Database table name | | ||
| | `--query` | | None | SQL query (alternative to `--table`) | | ||
| | `--source-config` | `--sc` | None | Path to data source config file (JSON/YAML) | | ||
| | `--source-name` | | None | Custom label for the data source | | ||
| ## Options | ||
@@ -42,2 +52,5 @@ | ||
| | `--max-unexpected-rows` | | `1000` | Maximum number of unexpected rows to include | | ||
| | `--partial-unexpected-count` | | `20` | Maximum number of unexpected values in partial list (BASIC+) | | ||
| | `--include-unexpected-index` | | `false` | Include row index for each unexpected value in results | | ||
| | `--return-debug-query` | | `false` | Include Polars debug query expression in results (COMPLETE level) | | ||
@@ -52,2 +65,11 @@ ### Exception Handling Options (VE-5) | ||
| ### Execution Options | ||
| | Option | Short | Default | Description | | ||
| |--------|-------|---------|-------------| | ||
| | `--parallel` / `--no-parallel` | | `false` | Enable DAG-based parallel execution with dependency-aware scheduling | | ||
| | `--max-workers` | | Auto | Maximum worker threads (only with `--parallel`). Defaults to `min(32, cpu_count + 4)` | | ||
| | `--pushdown` / `--no-pushdown` | | Auto | Enable query pushdown for SQL data sources. Auto-detects by default | | ||
| | `--use-engine` / `--no-use-engine` | | `false` | Use execution engine for validation (experimental) | | ||
| ## Description | ||
@@ -149,2 +171,73 @@ | ||
| ### Parallel Execution | ||
| Enable DAG-based parallel execution for large validator sets: | ||
| ```bash | ||
| # Enable parallel execution with automatic worker count | ||
| truthound check data.csv --parallel | ||
| # Control the number of worker threads | ||
| truthound check data.csv --parallel --max-workers 8 | ||
| # Combine with other options | ||
| truthound check data.csv --parallel --max-workers 4 --rf summary --strict | ||
| ``` | ||
| Validators are organized into dependency levels (Schema → Completeness → Uniqueness → Distribution → Referential) and executed concurrently within each level. | ||
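The level-by-level scheduling described above can be sketched as follows. This is an illustrative example only, not truthound's actual internals: the level grouping, validator names, and `run_validator` stub are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical dependency levels; each level starts only after the previous finishes.
levels = [
    ["schema"],                 # level 0: schema
    ["null", "not_empty"],      # level 1: completeness
    ["unique"],                 # level 2: uniqueness
    ["range", "distribution"],  # level 3: distribution
    ["foreign_key"],            # level 4: referential
]

def run_validator(name: str) -> str:
    # Placeholder for real validation work.
    return f"{name}: ok"

results: list[str] = []
with ThreadPoolExecutor(max_workers=8) as pool:
    for level in levels:
        # Validators within one level have no mutual dependencies,
        # so they can run concurrently; pool.map preserves input order.
        results.extend(pool.map(run_validator, level))

print(results)
```

Within each level the worker pool is bounded by `--max-workers`; between levels there is a hard barrier, which is what makes the schedule dependency-aware.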
| ### Advanced Result Format Control | ||
| Fine-tune the detail level of validation results: | ||
| ```bash | ||
| # Control partial unexpected list size | ||
| truthound check data.csv --rf basic --partial-unexpected-count 50 | ||
| # Include row indices for unexpected values | ||
| truthound check data.csv --rf summary --include-unexpected-index | ||
| # Include Polars debug query in results (for troubleshooting) | ||
| truthound check data.csv --rf complete --return-debug-query | ||
| # All fine-grained options combined | ||
| truthound check data.csv --rf complete \ | ||
| --include-unexpected-rows \ | ||
| --max-unexpected-rows 500 \ | ||
| --partial-unexpected-count 100 \ | ||
| --include-unexpected-index \ | ||
| --return-debug-query | ||
| ``` | ||
| ### Database Validation | ||
| Validate data directly from a database connection: | ||
| ```bash | ||
| # Validate a PostgreSQL table | ||
| truthound check --connection "postgresql://user:pass@host/db" --table users | ||
| # Validate with a SQL query | ||
| truthound check --connection "sqlite:///data.db" --query "SELECT * FROM orders WHERE status = 'active'" | ||
| # Validate using a source config file | ||
| truthound check --source-config db_config.yaml --strict | ||
| # Combine with other options | ||
| truthound check --connection "postgresql://user:pass@host/db" --table users \ | ||
| -v null,unique --rf summary --strict | ||
| ``` | ||
| ### Query Pushdown | ||
| For SQL data sources, enable server-side validation: | ||
| ```bash | ||
| # Auto-detect pushdown capability | ||
| truthound check data.csv --pushdown | ||
| # Explicitly disable pushdown | ||
| truthound check data.csv --no-pushdown | ||
| ``` | ||
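To make "server-side validation" concrete: instead of fetching every row and checking it client-side, a pushdown-capable validator can be expressed as a single aggregate query that the database executes. The SQL below is a conceptual illustration, not the query truthound actually generates.

```python
# Hypothetical pushdown translation for a null-count check on users.email.
table, column = "users", "email"

# Without pushdown: fetch all rows, then count nulls in Polars client-side.
# With pushdown: ship one aggregate to the server and fetch a single number.
pushed = f"SELECT COUNT(*) AS null_count FROM {table} WHERE {column} IS NULL"
print(pushed)
```

The payoff is that only the aggregate result crosses the wire, which is why pushdown is auto-detected and enabled for SQL sources by default.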
| ### Exception Handling (VE-5) | ||
@@ -151,0 +244,0 @@ |
@@ -18,2 +18,9 @@ # truthound compare | ||
| ## Data Source Options | ||
| | Option | Short | Default | Description | | ||
| |--------|-------|---------|-------------| | ||
| | `--source-config` | `--sc` | None | Path to data source config file (JSON/YAML) for dual-source comparison | | ||
| | `--source-name` | | None | Custom label for the data source | | ||
| ## Options | ||
@@ -26,2 +33,3 @@ | ||
| | `--threshold` | `-t` | Auto | Custom drift threshold | | ||
| | `--sample-size` | `--sample` | None | Sample size for large datasets (random sampling) | | ||
| | `--format` | `-f` | `console` | Output format (console, json) | | ||
@@ -133,2 +141,15 @@ | `--output` | `-o` | None | Output file path | | ||
| ### Large Dataset Sampling | ||
| For large datasets, use sampling for faster comparison: | ||
| ```bash | ||
| # Sample 10,000 rows from each dataset | ||
| truthound compare big_train.csv big_prod.csv --sample-size 10000 | ||
| # Combine with method and threshold | ||
| truthound compare large_baseline.parquet large_current.parquet \ | ||
| --sample-size 50000 --method psi --threshold 0.15 | ||
| ``` | ||
| ### Custom Threshold | ||
@@ -135,0 +156,0 @@ |
@@ -14,2 +14,3 @@ # Core Commands | ||
| | [`profile`](profile.md) | Generate data profile | Data exploration | | ||
| | [`read`](read.md) | Read and preview data | Data inspection | | ||
| | [`compare`](compare.md) | Detect data drift | Model monitoring | | ||
@@ -114,2 +115,27 @@ | ||
| ## Data Source Options | ||
| All core commands accept data source options for reading directly from databases instead of files. When using these options, the file argument becomes optional. | ||
| | Option | Short | Description | | ||
| |--------|-------|-------------| | ||
| | `--connection` | `--conn` | Database connection string (e.g., `postgresql://user:pass@host/db`) | | ||
| | `--table` | | Database table name | | ||
| | `--query` | | SQL query (alternative to `--table`) | | ||
| | `--source-config` | `--sc` | Path to a data source config file (JSON/YAML) | | ||
| | `--source-name` | | Custom label for the data source | | ||
| ```bash | ||
| # Validate a database table directly | ||
| truthound check --connection "postgresql://user:pass@host/db" --table users --strict | ||
| # Profile from a source config file | ||
| truthound profile --source-config prod_db.yaml | ||
| # Read and preview database data | ||
| truthound read --connection "sqlite:///data.db" --table orders --head 20 | ||
| ``` | ||
| For full details on connection string formats, config files, and security best practices, see the [CLI Data Source Guide](../../guides/datasources/cli-datasource-guide.md). | ||
| ## CI/CD Integration | ||
@@ -130,2 +156,3 @@ | ||
| - [read](read.md) - Read and preview data | ||
| - [learn](learn.md) - Learn schema from data | ||
@@ -132,0 +159,0 @@ - [check](check.md) - Validate data quality |
@@ -8,3 +8,3 @@ # truthound learn | ||
| ```bash | ||
| truthound learn <file> [OPTIONS] | ||
| truthound learn [FILE] [OPTIONS] | ||
| ``` | ||
@@ -16,4 +16,14 @@ | ||
| |----------|----------|-------------| | ||
| | `file` | Yes | Path to the data file (CSV, JSON, Parquet, NDJSON, JSONL) | | ||
| | `file` | No | Path to the data file (CSV, JSON, Parquet, NDJSON, JSONL) | | ||
| ## Data Source Options | ||
| | Option | Short | Default | Description | | ||
| |--------|-------|---------|-------------| | ||
| | `--connection` | `--conn` | None | Database connection string | | ||
| | `--table` | | None | Database table name | | ||
| | `--query` | | None | SQL query (alternative to `--table`) | | ||
| | `--source-config` | `--sc` | None | Path to data source config file (JSON/YAML) | | ||
| | `--source-name` | | None | Custom label for the data source | | ||
| ## Options | ||
@@ -25,2 +35,3 @@ | ||
| | `--no-constraints` | | `false` | Don't infer constraints from data | | ||
| | `--categorical-threshold` | | `20` | Maximum unique values to treat a column as categorical | | ||
@@ -73,2 +84,19 @@ ## Description | ||
| ### Categorical Threshold | ||
| Control when columns are treated as categorical: | ||
| ```bash | ||
| # Default: columns with <= 20 unique values are categorical | ||
| truthound learn data.csv | ||
| # Higher threshold: treat columns with up to 50 unique values as categorical | ||
| truthound learn data.csv --categorical-threshold 50 | ||
| # Lower threshold: only truly low-cardinality columns | ||
| truthound learn data.csv --categorical-threshold 5 | ||
| ``` | ||
| Columns classified as categorical will have `allowed_values` in the generated schema, enabling strict enum validation. | ||
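The threshold rule itself is simple: a column qualifies as categorical when its distinct non-null value count does not exceed the threshold. A rough stdlib-only sketch, independent of truthound's actual inference code (the function name and input shape are made up for illustration):

```python
def infer_allowed_values(
    columns: dict[str, list], threshold: int = 20
) -> dict[str, list]:
    """Collect allowed_values for columns whose unique count is <= threshold."""
    allowed: dict[str, list] = {}
    for name, values in columns.items():
        uniques = {v for v in values if v is not None}  # nulls don't count
        if len(uniques) <= threshold:
            allowed[name] = sorted(uniques)
    return allowed

data = {
    "status": ["active", "inactive", "active", "active"],
    "user_id": [101, 102, 103, 104],
}
# With threshold=3: "status" (2 uniques) qualifies, "user_id" (4 uniques) does not.
print(infer_allowed_values(data, threshold=3))
```

Raising `--categorical-threshold` therefore trades stricter enum validation for the risk of freezing high-cardinality columns to the values seen at learn time.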
| ### From Different File Formats | ||
@@ -75,0 +103,0 @@ |
@@ -8,3 +8,3 @@ # truthound mask | ||
| ```bash | ||
| truthound mask <file> -o <output> [OPTIONS] | ||
| truthound mask [FILE] -o <output> [OPTIONS] | ||
| ``` | ||
@@ -16,4 +16,14 @@ | ||
| |----------|----------|-------------| | ||
| | `file` | Yes | Path to the data file (CSV, JSON, Parquet, NDJSON, JSONL) | | ||
| | `file` | No | Path to the data file (CSV, JSON, Parquet, NDJSON, JSONL) | | ||
| ## Data Source Options | ||
| | Option | Short | Default | Description | | ||
| |--------|-------|---------|-------------| | ||
| | `--connection` | `--conn` | None | Database connection string | | ||
| | `--table` | | None | Database table name | | ||
| | `--query` | | None | SQL query (alternative to `--table`) | | ||
| | `--source-config` | `--sc` | None | Path to data source config file (JSON/YAML) | | ||
| | `--source-name` | | None | Custom label for the data source | | ||
| ## Options | ||
@@ -20,0 +30,0 @@ |
@@ -8,3 +8,3 @@ # truthound profile | ||
| ```bash | ||
| truthound profile <file> [OPTIONS] | ||
| truthound profile [FILE] [OPTIONS] | ||
| ``` | ||
@@ -16,4 +16,14 @@ | ||
| |----------|----------|-------------| | ||
| | `file` | Yes | Path to the data file (CSV, JSON, Parquet, NDJSON, JSONL) | | ||
| | `file` | No | Path to the data file (CSV, JSON, Parquet, NDJSON, JSONL) | | ||
| ## Data Source Options | ||
| | Option | Short | Default | Description | | ||
| |--------|-------|---------|-------------| | ||
| | `--connection` | `--conn` | None | Database connection string | | ||
| | `--table` | | None | Database table name | | ||
| | `--query` | | None | SQL query (alternative to `--table`) | | ||
| | `--source-config` | `--sc` | None | Path to data source config file (JSON/YAML) | | ||
| | `--source-name` | | None | Custom label for the data source | | ||
| ## Options | ||
@@ -20,0 +30,0 @@ |
@@ -8,3 +8,3 @@ # truthound scan | ||
| ```bash | ||
| truthound scan <file> [OPTIONS] | ||
| truthound scan [FILE] [OPTIONS] | ||
| ``` | ||
@@ -16,4 +16,14 @@ | ||
| |----------|----------|-------------| | ||
| | `file` | Yes | Path to the data file (CSV, JSON, Parquet, NDJSON, JSONL) | | ||
| | `file` | No | Path to the data file (CSV, JSON, Parquet, NDJSON, JSONL) | | ||
| ## Data Source Options | ||
| | Option | Short | Default | Description | | ||
| |--------|-------|---------|-------------| | ||
| | `--connection` | `--conn` | None | Database connection string | | ||
| | `--table` | | None | Database table name | | ||
| | `--query` | | None | SQL query (alternative to `--table`) | | ||
| | `--source-config` | `--sc` | None | Path to data source config file (JSON/YAML) | | ||
| | `--source-name` | | None | Custom label for the data source | | ||
| ## Options | ||
@@ -20,0 +30,0 @@ |
| Metadata-Version: 2.4 | ||
| Name: truthound | ||
| Version: 1.3.2 | ||
| Version: 1.5.0 | ||
| Summary: Zero-Configuration Data Quality Framework Powered by Polars | ||
@@ -145,3 +145,3 @@ Project-URL: Homepage, https://github.com/seadonggyun4/Truthound | ||
| | Test Cases | 8,585+ | | ||
| | Validators | 264 | | ||
| | Validators | 289 | | ||
| | Validator Categories | 28 | | ||
@@ -208,6 +208,17 @@ | VE Test Cases | 316 (Validation Engine Enhancement) | | ||
| truthound check data.csv --catch-exceptions --max-retries 2 # Resilient mode | ||
| truthound check data.csv --parallel --max-workers 8 # DAG parallel execution | ||
| truthound check data.csv --return-debug-query --rf complete # Debug query output | ||
| truthound compare baseline.csv current.csv # Drift detection | ||
| truthound compare big.csv new.csv --sample-size 10000 # Sampled comparison | ||
| truthound learn data.csv --categorical-threshold 50 # Custom threshold | ||
| truthound scan data.csv # PII scanning | ||
| truthound auto-profile data.csv # Profiling | ||
| truthound new validator my_validator # Code scaffolding | ||
| # Database connections (all core commands support --connection/--table) | ||
| truthound check --connection "postgresql://user:pass@host/db" --table users | ||
| truthound scan --connection "sqlite:///data.db" --table orders | ||
| truthound read --connection "postgresql://host/db" --table users --head 20 | ||
| truthound read data.csv --schema-only # Inspect schema | ||
| truthound compare --source-config drift.yaml # Dual-source drift detection | ||
| ``` | ||
@@ -223,9 +234,12 @@ | ||
| |---------|-------------|-------------| | ||
| | `learn` | Learn schema from data | `--output`, `--no-constraints` | | ||
| | `check` | Validate data quality | `--validators`, `--exclude-columns`, `--validator-config`, `--min-severity`, `--schema`, `--strict`, `--format`, `--rf`, `--catch-exceptions`, `--max-retries` | | ||
| | `read` | Read and preview data | `--head`, `--sample`, `--columns`, `--schema-only`, `--count-only`, `--format` | | ||
| | `learn` | Learn schema from data | `--output`, `--no-constraints`, `--categorical-threshold` | | ||
| | `check` | Validate data quality | `--validators`, `--exclude-columns`, `--validator-config`, `--min-severity`, `--schema`, `--strict`, `--format`, `--rf`, `--catch-exceptions`, `--max-retries`, `--parallel`, `--max-workers`, `--pushdown`, `--partial-unexpected-count`, `--return-debug-query`, `--include-unexpected-index` | | ||
| | `scan` | Scan for PII | `--format`, `--output` | | ||
| | `mask` | Mask sensitive data | `--columns`, `--strategy` (redact/hash/fake), `--strict` | | ||
| | `profile` | Generate data profile | `--format`, `--output` | | ||
| | `compare` | Detect data drift | `--method` (auto/ks/psi/chi2/js), `--threshold`, `--strict` | | ||
| | `compare` | Detect data drift | `--method` (14 methods), `--threshold`, `--sample-size`, `--strict` | | ||
| All core commands accept **Data Source Options**: `--connection`/`--conn`, `--table`, `--query`, `--source-config`/`--sc`, `--source-name` for database connectivity (PostgreSQL, MySQL, SQLite, DuckDB, SQL Server, etc.). | ||
| ### Profiler Commands | ||
@@ -232,0 +246,0 @@ |
@@ -7,3 +7,3 @@ [build-system] | ||
| name = "truthound" | ||
| version = "1.3.2" | ||
| version = "1.5.0" | ||
| description = "Zero-Configuration Data Quality Framework Powered by Polars" | ||
@@ -10,0 +10,0 @@ readme = "README.md" |
@@ -48,3 +48,3 @@ <div align="center"> | ||
| | Test Cases | 8,585+ | | ||
| | Validators | 264 | | ||
| | Validators | 289 | | ||
| | Validator Categories | 28 | | ||
@@ -111,6 +111,17 @@ | VE Test Cases | 316 (Validation Engine Enhancement) | | ||
| truthound check data.csv --catch-exceptions --max-retries 2 # Resilient mode | ||
| truthound check data.csv --parallel --max-workers 8 # DAG parallel execution | ||
| truthound check data.csv --return-debug-query --rf complete # Debug query output | ||
| truthound compare baseline.csv current.csv # Drift detection | ||
| truthound compare big.csv new.csv --sample-size 10000 # Sampled comparison | ||
| truthound learn data.csv --categorical-threshold 50 # Custom threshold | ||
| truthound scan data.csv # PII scanning | ||
| truthound auto-profile data.csv # Profiling | ||
| truthound new validator my_validator # Code scaffolding | ||
| # Database connections (all core commands support --connection/--table) | ||
| truthound check --connection "postgresql://user:pass@host/db" --table users | ||
| truthound scan --connection "sqlite:///data.db" --table orders | ||
| truthound read --connection "postgresql://host/db" --table users --head 20 | ||
| truthound read data.csv --schema-only # Inspect schema | ||
| truthound compare --source-config drift.yaml # Dual-source drift detection | ||
| ``` | ||
@@ -126,9 +137,12 @@ | ||
| |---------|-------------|-------------| | ||
| | `learn` | Learn schema from data | `--output`, `--no-constraints` | | ||
| | `check` | Validate data quality | `--validators`, `--exclude-columns`, `--validator-config`, `--min-severity`, `--schema`, `--strict`, `--format`, `--rf`, `--catch-exceptions`, `--max-retries` | | ||
| | `read` | Read and preview data | `--head`, `--sample`, `--columns`, `--schema-only`, `--count-only`, `--format` | | ||
| | `learn` | Learn schema from data | `--output`, `--no-constraints`, `--categorical-threshold` | | ||
| | `check` | Validate data quality | `--validators`, `--exclude-columns`, `--validator-config`, `--min-severity`, `--schema`, `--strict`, `--format`, `--rf`, `--catch-exceptions`, `--max-retries`, `--parallel`, `--max-workers`, `--pushdown`, `--partial-unexpected-count`, `--return-debug-query`, `--include-unexpected-index` | | ||
| | `scan` | Scan for PII | `--format`, `--output` | | ||
| | `mask` | Mask sensitive data | `--columns`, `--strategy` (redact/hash/fake), `--strict` | | ||
| | `profile` | Generate data profile | `--format`, `--output` | | ||
| | `compare` | Detect data drift | `--method` (auto/ks/psi/chi2/js), `--threshold`, `--strict` | | ||
| | `compare` | Detect data drift | `--method` (14 methods), `--threshold`, `--sample-size`, `--strict` | | ||
| All core commands accept **Data Source Options**: `--connection`/`--conn`, `--table`, `--query`, `--source-config`/`--sc`, `--source-name` for database connectivity (PostgreSQL, MySQL, SQLite, DuckDB, SQL Server, etc.). | ||
| ### Profiler Commands | ||
@@ -135,0 +149,0 @@ |
@@ -59,3 +59,7 @@ """CLI error handling utilities. | ||
| # DataSource errors (55-59) | ||
| DATASOURCE_ERROR = 55 | ||
| DATASOURCE_CONNECTION_ERROR = 56 | ||
| # ============================================================================= | ||
@@ -226,2 +230,26 @@ # Exception Classes | ||
| class DataSourceError(CLIError): | ||
| """Error with data source connection or configuration.""" | ||
| def __init__( | ||
| self, | ||
| message: str, | ||
| source_type: str | None = None, | ||
| hint: str | None = None, | ||
| ) -> None: | ||
| """Initialize data source error. | ||
| Args: | ||
| message: Error message | ||
| source_type: Type of data source (e.g., "postgresql", "mysql") | ||
| hint: Resolution hint | ||
| """ | ||
| super().__init__( | ||
| message=message, | ||
| code=ErrorCode.DATASOURCE_ERROR, | ||
| details={"source_type": source_type} if source_type else {}, | ||
| hint=hint or "Check the connection string, credentials, and table name.", | ||
| ) | ||
| # ============================================================================= | ||
@@ -228,0 +256,0 @@ # Error Handler |
@@ -340,3 +340,92 @@ """Reusable CLI options and arguments. | ||
| # Parallel execution (DAG-based) | ||
| ParallelOpt = Annotated[ | ||
| bool, | ||
| typer.Option( | ||
| "--parallel/--no-parallel", | ||
| help=( | ||
| "Enable DAG-based parallel execution. " | ||
| "Validators are grouped by dependency level and executed concurrently." | ||
| ), | ||
| ), | ||
| ] | ||
| # Max workers for parallel execution | ||
| MaxWorkersOpt = Annotated[ | ||
| int | None, | ||
| typer.Option( | ||
| "--max-workers", | ||
| help=( | ||
| "Maximum worker threads for parallel execution. " | ||
| "Only effective with --parallel. " | ||
| "Defaults to min(32, cpu_count + 4)." | ||
| ), | ||
| min=1, | ||
| ), | ||
| ] | ||
| # Query pushdown for SQL data sources | ||
| PushdownOpt = Annotated[ | ||
| bool | None, | ||
| typer.Option( | ||
| "--pushdown/--no-pushdown", | ||
| help=( | ||
| "Enable query pushdown for SQL data sources. " | ||
| "Validation logic is executed server-side when possible. " | ||
| "Default: auto-detect based on data source type." | ||
| ), | ||
| ), | ||
| ] | ||
| # Execution engine (experimental) | ||
| UseEngineOpt = Annotated[ | ||
| bool, | ||
| typer.Option( | ||
| "--use-engine/--no-use-engine", | ||
| help="Use execution engine for validation (experimental).", | ||
| ), | ||
| ] | ||
| # Partial unexpected count | ||
| PartialUnexpectedCountOpt = Annotated[ | ||
| int, | ||
| typer.Option( | ||
| "--partial-unexpected-count", | ||
| help="Maximum number of unexpected values in partial list (BASIC+).", | ||
| min=0, | ||
| ), | ||
| ] | ||
| # Include unexpected index | ||
| IncludeUnexpectedIndexOpt = Annotated[ | ||
| bool, | ||
| typer.Option( | ||
| "--include-unexpected-index", | ||
| help="Include row index for each unexpected value in results.", | ||
| ), | ||
| ] | ||
| # Return debug query | ||
| ReturnDebugQueryOpt = Annotated[ | ||
| bool, | ||
| typer.Option( | ||
| "--return-debug-query", | ||
| help="Include Polars debug query expression in results (COMPLETE level).", | ||
| ), | ||
| ] | ||
| # Categorical threshold for schema learning | ||
| CategoricalThresholdOpt = Annotated[ | ||
| int, | ||
| typer.Option( | ||
| "--categorical-threshold", | ||
| help=( | ||
| "Maximum unique values to treat a column as categorical " | ||
| "during schema inference." | ||
| ), | ||
| min=1, | ||
| ), | ||
| ] | ||
| # ============================================================================= | ||
@@ -343,0 +432,0 @@ # Option Groups (for related options) |
| """Core CLI commands for Truthound. | ||
| This package contains the fundamental CLI commands: | ||
| - read: Read and preview data | ||
| - learn: Learn schema from data files | ||
@@ -14,2 +15,3 @@ - check: Validate data quality | ||
| from truthound.cli_modules.core.read import read_cmd | ||
| from truthound.cli_modules.core.learn import learn_cmd | ||
@@ -32,2 +34,3 @@ from truthound.cli_modules.core.check import check_cmd | ||
| """ | ||
| parent_app.command(name="read")(read_cmd) | ||
| parent_app.command(name="learn")(learn_cmd) | ||
@@ -44,2 +47,3 @@ parent_app.command(name="check")(check_cmd) | ||
| "register_commands", | ||
| "read_cmd", | ||
| "learn_cmd", | ||
@@ -46,0 +50,0 @@ "check_cmd", |
| """Check command - Validate data quality. | ||
| This module implements the `truthound check` command for validating | ||
| data quality in files. | ||
| This module implements the ``truthound check`` command for validating | ||
| data quality in files and database tables. | ||
| """ | ||
@@ -14,2 +14,10 @@ | ||
| from truthound.cli_modules.common.datasource import ( | ||
| ConnectionOpt, | ||
| QueryOpt, | ||
| SourceConfigOpt, | ||
| SourceNameOpt, | ||
| TableOpt, | ||
| resolve_datasource, | ||
| ) | ||
| from truthound.cli_modules.common.errors import error_boundary, require_file | ||
@@ -22,5 +30,14 @@ from truthound.cli_modules.common.options import parse_list_callback | ||
| file: Annotated[ | ||
| Path, | ||
| typer.Argument(help="Path to the data file"), | ||
| ], | ||
| Optional[Path], | ||
| typer.Argument( | ||
| help="Path to the data file (CSV, JSON, Parquet, NDJSON)", | ||
| ), | ||
| ] = None, | ||
| # -- DataSource Options -- | ||
| connection: ConnectionOpt = None, | ||
| table: TableOpt = None, | ||
| query: QueryOpt = None, | ||
| source_config: SourceConfigOpt = None, | ||
| source_name: SourceNameOpt = None, | ||
| # -- Validator Options -- | ||
| validators: Annotated[ | ||
@@ -74,2 +91,23 @@ Optional[list[str]], | ||
| ] = 1000, | ||
| partial_unexpected_count: Annotated[ | ||
| int, | ||
| typer.Option( | ||
| "--partial-unexpected-count", | ||
| help="Maximum number of unexpected values in partial list (BASIC+)", | ||
| ), | ||
| ] = 20, | ||
| include_unexpected_index: Annotated[ | ||
| bool, | ||
| typer.Option( | ||
| "--include-unexpected-index", | ||
| help="Include row index for each unexpected value in results", | ||
| ), | ||
| ] = False, | ||
| return_debug_query: Annotated[ | ||
| bool, | ||
| typer.Option( | ||
| "--return-debug-query", | ||
| help="Include Polars debug query expression in results (COMPLETE level)", | ||
| ), | ||
| ] = False, | ||
| catch_exceptions: Annotated[ | ||
@@ -109,7 +147,48 @@ bool, | ||
| ] = None, | ||
| # -- Execution Options -- | ||
| parallel: Annotated[ | ||
| bool, | ||
| typer.Option( | ||
| "--parallel/--no-parallel", | ||
| help=( | ||
| "Enable DAG-based parallel execution. " | ||
| "Validators are grouped by dependency level and executed concurrently." | ||
| ), | ||
| ), | ||
| ] = False, | ||
| max_workers: Annotated[ | ||
| Optional[int], | ||
| typer.Option( | ||
| "--max-workers", | ||
| help=( | ||
| "Maximum worker threads for parallel execution. " | ||
| "Only effective with --parallel. Defaults to min(32, cpu_count + 4)." | ||
| ), | ||
| min=1, | ||
| ), | ||
| ] = None, | ||
| pushdown: Annotated[ | ||
| Optional[bool], | ||
| typer.Option( | ||
| "--pushdown/--no-pushdown", | ||
| help=( | ||
| "Enable query pushdown for SQL data sources. " | ||
| "Validation logic is executed server-side when possible. " | ||
| "Default: auto-detect based on data source type." | ||
| ), | ||
| ), | ||
| ] = None, | ||
| use_engine: Annotated[ | ||
| bool, | ||
| typer.Option( | ||
| "--use-engine/--no-use-engine", | ||
| help="Use execution engine for validation (experimental).", | ||
| ), | ||
| ] = False, | ||
| ) -> None: | ||
| """Validate data quality in a file. | ||
| """Validate data quality in a file or database table. | ||
| This command runs data quality validators on the specified file and | ||
| reports any issues found. | ||
| This command runs data quality validators on the specified data | ||
| and reports any issues found. Supports file paths, database | ||
| connections, and source config files. | ||
@@ -123,9 +202,7 @@ Examples: | ||
| truthound check data.csv --result-format complete | ||
| truthound check data.csv --rf boolean_only | ||
| truthound check data.csv --no-catch-exceptions | ||
| truthound check data.csv --max-retries 3 | ||
| truthound check data.csv --show-exceptions --format json | ||
| truthound check --connection "postgresql://user:pass@host/db" --table users | ||
| truthound check --conn "sqlite:///data.db" --table orders --pushdown | ||
| truthound check --source-config db.yaml --strict | ||
| truthound check data.csv --parallel --max-workers 8 | ||
| truthound check data.csv --exclude-columns first_name,last_name | ||
| truthound check data.csv --validator-config '{"unique": {"exclude_columns": ["first_name"]}}' | ||
| truthound check data.csv --validator-config config.json | ||
| """ | ||
@@ -135,4 +212,12 @@ from truthound.api import check | ||
| # Validate files exist | ||
| require_file(file) | ||
| # Resolve data source | ||
| data_path, source = resolve_datasource( | ||
| file=file, | ||
| connection=connection, | ||
| table=table, | ||
| query=query, | ||
| source_config=source_config, | ||
| source_name=source_name, | ||
| ) | ||
| if schema_file: | ||
@@ -192,24 +277,45 @@ require_file(schema_file, "Schema file") | ||
| # Build result_format config | ||
| rf_config: str | ResultFormatConfig = result_format | ||
| if include_unexpected_rows or max_unexpected_rows != 1000: | ||
| # Build result_format config — include all fine-grained parameters | ||
| has_custom_rf = ( | ||
| include_unexpected_rows | ||
| or max_unexpected_rows != 1000 | ||
| or partial_unexpected_count != 20 | ||
| or include_unexpected_index | ||
| or return_debug_query | ||
| ) | ||
| rf_config: str | ResultFormatConfig | ||
| if has_custom_rf: | ||
| rf_config = ResultFormatConfig( | ||
| format=ResultFormat.from_string(result_format), | ||
| partial_unexpected_count=partial_unexpected_count, | ||
| include_unexpected_rows=include_unexpected_rows, | ||
| max_unexpected_rows=max_unexpected_rows, | ||
| include_unexpected_index=include_unexpected_index, | ||
| return_debug_query=return_debug_query, | ||
| ) | ||
| else: | ||
| rf_config = result_format | ||
| # Build API call kwargs | ||
| check_kwargs: dict[str, Any] = { | ||
| "validators": validator_list, | ||
| "validator_config": v_config, | ||
| "min_severity": min_severity, | ||
| "schema": schema_file, | ||
| "auto_schema": auto_schema, | ||
| "result_format": rf_config, | ||
| "catch_exceptions": catch_exceptions, | ||
| "max_retries": max_retries, | ||
| "exclude_columns": exclude_cols, | ||
| "parallel": parallel, | ||
| "max_workers": max_workers, | ||
| "pushdown": pushdown, | ||
| "use_engine": use_engine, | ||
| } | ||
| try: | ||
| report = check( | ||
| str(file), | ||
| validators=validator_list, | ||
| validator_config=v_config, | ||
| min_severity=min_severity, | ||
| schema=schema_file, | ||
| auto_schema=auto_schema, | ||
| result_format=rf_config, | ||
| catch_exceptions=catch_exceptions, | ||
| max_retries=max_retries, | ||
| exclude_columns=exclude_cols, | ||
| ) | ||
| if source is not None: | ||
| report = check(source=source, **check_kwargs) | ||
| else: | ||
| report = check(data_path, **check_kwargs) | ||
| except Exception as e: | ||
@@ -219,2 +325,5 @@ typer.echo(f"Error: {e}", err=True) | ||
| # Determine label for HTML report title | ||
| report_label = source_name or (source.name if source else str(file)) | ||
| # Output the report | ||
@@ -236,3 +345,3 @@ if format == "json": | ||
| html = generate_html_report(report, title=f"Validation Report: {file.name}") | ||
| html = generate_html_report(report, title=f"Validation Report: {report_label}") | ||
| output.write_text(html, encoding="utf-8") | ||
@@ -239,0 +348,0 @@ typer.echo(f"HTML report written to {output}") |
| """Compare command - Compare datasets for drift. | ||
| This module implements the `truthound compare` command for detecting | ||
| data drift between two datasets. | ||
| This module implements the ``truthound compare`` command for detecting | ||
| data drift between two datasets from files or database tables. | ||
| """ | ||
@@ -14,3 +14,7 @@ | ||
| from truthound.cli_modules.common.errors import error_boundary, require_file | ||
| from truthound.cli_modules.common.datasource import ( | ||
| SourceConfigOpt, | ||
| resolve_compare_sources, | ||
| ) | ||
| from truthound.cli_modules.common.errors import error_boundary | ||
| from truthound.cli_modules.common.options import parse_list_callback | ||
@@ -22,9 +26,16 @@ | ||
| baseline: Annotated[ | ||
| Path, | ||
| typer.Argument(help="Baseline (reference) data file"), | ||
| ], | ||
| Optional[Path], | ||
| typer.Argument( | ||
| help="Baseline (reference) data file", | ||
| ), | ||
| ] = None, | ||
| current: Annotated[ | ||
| Path, | ||
| typer.Argument(help="Current data file to compare"), | ||
| ], | ||
| Optional[Path], | ||
| typer.Argument( | ||
| help="Current data file to compare", | ||
| ), | ||
| ] = None, | ||
| # -- DataSource Config (for database-to-database comparison) -- | ||
| source_config: SourceConfigOpt = None, | ||
| # -- Compare Options -- | ||
| columns: Annotated[ | ||
@@ -36,3 +47,10 @@ Optional[list[str]], | ||
| str, | ||
| typer.Option("--method", "-m", help="Detection method (auto, ks, psi, chi2, js)"), | ||
| typer.Option( | ||
| "--method", | ||
| "-m", | ||
| help=( | ||
| "Detection method: auto, ks, psi, chi2, js, kl, wasserstein, " | ||
| "cvm, anderson, hellinger, bhattacharyya, tv, energy, mmd" | ||
| ), | ||
| ), | ||
| ] = "auto", | ||
@@ -43,2 +61,11 @@ threshold: Annotated[ | ||
| ] = None, | ||
| sample_size: Annotated[ | ||
| Optional[int], | ||
| typer.Option( | ||
| "--sample-size", | ||
| "--sample", | ||
| help="Sample size for large datasets. Uses random sampling for faster comparison.", | ||
| min=1, | ||
| ), | ||
| ] = None, | ||
| format: Annotated[ | ||
@@ -60,11 +87,29 @@ str, | ||
| This command compares a baseline dataset with a current dataset and | ||
| detects statistical drift in column distributions. | ||
| detects statistical drift in column distributions. Supports file | ||
| paths or a --source-config for database-to-database comparison. | ||
| Detection Methods: | ||
| - auto: Automatically select best method per column | ||
| - auto: Automatically select best method per column (recommended) | ||
| - ks: Kolmogorov-Smirnov test (numeric) | ||
| - psi: Population Stability Index | ||
| - psi: Population Stability Index (ML monitoring) | ||
| - chi2: Chi-squared test (categorical) | ||
| - js: Jensen-Shannon divergence | ||
| - js: Jensen-Shannon divergence (any type) | ||
| - kl: Kullback-Leibler divergence (numeric) | ||
| - wasserstein: Earth Mover's distance (numeric) | ||
| - cvm: Cramer-von Mises test (numeric, tail-sensitive) | ||
| - anderson: Anderson-Darling test (numeric, extreme values) | ||
| - hellinger: Hellinger distance (bounded metric) | ||
| - bhattacharyya: Bhattacharyya distance (classification bounds) | ||
| - tv: Total Variation distance (max probability diff) | ||
| - energy: Energy distance (location/scale) | ||
| - mmd: Maximum Mean Discrepancy (high-dimensional) | ||
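Among the methods listed above, PSI is simple enough to sketch standalone. The following is an illustrative implementation using only NumPy, not truthound's actual `compare` internals; note that in this sketch, current values falling outside the baseline's range are dropped by `np.histogram`:

```python
import numpy as np

def psi(baseline: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between two numeric samples (illustrative sketch)."""
    # Bin edges are derived from the baseline distribution
    edges = np.histogram_bin_edges(baseline, bins=bins)
    b_frac = np.histogram(baseline, bins=edges)[0] / len(baseline)
    c_frac = np.histogram(current, bins=edges)[0] / len(current)
    # Floor the fractions to avoid log(0) and division by zero
    b_frac = np.clip(b_frac, 1e-6, None)
    c_frac = np.clip(c_frac, 1e-6, None)
    return float(np.sum((c_frac - b_frac) * np.log(c_frac / b_frac)))
```

A common rule of thumb treats PSI below 0.1 as stable and above 0.25 as significant drift.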
| Source Config Format (YAML): | ||
| baseline: | ||
| connection: "postgresql://user:pass@host/db" | ||
| table: train_data | ||
| current: | ||
| connection: "postgresql://user:pass@host/db" | ||
| table: production_data | ||
| Examples: | ||
@@ -75,8 +120,15 @@ truthound compare baseline.csv current.csv | ||
| truthound compare old.csv new.csv --columns price,quantity | ||
| truthound compare --source-config drift_config.yaml --method ks | ||
| truthound compare big_train.csv big_prod.csv --sample-size 10000 | ||
| """ | ||
| from truthound.drift import compare | ||
| # Validate files exist | ||
| require_file(baseline, "Baseline file") | ||
| require_file(current, "Current file") | ||
| # Resolve both data sources | ||
| (baseline_path, baseline_source), (current_path, current_source) = ( | ||
| resolve_compare_sources( | ||
| baseline=baseline, | ||
| current=current, | ||
| source_config=source_config, | ||
| ) | ||
| ) | ||
@@ -86,9 +138,22 @@ # Parse columns if provided | ||
| # Determine inputs for the compare API | ||
| baseline_input = ( | ||
| baseline_source.to_polars_lazyframe().collect() | ||
| if baseline_source | ||
| else baseline_path | ||
| ) | ||
| current_input = ( | ||
| current_source.to_polars_lazyframe().collect() | ||
| if current_source | ||
| else current_path | ||
| ) | ||
| try: | ||
| drift_report = compare( | ||
| str(baseline), | ||
| str(current), | ||
| baseline_input, | ||
| current_input, | ||
| columns=column_list, | ||
| method=method, | ||
| threshold=threshold, | ||
| sample_size=sample_size, | ||
| ) | ||
@@ -95,0 +160,0 @@ except Exception as e: |
@@ -1,5 +0,5 @@ | ||
| """Learn command - Learn schema from data files. | ||
| """Learn command - Learn schema from data. | ||
| This module implements the `truthound learn` command for inferring | ||
| schema from data files. | ||
| This module implements the ``truthound learn`` command for inferring | ||
| schema from data files and database tables. | ||
| """ | ||
@@ -10,7 +10,15 @@ | ||
| from pathlib import Path | ||
| from typing import Annotated | ||
| from typing import Annotated, Optional | ||
| import typer | ||
| from truthound.cli_modules.common.errors import error_boundary, require_file | ||
| from truthound.cli_modules.common.datasource import ( | ||
| ConnectionOpt, | ||
| QueryOpt, | ||
| SourceConfigOpt, | ||
| SourceNameOpt, | ||
| TableOpt, | ||
| resolve_datasource, | ||
| ) | ||
| from truthound.cli_modules.common.errors import error_boundary | ||
@@ -21,5 +29,14 @@ | ||
| file: Annotated[ | ||
| Path, | ||
| typer.Argument(help="Path to the data file to learn from"), | ||
| ], | ||
| Optional[Path], | ||
| typer.Argument( | ||
| help="Path to the data file to learn from", | ||
| ), | ||
| ] = None, | ||
| # -- DataSource Options -- | ||
| connection: ConnectionOpt = None, | ||
| table: TableOpt = None, | ||
| query: QueryOpt = None, | ||
| source_config: SourceConfigOpt = None, | ||
| source_name: SourceNameOpt = None, | ||
| # -- Schema Options -- | ||
| output: Annotated[ | ||
@@ -33,6 +50,17 @@ Path, | ||
| ] = False, | ||
| categorical_threshold: Annotated[ | ||
| int, | ||
| typer.Option( | ||
| "--categorical-threshold", | ||
| help=( | ||
| "Maximum unique values to treat a column as categorical " | ||
| "during schema inference (default: 20)" | ||
| ), | ||
| min=1, | ||
| ), | ||
| ] = 20, | ||
| ) -> None: | ||
| """Learn schema from a data file. | ||
| """Learn schema from a data file or database table. | ||
| This command analyzes the data file and generates a schema definition | ||
| This command analyzes the data and generates a schema definition | ||
| that captures column types, constraints, and patterns. | ||
@@ -44,10 +72,31 @@ | ||
| truthound learn data.csv --no-constraints | ||
| truthound learn data.csv --categorical-threshold 50 | ||
| truthound learn --connection "postgresql://user:pass@host/db" --table users | ||
| truthound learn --source-config db.yaml -o db_schema.yaml | ||
| """ | ||
| from truthound.schema import learn | ||
| # Validate file exists | ||
| require_file(file) | ||
| # Resolve data source | ||
| data_path, source = resolve_datasource( | ||
| file=file, | ||
| connection=connection, | ||
| table=table, | ||
| query=query, | ||
| source_config=source_config, | ||
| source_name=source_name, | ||
| ) | ||
| try: | ||
| schema = learn(str(file), infer_constraints=not no_constraints) | ||
| if source is not None: | ||
| schema = learn( | ||
| source=source, | ||
| infer_constraints=not no_constraints, | ||
| categorical_threshold=categorical_threshold, | ||
| ) | ||
| else: | ||
| schema = learn( | ||
| data_path, | ||
| infer_constraints=not no_constraints, | ||
| categorical_threshold=categorical_threshold, | ||
| ) | ||
| schema.save(output) | ||
@@ -54,0 +103,0 @@ |
| """Mask command - Mask sensitive data. | ||
| This module implements the `truthound mask` command for masking | ||
| sensitive data in files. | ||
| This module implements the ``truthound mask`` command for masking | ||
| sensitive data in files and database tables. | ||
| """ | ||
@@ -14,3 +14,11 @@ | ||
| from truthound.cli_modules.common.errors import error_boundary, require_file | ||
| from truthound.cli_modules.common.datasource import ( | ||
| ConnectionOpt, | ||
| QueryOpt, | ||
| SourceConfigOpt, | ||
| SourceNameOpt, | ||
| TableOpt, | ||
| resolve_datasource, | ||
| ) | ||
| from truthound.cli_modules.common.errors import error_boundary | ||
| from truthound.cli_modules.common.options import parse_list_callback | ||
@@ -22,9 +30,18 @@ | ||
| file: Annotated[ | ||
| Path, | ||
| typer.Argument(help="Path to the data file"), | ||
| ], | ||
| Optional[Path], | ||
| typer.Argument( | ||
| help="Path to the data file (CSV, JSON, Parquet, NDJSON)", | ||
| ), | ||
| ] = None, | ||
| # -- DataSource Options -- | ||
| connection: ConnectionOpt = None, | ||
| table: TableOpt = None, | ||
| query: QueryOpt = None, | ||
| source_config: SourceConfigOpt = None, | ||
| source_name: SourceNameOpt = None, | ||
| # -- Mask Options -- | ||
| output: Annotated[ | ||
| Path, | ||
| typer.Option("--output", "-o", help="Output file path"), | ||
| ], | ||
| ] = ..., | ||
| columns: Annotated[ | ||
@@ -46,5 +63,5 @@ Optional[list[str]], | ||
| ) -> None: | ||
| """Mask sensitive data in a file. | ||
| """Mask sensitive data in a file or database table. | ||
| This command creates a copy of the data file with sensitive columns | ||
| This command creates a copy of the data with sensitive columns | ||
| masked using the specified strategy. | ||
@@ -61,3 +78,4 @@ | ||
| truthound mask data.csv -o masked.csv --strategy hash | ||
| truthound mask data.csv -o masked.csv --columns email --strict | ||
| truthound mask --connection "postgresql://user:pass@host/db" --table users -o masked.csv | ||
| truthound mask --source-config db.yaml -o masked.parquet | ||
| """ | ||
@@ -68,4 +86,11 @@ import warnings | ||
| # Validate file exists | ||
| require_file(file) | ||
| # Resolve data source | ||
| data_path, source = resolve_datasource( | ||
| file=file, | ||
| connection=connection, | ||
| table=table, | ||
| query=query, | ||
| source_config=source_config, | ||
| source_name=source_name, | ||
| ) | ||
@@ -79,3 +104,6 @@ # Parse columns if provided | ||
| warnings.simplefilter("always", MaskingWarning) | ||
| masked_df = mask(str(file), columns=column_list, strategy=strategy, strict=strict) | ||
| if source is not None: | ||
| masked_df = mask(source=source, columns=column_list, strategy=strategy, strict=strict) | ||
| else: | ||
| masked_df = mask(data_path, columns=column_list, strategy=strategy, strict=strict) | ||
@@ -82,0 +110,0 @@ # Display any warnings |
| """Profile command - Generate data profiles. | ||
| This module implements the `truthound profile` command for generating | ||
| statistical profiles of data files. | ||
| This module implements the ``truthound profile`` command for generating | ||
| statistical profiles of data files and database tables. | ||
| """ | ||
@@ -14,3 +14,11 @@ | ||
| from truthound.cli_modules.common.errors import error_boundary, require_file | ||
| from truthound.cli_modules.common.datasource import ( | ||
| ConnectionOpt, | ||
| QueryOpt, | ||
| SourceConfigOpt, | ||
| SourceNameOpt, | ||
| TableOpt, | ||
| resolve_datasource, | ||
| ) | ||
| from truthound.cli_modules.common.errors import error_boundary | ||
@@ -21,5 +29,14 @@ | ||
| file: Annotated[ | ||
| Path, | ||
| typer.Argument(help="Path to the data file"), | ||
| ], | ||
| Optional[Path], | ||
| typer.Argument( | ||
| help="Path to the data file (CSV, JSON, Parquet, NDJSON)", | ||
| ), | ||
| ] = None, | ||
| # -- DataSource Options -- | ||
| connection: ConnectionOpt = None, | ||
| table: TableOpt = None, | ||
| query: QueryOpt = None, | ||
| source_config: SourceConfigOpt = None, | ||
| source_name: SourceNameOpt = None, | ||
| # -- Output Options -- | ||
| format: Annotated[ | ||
@@ -36,3 +53,3 @@ str, | ||
| This command analyzes the data file and generates statistics including: | ||
| This command analyzes the data and generates statistics including: | ||
| - Row and column counts | ||
@@ -48,10 +65,22 @@ - Null ratios per column | ||
| truthound profile data.csv -o profile.json | ||
| truthound profile --connection "postgresql://user:pass@host/db" --table users | ||
| truthound profile --source-config db.yaml --format json | ||
| """ | ||
| from truthound.api import profile | ||
| # Validate file exists | ||
| require_file(file) | ||
| # Resolve data source | ||
| data_path, source = resolve_datasource( | ||
| file=file, | ||
| connection=connection, | ||
| table=table, | ||
| query=query, | ||
| source_config=source_config, | ||
| source_name=source_name, | ||
| ) | ||
| try: | ||
| profile_report = profile(str(file)) | ||
| if source is not None: | ||
| profile_report = profile(source=source) | ||
| else: | ||
| profile_report = profile(data_path) | ||
| except Exception as e: | ||
@@ -58,0 +87,0 @@ typer.echo(f"Error: {e}", err=True) |
| """Scan command - Scan for PII. | ||
| This module implements the `truthound scan` command for detecting | ||
| personally identifiable information in data files. | ||
| This module implements the ``truthound scan`` command for detecting | ||
| personally identifiable information in data files and database tables. | ||
| """ | ||
@@ -14,3 +14,11 @@ | ||
| from truthound.cli_modules.common.errors import error_boundary, require_file | ||
| from truthound.cli_modules.common.datasource import ( | ||
| ConnectionOpt, | ||
| QueryOpt, | ||
| SourceConfigOpt, | ||
| SourceNameOpt, | ||
| TableOpt, | ||
| resolve_datasource, | ||
| ) | ||
| from truthound.cli_modules.common.errors import error_boundary | ||
@@ -21,5 +29,14 @@ | ||
| file: Annotated[ | ||
| Path, | ||
| typer.Argument(help="Path to the data file"), | ||
| ], | ||
| Optional[Path], | ||
| typer.Argument( | ||
| help="Path to the data file (CSV, JSON, Parquet, NDJSON)", | ||
| ), | ||
| ] = None, | ||
| # -- DataSource Options -- | ||
| connection: ConnectionOpt = None, | ||
| table: TableOpt = None, | ||
| query: QueryOpt = None, | ||
| source_config: SourceConfigOpt = None, | ||
| source_name: SourceNameOpt = None, | ||
| # -- Output Options -- | ||
| format: Annotated[ | ||
@@ -36,3 +53,3 @@ str, | ||
| This command analyzes data files to detect columns that may contain | ||
| This command analyzes data to detect columns that may contain | ||
| PII such as names, emails, phone numbers, SSNs, etc. | ||
@@ -44,11 +61,22 @@ | ||
| truthound scan data.csv -o pii_report.json | ||
| truthound scan data.csv --format html -o pii_report.html | ||
| truthound scan --connection "postgresql://user:pass@host/db" --table users | ||
| truthound scan --source-config db.yaml --format json | ||
| """ | ||
| from truthound.api import scan | ||
| # Validate file exists | ||
| require_file(file) | ||
| # Resolve data source | ||
| data_path, source = resolve_datasource( | ||
| file=file, | ||
| connection=connection, | ||
| table=table, | ||
| query=query, | ||
| source_config=source_config, | ||
| source_name=source_name, | ||
| ) | ||
| try: | ||
| pii_report = scan(str(file)) | ||
| if source is not None: | ||
| pii_report = scan(source=source) | ||
| else: | ||
| pii_report = scan(data_path) | ||
| except Exception as e: | ||
@@ -73,4 +101,5 @@ typer.echo(f"Error: {e}", err=True) | ||
| report_label = source_name or (source.name if source else str(file)) | ||
| html = generate_pii_html_report( | ||
| pii_report, title=f"PII Scan Report: {file.name}" | ||
| pii_report, title=f"PII Scan Report: {report_label}" | ||
| ) | ||
@@ -77,0 +106,0 @@ output.write_text(html, encoding="utf-8") |
@@ -190,3 +190,13 @@ """Factory functions for creating data sources. | ||
| if isinstance(data, str): | ||
| if data.startswith(("postgresql://", "postgres://")): | ||
| sql_prefixes = ( | ||
| "postgresql://", "postgres://", "mysql://", | ||
| "sqlite:", "duckdb:", "mssql://", "sqlserver://", | ||
| ) | ||
| sql_suffixes = (".db", ".duckdb") | ||
| is_sql = ( | ||
| data.startswith(sql_prefixes) | ||
| or data.endswith(sql_suffixes) | ||
| or "redshift.amazonaws.com" in data | ||
| ) | ||
| if is_sql: | ||
| table = kwargs.pop("table", None) | ||
@@ -197,18 +207,4 @@ if not table: | ||
| ) | ||
| from truthound.datasources.sql import PostgreSQLDataSource | ||
| return PostgreSQLDataSource.from_connection_string( | ||
| data, table=table, **kwargs | ||
| ) | ||
| return get_sql_datasource(data, table=table, **kwargs) | ||
| if data.startswith("mysql://"): | ||
| table = kwargs.pop("table", None) | ||
| if not table: | ||
| raise DataSourceError( | ||
| "SQL connection string requires 'table' parameter" | ||
| ) | ||
| from truthound.datasources.sql import MySQLDataSource | ||
| return MySQLDataSource.from_connection_string( | ||
| data, table=table, **kwargs | ||
| ) | ||
| # File doesn't exist | ||
@@ -262,2 +258,11 @@ if not path.exists(): | ||
| # SQLite: URI format (sqlite:///path) or file path (.db) | ||
| if connection_string.startswith("sqlite:"): | ||
| # sqlite:///path/to/db or sqlite:///:memory: | ||
| db_path = connection_string.replace("sqlite:///", "").replace("sqlite://", "") | ||
| if not db_path: | ||
| db_path = ":memory:" | ||
| from truthound.datasources.sql import SQLiteDataSource | ||
| return SQLiteDataSource(table=table, database=db_path, **kwargs) | ||
| if connection_string.endswith(".db") or connection_string == ":memory:": | ||
@@ -267,2 +272,24 @@ from truthound.datasources.sql import SQLiteDataSource | ||
| # DuckDB: URI format (duckdb:///path) or file suffix (.duckdb) | ||
| if connection_string.startswith("duckdb:") or connection_string.endswith(".duckdb"): | ||
| try: | ||
| from truthound.datasources.sql import DuckDBDataSource | ||
| except ImportError as e: | ||
| raise DataSourceError( | ||
| "DuckDB support requires duckdb. " | ||
| "Install with: pip install duckdb" | ||
| ) from e | ||
| if DuckDBDataSource is None: | ||
| raise DataSourceError( | ||
| "DuckDB support requires duckdb. " | ||
| "Install with: pip install duckdb" | ||
| ) | ||
| if connection_string.startswith("duckdb:"): | ||
| db_path = connection_string.replace("duckdb:///", "").replace("duckdb://", "") | ||
| if not db_path: | ||
| db_path = ":memory:" | ||
| else: | ||
| db_path = connection_string | ||
| return DuckDBDataSource(table=table, database=db_path, **kwargs) | ||
| # Oracle | ||
@@ -312,3 +339,4 @@ if connection_string.startswith("oracle://") or "oracle" in connection_string.lower(): | ||
| f"Unsupported SQL connection string format: {connection_string}. " | ||
| "Supported: postgresql://, mysql://, mssql://, SQLite file path. " | ||
| "Supported: postgresql://, mysql://, sqlite:///path, duckdb:///path, " | ||
| "mssql://, sqlserver://, *.db, *.duckdb. " | ||
| "For BigQuery, Snowflake, Redshift, Databricks, use their specific classes." | ||
@@ -351,4 +379,8 @@ ) | ||
| return "mysql" | ||
| if data.endswith(".db") or data == ":memory:": | ||
| if data.startswith("sqlite:") or data.endswith(".db") or data == ":memory:": | ||
| return "sqlite" | ||
| if data.startswith("duckdb:") or data.endswith(".duckdb"): | ||
| return "duckdb" | ||
| if data.startswith(("mssql://", "sqlserver://")): | ||
| return "sqlserver" | ||
| return "unknown" | ||
@@ -440,1 +472,78 @@ | ||
| return DictDataSource(data) | ||
| def get_datasource_from_config(config: dict[str, Any]) -> DataSourceProtocol: | ||
| """Create a DataSource from a configuration dictionary. | ||
| Convenience function for creating data sources from parsed | ||
| configuration files (JSON/YAML). Delegates to ``get_sql_datasource()`` | ||
| for connection-string-based configs or constructs backend-specific | ||
| classes from individual parameters. | ||
| Config styles supported: | ||
| Connection string:: | ||
| {"connection": "postgresql://user:pass@host/db", "table": "users"} | ||
| Individual parameters:: | ||
| {"type": "postgresql", "host": "localhost", "database": "mydb", | ||
| "user": "postgres", "password": "...", "table": "users"} | ||
| Args: | ||
| config: Configuration dictionary with connection details. | ||
| Returns: | ||
| Configured DataSource instance. | ||
| Raises: | ||
| DataSourceError: If the config is invalid or backend unavailable. | ||
| """ | ||
| connection = config.get("connection") | ||
| table = config.get("table") | ||
| query = config.get("query") | ||
| source_type = config.get("type") | ||
| # Style 1: Connection string | ||
| if connection: | ||
| if not table and not query: | ||
| raise DataSourceError( | ||
| "Config with 'connection' requires 'table' or 'query'." | ||
| ) | ||
| return get_sql_datasource( | ||
| connection, table=table or "__query__", query=query | ||
| ) | ||
| # Style 2: Individual parameters | ||
| if not source_type: | ||
| raise DataSourceError( | ||
| "Config must have either 'connection' or 'type' field." | ||
| ) | ||
| if not table and not query: | ||
| raise DataSourceError( | ||
| f"Config for type '{source_type}' requires 'table' or 'query'." | ||
| ) | ||
| from truthound.datasources.sql import get_available_sources | ||
| available = get_available_sources() | ||
| source_cls = available.get(source_type) | ||
| if source_cls is None: | ||
| available_names = [k for k, v in available.items() if v is not None] | ||
| raise DataSourceError( | ||
| f"Data source type '{source_type}' is not available. " | ||
| f"Available: {', '.join(available_names)}." | ||
| ) | ||
| # Build kwargs (exclude meta keys) | ||
| meta_keys = {"type", "table", "query", "name"} | ||
| kwargs: dict[str, Any] = {"table": table} if table else {} | ||
| if query: | ||
| kwargs["query"] = query | ||
| for key, value in config.items(): | ||
| if key not in meta_keys: | ||
| kwargs[key] = value | ||
| return source_cls(**kwargs) |
@@ -183,2 +183,49 @@ """Tests for check command --exclude-columns and --validator-config options.""" | ||
| class TestCheckDatasourceOptions: | ||
| """Tests for --connection, --table, and --source-config on check.""" | ||
| def test_check_with_connection_string(self, runner, app): | ||
| """--connection + --table passes source to API.""" | ||
| with ( | ||
| patch("truthound.datasources.factory.get_sql_datasource") as mock_sql, | ||
| patch("truthound.api.check") as mock_check, | ||
| ): | ||
| import polars as pl | ||
| mock_source = MagicMock() | ||
| mock_source.name = "users" | ||
| mock_source.to_polars_lazyframe.return_value = pl.LazyFrame({"id": [1]}) | ||
| mock_sql.return_value = mock_source | ||
| mock_report = MagicMock() | ||
| mock_report.has_issues = False | ||
| mock_report.exception_summary = None | ||
| mock_check.return_value = mock_report | ||
| result = runner.invoke(app, [ | ||
| "--connection", "postgresql://user:pass@host/db", | ||
| "--table", "users", | ||
| ]) | ||
| assert result.exit_code == 0 | ||
| # check() should be called with source= keyword | ||
| call_kwargs = mock_check.call_args | ||
| assert call_kwargs.kwargs.get("source") is not None | ||
| def test_check_file_and_connection_mutually_exclusive(self, runner, app, sample_csv): | ||
| """Both file and --connection raises error.""" | ||
| result = runner.invoke(app, [ | ||
| str(sample_csv), | ||
| "--connection", "postgresql://host/db", | ||
| "--table", "t", | ||
| ]) | ||
| assert result.exit_code != 0 | ||
| def test_check_connection_without_table_error(self, runner, app): | ||
| """--connection without --table raises error.""" | ||
| result = runner.invoke(app, [ | ||
| "--connection", "postgresql://host/db", | ||
| ]) | ||
| assert result.exit_code != 0 | ||
| class TestCombinedOptions: | ||
@@ -185,0 +232,0 @@ """Tests for combined --exclude-columns and --validator-config.""" |
| [build-system] | ||
| requires = ["hatchling"] | ||
| build-backend = "hatchling.build" | ||
| [project] | ||
| name = "truthound" | ||
| version = "1.3.2" | ||
| description = "Zero-Configuration Data Quality Framework Powered by Polars" | ||
| readme = "README.md" | ||
| license = "Apache-2.0" | ||
| requires-python = ">=3.11" | ||
| authors = [ | ||
| { name = "seadonggyun4", email = "seadonggyun4@gmail.com" } | ||
| ] | ||
| keywords = [ | ||
| "data-quality", | ||
| "data-validation", | ||
| "polars", | ||
| "pii-detection", | ||
| "data-masking", | ||
| ] | ||
| classifiers = [ | ||
| "Development Status :: 3 - Alpha", | ||
| "Intended Audience :: Developers", | ||
| "Intended Audience :: Science/Research", | ||
| "License :: OSI Approved :: Apache Software License", | ||
| "Operating System :: OS Independent", | ||
| "Programming Language :: Python :: 3", | ||
| "Programming Language :: Python :: 3.11", | ||
| "Programming Language :: Python :: 3.12", | ||
| "Topic :: Scientific/Engineering", | ||
| "Topic :: Software Development :: Libraries :: Python Modules", | ||
| "Typing :: Typed", | ||
| ] | ||
| dependencies = [ | ||
| "polars>=1.0.0", | ||
| "pyyaml>=6.0.0", | ||
| "rich>=13.0.0", | ||
| "typer>=0.12.0", | ||
| ] | ||
| [project.optional-dependencies] | ||
| # Report generation | ||
| reports = [ | ||
| "jinja2>=3.0.0", | ||
| ] | ||
| # Statistical drift detection | ||
| drift = [ | ||
| "scipy>=1.10.0", | ||
| ] | ||
| # Anomaly detection with ML | ||
| anomaly = [ | ||
| "scipy>=1.10.0", | ||
| "scikit-learn>=1.3.0", | ||
| ] | ||
| # Cloud storage backends | ||
| s3 = [ | ||
| "boto3>=1.26.0", | ||
| ] | ||
| gcs = [ | ||
| "google-cloud-storage>=2.0.0", | ||
| ] | ||
| azure = [ | ||
| "azure-storage-blob>=12.0.0", | ||
| ] | ||
| # Database storage backend | ||
| database = [ | ||
| "sqlalchemy>=2.0.0", | ||
| ] | ||
| # All storage backends | ||
| stores = [ | ||
| "boto3>=1.26.0", | ||
| "google-cloud-storage>=2.0.0", | ||
| "azure-storage-blob>=12.0.0", | ||
| "sqlalchemy>=2.0.0", | ||
| ] | ||
| # DuckDB database | ||
| duckdb = [ | ||
| "duckdb>=1.0.0", | ||
| ] | ||
| # NoSQL databases | ||
| mongodb = [ | ||
| "motor>=3.0.0", | ||
| ] | ||
| elasticsearch = [ | ||
| "elasticsearch[async]>=8.0.0", | ||
| ] | ||
| nosql = [ | ||
| "motor>=3.0.0", | ||
| "elasticsearch[async]>=8.0.0", | ||
| ] | ||
| # Streaming platforms | ||
| kafka = [ | ||
| "aiokafka>=0.9.0", | ||
| ] | ||
| streaming = [ | ||
| "aiokafka>=0.9.0", | ||
| ] | ||
| # All async datasources | ||
| async-datasources = [ | ||
| "motor>=3.0.0", | ||
| "elasticsearch[async]>=8.0.0", | ||
| "aiokafka>=0.9.0", | ||
| ] | ||
| # Interactive dashboard (Phase 8) | ||
| dashboard = [ | ||
| "reflex>=0.4.0", | ||
| ] | ||
| # PDF export support | ||
| pdf = [ | ||
| "weasyprint>=60.0", | ||
| ] | ||
| # Performance optimization | ||
| perf = [ | ||
| "xxhash>=3.4.0", | ||
| ] | ||
| # Full installation with all optional dependencies | ||
| all = [ | ||
| "jinja2>=3.0.0", | ||
| "pandas>=2.0.0", | ||
| "scipy>=1.10.0", | ||
| "scikit-learn>=1.3.0", | ||
| "boto3>=1.26.0", | ||
| "google-cloud-storage>=2.0.0", | ||
| "azure-storage-blob>=12.0.0", | ||
| "sqlalchemy>=2.0.0", | ||
| "duckdb>=1.0.0", | ||
| "reflex>=0.4.0", | ||
| "weasyprint>=60.0", | ||
| "motor>=3.0.0", | ||
| "elasticsearch[async]>=8.0.0", | ||
| "aiokafka>=0.9.0", | ||
| "xxhash>=3.4.0", | ||
| ] | ||
| # Development dependencies | ||
| dev = [ | ||
| "pytest>=8.0.0", | ||
| "pytest-cov>=4.0.0", | ||
| "pytest-asyncio>=0.23.0", | ||
| "ruff>=0.4.0", | ||
| "mypy>=1.10.0", | ||
| "pandas>=2.0.0", | ||
| "scipy>=1.10.0", | ||
| "scikit-learn>=1.3.0", | ||
| ] | ||
| [project.scripts] | ||
| truthound = "truthound.cli:app" | ||
| [project.urls] | ||
| Homepage = "https://github.com/seadonggyun4/Truthound" | ||
| Repository = "https://github.com/seadonggyun4/Truthound" | ||
| Issues = "https://github.com/seadonggyun4/Truthound/issues" | ||
| [tool.hatch.build.targets.wheel] | ||
| packages = ["src/truthound"] | ||
| [tool.hatch.envs.default] | ||
| dependencies = [ | ||
| "pytest>=8.0.0", | ||
| "pytest-cov>=4.0.0", | ||
| "ruff>=0.4.0", | ||
| "mypy>=1.10.0", | ||
| "pandas>=2.0.0", | ||
| ] | ||
| [tool.hatch.envs.default.scripts] | ||
| test = "pytest {args:tests}" | ||
| test-cov = "pytest --cov=truthound --cov-report=term-missing {args:tests}" | ||
| lint = "ruff check src tests" | ||
| format = "ruff format src tests" | ||
| typecheck = "mypy src" | ||
| [tool.ruff] | ||
| target-version = "py311" | ||
| line-length = 100 | ||
| src = ["src", "tests"] | ||
| [tool.ruff.lint] | ||
| select = [ | ||
| "E", # pycodestyle errors | ||
| "W", # pycodestyle warnings | ||
| "F", # Pyflakes | ||
| "I", # isort | ||
| "UP", # pyupgrade | ||
| "B", # flake8-bugbear | ||
| "SIM", # flake8-simplify | ||
| "TCH", # flake8-type-checking | ||
| ] | ||
| ignore = [ | ||
| "E501", # line too long (handled by formatter) | ||
| ] | ||
| [tool.ruff.lint.isort] | ||
| known-first-party = ["truthound"] | ||
| [tool.mypy] | ||
| python_version = "3.11" | ||
| strict = true | ||
| warn_return_any = true | ||
| warn_unused_ignores = true | ||
| [tool.pytest.ini_options] | ||
| testpaths = ["tests"] | ||
| pythonpath = ["src"] | ||
| markers = [ | ||
| "slow: marks tests as slow (deselect with '-m \"not slow\"')", | ||
| "e2e: marks tests as end-to-end tests", | ||
| "scale_100m: marks tests as 100M+ scale tests (run with '-m scale_100m')", | ||
| ] |