# truthound read

Read and preview data from files or database connections. Supports row/column selection, multiple output formats, and schema inspection.

## Synopsis

```bash
truthound read [FILE] [OPTIONS]
```

## Arguments

| Argument | Required | Description |
|----------|----------|-------------|
| `file` | No | Path to the data file (CSV, JSON, Parquet, NDJSON) |

## Data Source Options

| Option | Short | Default | Description |
|--------|-------|---------|-------------|
| `--connection` | `--conn` | None | Database connection string |
| `--table` | | None | Database table name |
| `--query` | | None | SQL query (alternative to `--table`) |
| `--source-config` | `--sc` | None | Path to data source config file (JSON/YAML) |
| `--source-name` | | None | Custom label for the data source |

## Selection Options

| Option | Short | Default | Description |
|--------|-------|---------|-------------|
| `--head` | `-n` | None | Show only the first N rows |
| `--sample` | `-s` | None | Random sample of N rows |
| `--columns` | `-c` | None | Columns to include (comma-separated) |

## Output Options

| Option | Short | Default | Description |
|--------|-------|---------|-------------|
| `--format` | `-f` | `table` | Output format (table, csv, json, parquet, ndjson) |
| `--output` | `-o` | None | Output file path |

## Inspection Options

| Option | Short | Default | Description |
|--------|-------|---------|-------------|
| `--schema-only` | | `false` | Show only column names and types |
| `--count-only` | | `false` | Show only the row count |

## Examples

### Basic Reading

```bash
truthound read data.csv
truthound read data.parquet --head 20
truthound read data.csv --columns id,name,age
```

### Database Reading

```bash
truthound read --connection "postgresql://user:pass@host/db" --table users
truthound read --connection "sqlite:///data.db" --table orders --head 10
truthound read --source-config db.yaml --sample 1000
```

### Schema Inspection

```bash
truthound read data.csv --schema-only
truthound read --connection "postgresql://host/db" --table users --schema-only
```

### Format Conversion

```bash
truthound read data.csv --format json -o output.json
truthound read data.csv --format parquet -o output.parquet
truthound read data.csv --format csv --head 100
```

### Row Count

```bash
truthound read data.csv --count-only
```

## Related Commands

- [`check`](check.md) - Validate data quality
- [`profile`](profile.md) - Generate data profile
- [`learn`](learn.md) - Learn schema from data

## See Also

- [Python API: th.read()](../../python-api/core-functions.md#thread)
- [Data Source Options](../../guides/datasources/cli-datasource-guide.md)
# CLI Data Source Guide

All Truthound CLI commands support reading data from databases and external sources in addition to local files. This guide covers the shared data source options available across all core commands.

## Overview

Truthound CLI commands accept data from three input modes:

1. **File mode** (default): Pass a file path as a positional argument
2. **Connection string mode**: Use `--connection` and `--table` (or `--query`) to connect to a database
3. **Source config mode**: Use `--source-config` to load connection details from a JSON or YAML file
These modes are mutually exclusive: specifying more than one (for example, a file argument together with `--connection`) raises an error.
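The exclusivity rule can be sketched in plain Python. `resolve_input` below is an illustrative helper, not part of the Truthound API:

```python
def resolve_input(file=None, connection=None, source_config=None):
    """Return which input mode is active; reject conflicting inputs."""
    modes = [
        name
        for name, value in [
            ("file argument", file),
            ("--connection", connection),
            ("--source-config", source_config),
        ]
        if value is not None
    ]
    if len(modes) > 1:
        raise ValueError(f"Conflicting data inputs: {' and '.join(modes)}")
    if not modes:
        raise ValueError("No data input specified")
    return modes[0]
```

Passing no input at all is also an error, so a command always has exactly one resolved source.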
## Data Source Options

The following options are available on all core commands (`check`, `scan`, `mask`, `profile`, `learn`, `compare`, `read`):

| Option | Short | Description |
|--------|-------|-------------|
| `--connection` | `--conn` | Database connection string (see formats below) |
| `--table` | | Database table name to read |
| `--query` | | SQL query to execute (alternative to `--table`) |
| `--source-config` | `--sc` | Path to a data source config file (JSON or YAML) |
| `--source-name` | | Custom label for the data source (used in reports) |

## Connection String Formats

### PostgreSQL

```
postgresql://user:password@host:5432/dbname
```

Install the PostgreSQL backend:

```bash
pip install truthound[postgresql]
```

### MySQL

```
mysql://user:password@host:3306/dbname
```

Install the MySQL backend:

```bash
pip install truthound[mysql]
```

### SQLite

```
sqlite:///path/to/database.db
sqlite:///./relative/path.db
```

SQLite is included by default; no extra install is needed.

### DuckDB

```
duckdb:///path/to/database.duckdb
duckdb:///:memory:
```

Install the DuckDB backend:

```bash
pip install truthound[duckdb]
```

### Microsoft SQL Server

```
mssql://user:password@host:1433/dbname
```

Install the SQL Server backend:

```bash
pip install truthound[mssql]
```

## Source Config File Format

For repeatable or complex connection setups, use a source config file with `--source-config`.

### JSON Example

```json
{
  "type": "postgresql",
  "connection": "postgresql://user:password@host:5432/dbname",
  "table": "users",
  "source_name": "production-users"
}
```
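When generating config files from scripts, you can validate the documented required fields before handing the file to the CLI. A standard-library sketch; `load_source_config` is a hypothetical helper, not the Truthound API:

```python
import json

# Per the docs, a single-source config needs 'connection' or 'type',
# plus 'table' or 'query'.
REQUIRED_ONE_OF = (("connection", "type"), ("table", "query"))


def load_source_config(text: str) -> dict:
    """Parse a JSON source config and check the documented required fields."""
    config = json.loads(text)
    if not isinstance(config, dict):
        raise ValueError("source config must be a JSON object")
    for group in REQUIRED_ONE_OF:
        if not any(key in config for key in group):
            raise ValueError(f"config needs one of: {', '.join(group)}")
    return config


cfg = load_source_config("""
{
  "type": "postgresql",
  "connection": "postgresql://user:password@host:5432/dbname",
  "table": "users",
  "source_name": "production-users"
}
""")
```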
### YAML Example

```yaml
type: postgresql
connection: postgresql://user:password@host:5432/dbname
table: users
source_name: production-users
```

### Using a SQL Query

```yaml
type: postgresql
connection: postgresql://user:password@host:5432/dbname
query: "SELECT id, name, email FROM users WHERE active = true"
source_name: active-users
```

## Dual-Source Config (for `compare`)

The `compare` command accepts two data sources. You can provide a source config file that defines both `baseline` and `current`:

```yaml
baseline:
  type: postgresql
  connection: postgresql://user:pass@host/db
  table: users_baseline
current:
  type: postgresql
  connection: postgresql://user:pass@host/db
  table: users_current
```

Usage:

```bash
truthound compare --source-config compare_sources.yaml --method psi
```

Alternatively, you can specify individual files or connections for each source on the command line.

## Per-Backend Install Hints

Truthound uses optional dependency groups for database backends. Install only what you need:

| Backend | Install Command |
|---------|----------------|
| PostgreSQL | `pip install truthound[postgresql]` |
| MySQL | `pip install truthound[mysql]` |
| DuckDB | `pip install truthound[duckdb]` |
| SQL Server | `pip install truthound[mssql]` |
| BigQuery | `pip install truthound[bigquery]` |
| Snowflake | `pip install truthound[snowflake]` |
| All databases | `pip install truthound[databases]` |

SQLite support is included in the base install.
## Security Considerations

**Do not put passwords directly in CLI history.** Connection strings with embedded credentials are visible in shell history and process listings.

Recommended practices:

1. **Use environment variables:**

   ```bash
   export DB_CONN="postgresql://user:password@host/db"
   truthound check --connection "$DB_CONN" --table users
   ```

2. **Use source config files** with restricted file permissions:

   ```bash
   chmod 600 db_config.yaml
   truthound check --source-config db_config.yaml
   ```

3. **Use `.pgpass` or equivalent** credential files supported by your database client.

4. **Avoid inline passwords** in CI/CD pipelines. Use secrets management (GitHub Secrets, Vault, etc.) and inject via environment variables.

## Examples for Each Command

### check

```bash
# Validate a PostgreSQL table
truthound check --connection "postgresql://user:pass@host/db" --table orders

# Validate with source config
truthound check --source-config prod_db.yaml --strict
```

### scan

```bash
# Scan a database table for PII
truthound scan --connection "postgresql://user:pass@host/db" --table customers
```

### mask

```bash
# Mask PII in a database table and write to a file
truthound mask --connection "sqlite:///data.db" --table users -o masked_users.csv
```

### profile

```bash
# Profile a database table
truthound profile --connection "postgresql://user:pass@host/db" --table transactions
```

### learn

```bash
# Learn schema from a database table
truthound learn --connection "postgresql://user:pass@host/db" --table products -o schema.yaml
```

### compare

```bash
# Compare two database tables
truthound compare --source-config compare_sources.yaml --method psi --strict
```

### read

```bash
# Preview a database table
truthound read --connection "postgresql://user:pass@host/db" --table users --head 20

# Run a SQL query and export as CSV
truthound read --connection "sqlite:///data.db" --query "SELECT * FROM orders WHERE total > 100" --format csv -o high_orders.csv
```

## See Also

- [Data Sources Overview](index.md)
- [Database Connections](databases.md)
- [CLI Core Commands](../../cli/core/index.md)
"""Shared DataSource resolution for CLI commands.

This module provides a unified abstraction layer that resolves CLI
options (file path, connection string, or config file) into either
a file path string or a BaseDataSource instance. All core CLI commands
use this layer for consistent data source handling.

Architecture:
    CLI options → resolve_datasource() → (file_path | None, source | None)
                          ↓                          ↓
                  api.func(data=...)        api.func(source=...)
"""

from __future__ import annotations

import json
import logging
from pathlib import Path
from typing import TYPE_CHECKING, Annotated, Any, Optional

import typer

from truthound.cli_modules.common.errors import (
    CLIError,
    DataSourceError,
    ErrorCode,
    FileNotFoundError,
    require_file,
)

if TYPE_CHECKING:
    from truthound.datasources.base import BaseDataSource

logger = logging.getLogger(__name__)

# =============================================================================
# Reusable Annotated CLI Options
# =============================================================================
ConnectionOpt = Annotated[
    Optional[str],
    typer.Option(
        "--connection",
        "--conn",
        help=(
            "Database connection string. "
            "Examples: postgresql://user:pass@host:5432/db, "
            "mysql://user:pass@host/db, sqlite:///path/to.db"
        ),
    ),
]

TableOpt = Annotated[
    Optional[str],
    typer.Option(
        "--table",
        help="Database table name (required with --connection for SQL sources)",
    ),
]

QueryOpt = Annotated[
    Optional[str],
    typer.Option(
        "--query",
        help="SQL query to validate (alternative to --table)",
    ),
]

SourceConfigOpt = Annotated[
    Optional[Path],
    typer.Option(
        "--source-config",
        "--sc",
        help=(
            "Path to data source configuration file (JSON/YAML). "
            "See docs for config file format."
        ),
    ),
]

SourceNameOpt = Annotated[
    Optional[str],
    typer.Option(
        "--source-name",
        help="Custom name for the data source (used in report labels)",
    ),
]
# =============================================================================
# DataSource Resolution
# =============================================================================


def resolve_datasource(
    file: Path | None = None,
    connection: str | None = None,
    table: str | None = None,
    query: str | None = None,
    source_config: Path | None = None,
    source_name: str | None = None,
) -> tuple[str | None, "BaseDataSource | None"]:
    """Resolve CLI options into a file path or BaseDataSource instance.

    This is the central resolution function used by all CLI commands.
    It enforces mutual exclusivity between input modes and validates
    required parameters for each mode.

    Args:
        file: Path to a data file (CSV, JSON, Parquet, etc.)
        connection: Database connection string
        table: Database table name (for SQL sources)
        query: SQL query string (alternative to table)
        source_config: Path to a JSON/YAML data source config file
        source_name: Custom label for the data source

    Returns:
        A tuple of (file_path, source) where exactly one is non-None.

    Raises:
        DataSourceError: If inputs are invalid or conflicting.
        FileNotFoundError: If the specified file does not exist.
    """
    _validate_input_exclusivity(file, connection, source_config)

    # Mode 1: Source config file
    if source_config is not None:
        require_file(source_config, "Source config file")
        config = parse_source_config(source_config)
        source = create_datasource_from_config(config)
        if source_name:
            _set_source_name(source, source_name)
        return None, source

    # Mode 2: Connection string
    if connection is not None:
        source = _create_from_connection(connection, table, query, source_name)
        return None, source

    # Mode 3: File path (legacy, default)
    if file is not None:
        require_file(file)
        return str(file), None

    # No input provided
    raise DataSourceError(
        "No data input specified.",
        hint=(
            "Provide one of:\n"
            "  - A file path: truthound <command> data.csv\n"
            "  - A connection: truthound <command> --connection 'postgresql://...' --table users\n"
            "  - A config file: truthound <command> --source-config db.yaml"
        ),
    )
def resolve_compare_sources(
    baseline: Path | None = None,
    current: Path | None = None,
    source_config: Path | None = None,
) -> tuple[
    tuple[str | None, "BaseDataSource | None"],
    tuple[str | None, "BaseDataSource | None"],
]:
    """Resolve inputs for the compare command (dual-source).

    Args:
        baseline: Baseline file path
        current: Current file path
        source_config: Config file with baseline/current sections

    Returns:
        Tuple of (baseline_resolution, current_resolution),
        each a (file_path | None, source | None) pair.

    Raises:
        DataSourceError: If inputs are invalid or conflicting.
    """
    if source_config is not None:
        if baseline is not None or current is not None:
            raise DataSourceError(
                "Cannot specify both file paths and --source-config for compare.",
                hint="Use either positional file args OR --source-config, not both.",
            )
        require_file(source_config, "Source config file")
        config = parse_source_config(source_config)
        baseline_cfg = config.get("baseline")
        current_cfg = config.get("current")
        if not baseline_cfg or not current_cfg:
            raise DataSourceError(
                "Compare source config must have 'baseline' and 'current' sections.",
                hint=(
                    "Example config:\n"
                    "  baseline:\n"
                    "    connection: postgresql://...\n"
                    "    table: train_data\n"
                    "  current:\n"
                    "    connection: postgresql://...\n"
                    "    table: prod_data"
                ),
            )
        baseline_source = create_datasource_from_config(baseline_cfg)
        current_source = create_datasource_from_config(current_cfg)
        return (None, baseline_source), (None, current_source)

    # File-based path
    if baseline is None or current is None:
        raise DataSourceError(
            "Both baseline and current data must be specified.",
            hint=(
                "Provide two file paths:\n"
                "  truthound compare baseline.csv current.csv\n"
                "Or use --source-config with baseline/current sections."
            ),
        )
    require_file(baseline, "Baseline file")
    require_file(current, "Current file")
    return (str(baseline), None), (str(current), None)
# =============================================================================
# Config File Parsing
# =============================================================================


def parse_source_config(config_path: Path) -> dict[str, Any]:
    """Parse a data source configuration file (JSON or YAML).

    Supported formats:
        - JSON (.json)
        - YAML (.yaml, .yml)

    Config schema for single source:
        type: postgresql
        connection: "postgresql://user:pass@host:5432/db"
        table: users

    Config schema for compare (dual source):
        baseline:
            connection: "postgresql://..."
            table: train_data
        current:
            connection: "postgresql://..."
            table: prod_data

    Args:
        config_path: Path to the configuration file.

    Returns:
        Parsed configuration dictionary.

    Raises:
        DataSourceError: If the file cannot be parsed.
    """
    content = config_path.read_text(encoding="utf-8")
    suffix = config_path.suffix.lower()

    if suffix == ".json":
        try:
            config = json.loads(content)
        except json.JSONDecodeError as e:
            raise DataSourceError(
                f"Invalid JSON in source config: {e}",
                hint=f"Check the syntax of {config_path}",
            ) from e
    elif suffix in (".yaml", ".yml"):
        try:
            import yaml

            config = yaml.safe_load(content)
        except ImportError as e:
            raise DataSourceError(
                "YAML config requires PyYAML.",
                hint="Install with: pip install pyyaml",
            ) from e
        except Exception as e:
            raise DataSourceError(
                f"Invalid YAML in source config: {e}",
                hint=f"Check the syntax of {config_path}",
            ) from e
    else:
        raise DataSourceError(
            f"Unsupported config file format: {suffix}",
            hint="Use .json, .yaml, or .yml",
        )

    if not isinstance(config, dict):
        raise DataSourceError(
            "Source config must be a JSON/YAML object (dictionary).",
            hint=f"Check {config_path}",
        )

    return config
def create_datasource_from_config(config: dict[str, Any]) -> "BaseDataSource":
    """Create a BaseDataSource from a parsed configuration dictionary.

    Supports two config styles:

    1. Connection string style:
        {"connection": "postgresql://...", "table": "users"}

    2. Individual parameters style:
        {"type": "postgresql", "host": "localhost", "database": "mydb",
         "user": "postgres", "password": "...", "table": "users"}

    Args:
        config: Configuration dictionary.

    Returns:
        Configured BaseDataSource instance.

    Raises:
        DataSourceError: If the config is invalid or the backend is unavailable.
    """
    from truthound.datasources.factory import get_sql_datasource
    from truthound.datasources.sql import get_available_sources

    connection = config.get("connection")
    table = config.get("table")
    query = config.get("query")
    source_type = config.get("type")

    # Style 1: Connection string
    if connection:
        if not table and not query:
            raise DataSourceError(
                "Config with 'connection' requires 'table' or 'query'.",
                hint="Add a 'table' or 'query' field to your config file.",
            )
        try:
            return get_sql_datasource(
                connection, table=table or "__query__", query=query
            )
        except Exception as e:
            raise DataSourceError(
                f"Failed to create data source from connection string: {e}",
                source_type=source_type,
            ) from e

    # Style 2: Individual parameters with type
    if not source_type:
        raise DataSourceError(
            "Config must have either 'connection' or 'type' field.",
            hint=(
                "Example:\n"
                "  connection: postgresql://user:pass@host:5432/db\n"
                "  table: users\n"
                "Or:\n"
                "  type: postgresql\n"
                "  host: localhost\n"
                "  database: mydb\n"
                "  table: users"
            ),
        )

    if not table and not query:
        raise DataSourceError(
            f"Config for type '{source_type}' requires 'table' or 'query'.",
        )

    available = get_available_sources()
    source_cls = available.get(source_type)
    if source_cls is None:
        available_names = [k for k, v in available.items() if v is not None]
        raise DataSourceError(
            f"Data source type '{source_type}' is not available.",
            source_type=source_type,
            hint=(
                f"Available types: {', '.join(available_names)}. "
                f"You may need to install the required driver."
            ),
        )

    # Build constructor kwargs from config, excluding meta keys that are not
    # constructor parameters ('connection' and 'source_name' are handled by
    # other code paths and would otherwise raise a TypeError below).
    meta_keys = {"type", "table", "query", "name", "connection", "source_name"}
    kwargs: dict[str, Any] = {}
    if table:
        kwargs["table"] = table
    if query:
        kwargs["query"] = query
    for key, value in config.items():
        if key not in meta_keys:
            kwargs[key] = value

    try:
        return source_cls(**kwargs)
    except TypeError as e:
        raise DataSourceError(
            f"Invalid config for '{source_type}': {e}",
            source_type=source_type,
            hint=f"Check the supported parameters for {source_type} data source.",
        ) from e
    except Exception as e:
        raise DataSourceError(
            f"Failed to create '{source_type}' data source: {e}",
            source_type=source_type,
        ) from e
# =============================================================================
# Internal Helpers
# =============================================================================


def _validate_input_exclusivity(
    file: Path | None,
    connection: str | None,
    source_config: Path | None,
) -> None:
    """Validate that at most one data input mode is specified."""
    modes = []
    if file is not None:
        modes.append("file argument")
    if connection is not None:
        modes.append("--connection")
    if source_config is not None:
        modes.append("--source-config")
    if len(modes) > 1:
        raise DataSourceError(
            f"Conflicting data inputs: {' and '.join(modes)}.",
            hint="Specify only one: a file path, --connection, or --source-config.",
        )


def _create_from_connection(
    connection: str,
    table: str | None,
    query: str | None,
    source_name: str | None,
) -> "BaseDataSource":
    """Create a BaseDataSource from a connection string."""
    from truthound.datasources.factory import get_sql_datasource

    if not table and not query:
        raise DataSourceError(
            "--table or --query is required with --connection.",
            hint=(
                "Example:\n"
                "  --connection 'postgresql://user:pass@host/db' --table users\n"
                "  --connection 'sqlite:///data.db' --query 'SELECT * FROM orders'"
            ),
        )

    try:
        target = table or "__query__"
        source = get_sql_datasource(connection, table=target, query=query)
    except ImportError as e:
        _raise_driver_hint(connection, e)
    except Exception as e:
        raise DataSourceError(
            f"Failed to connect: {e}",
            hint="Check the connection string format and database availability.",
        )

    if source_name:
        _set_source_name(source, source_name)
    return source


def _set_source_name(source: "BaseDataSource", name: str) -> None:
    """Attempt to set a custom name on a data source."""
    if hasattr(source, "config") and hasattr(source.config, "name"):
        try:
            source.config.name = name
        except (AttributeError, TypeError):
            pass


def _raise_driver_hint(connection: str, error: ImportError) -> None:
    """Raise a DataSourceError with install hints based on connection string."""
    conn_lower = connection.lower()
    hints = {
        "postgresql": ("psycopg2-binary", "pip install truthound[postgresql]"),
        "postgres": ("psycopg2-binary", "pip install truthound[postgresql]"),
        "mysql": ("pymysql", "pip install truthound[mysql]"),
        "oracle": ("oracledb", "pip install oracledb"),
        "mssql": ("pyodbc", "pip install pyodbc"),
        "sqlserver": ("pyodbc", "pip install pyodbc"),
        "bigquery": ("google-cloud-bigquery", "pip install truthound[bigquery]"),
        "snowflake": ("snowflake-connector-python", "pip install truthound[snowflake]"),
        "redshift": ("redshift-connector", "pip install truthound[redshift]"),
        "databricks": ("databricks-sql-connector", "pip install truthound[databricks]"),
        "duckdb": ("duckdb", "pip install duckdb"),
    }
    for prefix, (pkg, install_cmd) in hints.items():
        if prefix in conn_lower:
            raise DataSourceError(
                f"Missing driver for {prefix}: {error}",
                source_type=prefix,
                hint=f"Install with: {install_cmd}",
            )
    raise DataSourceError(
        f"Missing driver: {error}",
        hint="Check that the required database driver is installed.",
    )
"""Read command - Read and preview data from various sources.

This module implements the ``truthound read`` command for loading,
inspecting, and exporting data from files and database connections.
"""

from __future__ import annotations

from pathlib import Path
from typing import Annotated, Optional

import typer

from truthound.cli_modules.common.datasource import (
    ConnectionOpt,
    QueryOpt,
    SourceConfigOpt,
    SourceNameOpt,
    TableOpt,
    resolve_datasource,
)
from truthound.cli_modules.common.errors import error_boundary
from truthound.cli_modules.common.options import parse_list_callback
@error_boundary
def read_cmd(
    file: Annotated[
        Optional[Path],
        typer.Argument(
            help="Path to the data file (CSV, JSON, Parquet, NDJSON)",
        ),
    ] = None,
    # -- DataSource Options --
    connection: ConnectionOpt = None,
    table: TableOpt = None,
    query: QueryOpt = None,
    source_config: SourceConfigOpt = None,
    source_name: SourceNameOpt = None,
    # -- Row Selection --
    sample: Annotated[
        Optional[int],
        typer.Option(
            "--sample",
            "-s",
            help="Return a random sample of N rows",
            min=1,
        ),
    ] = None,
    head: Annotated[
        Optional[int],
        typer.Option(
            "--head",
            "-n",
            help="Show only the first N rows",
            min=1,
        ),
    ] = None,
    # -- Column Selection --
    columns: Annotated[
        Optional[list[str]],
        typer.Option(
            "--columns",
            "-c",
            help="Columns to include (comma-separated)",
        ),
    ] = None,
    # -- Output Options --
    format: Annotated[
        str,
        typer.Option(
            "--format",
            "-f",
            help="Output format (table, csv, json, parquet, ndjson)",
        ),
    ] = "table",
    output: Annotated[
        Optional[Path],
        typer.Option("--output", "-o", help="Output file path"),
    ] = None,
    # -- Inspection Modes --
    schema_only: Annotated[
        bool,
        typer.Option(
            "--schema-only",
            help="Show only column names and types (no data loaded)",
        ),
    ] = False,
    count_only: Annotated[
        bool,
        typer.Option(
            "--count-only",
            help="Show only the row count",
        ),
    ] = False,
) -> None:
    """Read and preview data from files or databases.

    Load data from various sources and display a preview, export to
    another format, or inspect the schema. Supports files (CSV, Parquet,
    JSON) and SQL databases via --connection.

    Examples:
        truthound read data.csv
        truthound read data.parquet --head 20
        truthound read data.csv --format json -o output.json
        truthound read data.csv --columns id,name,age
        truthound read --connection "postgresql://user:pass@host/db" --table users
        truthound read --connection "sqlite:///data.db" --table orders --head 10
        truthound read --source-config db.yaml --sample 1000
        truthound read data.csv --schema-only
        truthound read data.csv --count-only
    """
    import polars as pl

    # Resolve data source
    data_path, source = resolve_datasource(
        file=file,
        connection=connection,
        table=table,
        query=query,
        source_config=source_config,
        source_name=source_name,
    )

    # Load data as LazyFrame
    if source is not None:
        lf = source.to_polars_lazyframe()
        label = source.name
    else:
        from truthound.adapters import to_lazyframe

        lf = to_lazyframe(data_path)
        label = data_path

    # Schema-only mode: no data collection needed
    if schema_only:
        schema = lf.collect_schema()
        typer.echo(f"Source: {label}")
        typer.echo(f"Columns: {len(schema)}\n")
        typer.echo(f"{'Column':<40} {'Type':<20}")
        typer.echo("-" * 60)
        for col_name, col_type in schema.items():
            typer.echo(f"{col_name:<40} {str(col_type):<20}")
        return

    # Count-only mode: minimal collection
    if count_only:
        row_count = lf.select(pl.len()).collect().item()
        typer.echo(f"Source: {label}")
        typer.echo(f"Rows: {row_count:,}")
        return

    # Collect data
    df = lf.collect()

    # Column selection
    column_list = parse_list_callback(columns) if columns else None
    if column_list:
        available = set(df.columns)
        missing = [c for c in column_list if c not in available]
        if missing:
            typer.echo(
                f"Warning: columns not found: {', '.join(missing)}", err=True
            )
        valid_cols = [c for c in column_list if c in available]
        if valid_cols:
            df = df.select(valid_cols)

    # Row selection
    if sample is not None and len(df) > sample:
        df = df.sample(n=sample, seed=42)
    if head is not None:
        df = df.head(head)

    # Output
    if format == "parquet" and output is None:
        typer.echo(
            "Error: --output is required for parquet format", err=True
        )
        raise typer.Exit(1)

    if output:
        _write_output(df, output, format)
        typer.echo(f"Data written to {output} ({len(df):,} rows)")
    else:
        _print_output(df, format, label)
| def _write_output(df: "pl.DataFrame", output: Path, fmt: str) -> None: | ||
| """Write DataFrame to a file in the specified format.""" | ||
| suffix = output.suffix.lower() | ||
| fmt_lower = fmt.lower() | ||
| if fmt_lower == "parquet" or suffix == ".parquet": | ||
| df.write_parquet(output) | ||
| elif fmt_lower == "csv" or suffix == ".csv": | ||
| df.write_csv(output) | ||
| elif fmt_lower == "json" or suffix == ".json": | ||
| df.write_json(output) | ||
| elif fmt_lower == "ndjson" or suffix == ".ndjson": | ||
| df.write_ndjson(output) | ||
| else: | ||
| # Neither format nor suffix recognized: fall back to CSV | ||
| df.write_csv(output) | ||
| def _print_output(df: "pl.DataFrame", fmt: str, label: str | None) -> None: | ||
| """Print DataFrame to stdout.""" | ||
| import polars as pl | ||
| fmt_lower = fmt.lower() | ||
| if fmt_lower == "json": | ||
| typer.echo(df.write_json()) | ||
| elif fmt_lower == "csv": | ||
| typer.echo(df.write_csv()) | ||
| elif fmt_lower == "ndjson": | ||
| typer.echo(df.write_ndjson()) | ||
| else: | ||
| # Table format: use Polars' built-in display | ||
| if label: | ||
| typer.echo(f"Source: {label}") | ||
| typer.echo(f"Shape: {df.shape[0]:,} rows x {df.shape[1]} columns\n") | ||
| with pl.Config(tbl_rows=50, tbl_cols=20, fmt_str_lengths=80): | ||
| typer.echo(str(df)) |
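Note the dispatch order in `_write_output`: each branch matches on the explicit format *or* the output suffix, checked in the order parquet, csv, json, ndjson. A side effect is that a recognized suffix can override a different explicit format (for example, `--format json` with a `.parquet` output path still writes Parquet). A minimal sketch of that resolution, using a hypothetical `resolve_write_format` helper (not part of truthound's API):

```python
def resolve_write_format(fmt: str, suffix: str) -> str:
    """Hypothetical helper mirroring _write_output's branch order:
    each candidate matches on the explicit format OR the file suffix."""
    fmt_lower = fmt.lower()
    for candidate in ("parquet", "csv", "json", "ndjson"):
        if fmt_lower == candidate or suffix == f".{candidate}":
            return candidate
    return "csv"  # fallback branch: write CSV


print(resolve_write_format("json", ".parquet"))  # parquet: suffix matches first
print(resolve_write_format("table", ".ndjson"))  # ndjson
print(resolve_write_format("table", ".dat"))     # csv (fallback)
```

The ordered or-matching keeps the real function short, at the cost of the suffix-beats-format quirk shown in the first call.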
| """Tests for DataSource support across CLI commands. | ||
| Verifies that check, scan, mask, profile, learn, and compare commands | ||
| correctly accept and pass through database connection options. | ||
| """ | ||
| from __future__ import annotations | ||
| import pytest | ||
| from unittest.mock import MagicMock, patch | ||
| import polars as pl | ||
| import typer | ||
| from typer.testing import CliRunner | ||
| from truthound.cli_modules.core.check import check_cmd | ||
| from truthound.cli_modules.core.scan import scan_cmd | ||
| from truthound.cli_modules.core.mask import mask_cmd | ||
| from truthound.cli_modules.core.profile import profile_cmd | ||
| from truthound.cli_modules.core.learn import learn_cmd | ||
| from truthound.cli_modules.core.compare import compare_cmd | ||
| @pytest.fixture | ||
| def runner(): | ||
| return CliRunner() | ||
| def _make_app(cmd, name): | ||
| app = typer.Typer() | ||
| app.command(name=name)(cmd) | ||
| return app | ||
| @pytest.fixture | ||
| def sample_csv(tmp_path): | ||
| csv = tmp_path / "data.csv" | ||
| csv.write_text("id,name,age\n1,Alice,25\n2,Bob,30\n") | ||
| return csv | ||
| def _mock_sql_source(table_name="users"): | ||
| """Create a mock SQL data source returning a small DataFrame.""" | ||
| source = MagicMock() | ||
| source.name = table_name | ||
| lf = pl.LazyFrame({"id": [1, 2], "name": ["Alice", "Bob"], "age": [25, 30]}) | ||
| source.to_polars_lazyframe.return_value = lf | ||
| return source | ||
| # ============================================================================= | ||
| # Check with DataSource | ||
| # ============================================================================= | ||
| class TestCheckWithDatasource: | ||
| """Test check command accepts datasource options.""" | ||
| def test_check_with_connection(self, runner, sample_csv): | ||
| """--connection passes source= to check API.""" | ||
| app = _make_app(check_cmd, "check") | ||
| with ( | ||
| patch("truthound.datasources.factory.get_sql_datasource") as mock_sql, | ||
| patch("truthound.api.check") as mock_check, | ||
| ): | ||
| mock_sql.return_value = _mock_sql_source() | ||
| mock_report = MagicMock() | ||
| mock_report.has_issues = False | ||
| mock_report.exception_summary = None | ||
| mock_check.return_value = mock_report | ||
| result = runner.invoke(app, [ | ||
| "--connection", "postgresql://user:pass@host/db", | ||
| "--table", "users", | ||
| ]) | ||
| assert result.exit_code == 0 | ||
| # Verify source= was passed (not data path) | ||
| assert mock_check.call_args.kwargs.get("source") is not None | ||
| def test_check_file_and_connection_mutually_exclusive(self, runner, sample_csv): | ||
| """file + --connection raises error.""" | ||
| app = _make_app(check_cmd, "check") | ||
| result = runner.invoke(app, [ | ||
| str(sample_csv), | ||
| "--connection", "postgresql://host/db", | ||
| "--table", "t", | ||
| ]) | ||
| assert result.exit_code != 0 | ||
| # ============================================================================= | ||
| # Scan with DataSource | ||
| # ============================================================================= | ||
| class TestScanWithDatasource: | ||
| """Test scan command accepts datasource options.""" | ||
| def test_scan_with_connection(self, runner): | ||
| """--connection passes source= to scan API.""" | ||
| app = _make_app(scan_cmd, "scan") | ||
| with ( | ||
| patch("truthound.datasources.factory.get_sql_datasource") as mock_sql, | ||
| patch("truthound.api.scan") as mock_scan, | ||
| ): | ||
| mock_sql.return_value = _mock_sql_source() | ||
| mock_report = MagicMock() | ||
| mock_scan.return_value = mock_report | ||
| mock_report.print = MagicMock() | ||
| result = runner.invoke(app, [ | ||
| "--connection", "postgresql://user:pass@host/db", | ||
| "--table", "users", | ||
| ]) | ||
| assert result.exit_code == 0 | ||
| mock_scan.assert_called_once() | ||
| # scan(source=source) | ||
| assert mock_scan.call_args.kwargs.get("source") is not None | ||
| def test_scan_no_input_error(self, runner): | ||
| """No input produces an error.""" | ||
| app = _make_app(scan_cmd, "scan") | ||
| result = runner.invoke(app, []) | ||
| assert result.exit_code != 0 | ||
| # ============================================================================= | ||
| # Mask with DataSource | ||
| # ============================================================================= | ||
| class TestMaskWithDatasource: | ||
| """Test mask command accepts datasource options.""" | ||
| def test_mask_with_connection(self, runner, tmp_path): | ||
| """--connection passes source= to mask API.""" | ||
| app = _make_app(mask_cmd, "mask") | ||
| out = tmp_path / "masked.csv" | ||
| with ( | ||
| patch("truthound.datasources.factory.get_sql_datasource") as mock_sql, | ||
| patch("truthound.api.mask") as mock_mask, | ||
| ): | ||
| mock_sql.return_value = _mock_sql_source() | ||
| mock_df = pl.DataFrame({"id": [1, 2], "name": ["***", "***"]}) | ||
| mock_mask.return_value = mock_df | ||
| result = runner.invoke(app, [ | ||
| "--connection", "postgresql://user:pass@host/db", | ||
| "--table", "users", | ||
| "--output", str(out), | ||
| ]) | ||
| assert result.exit_code == 0 | ||
| assert out.exists() | ||
| mock_mask.assert_called_once() | ||
| # ============================================================================= | ||
| # Profile with DataSource | ||
| # ============================================================================= | ||
| class TestProfileWithDatasource: | ||
| """Test profile command accepts datasource options.""" | ||
| def test_profile_with_connection(self, runner): | ||
| """--connection passes source= to profile API.""" | ||
| app = _make_app(profile_cmd, "profile") | ||
| with ( | ||
| patch("truthound.datasources.factory.get_sql_datasource") as mock_sql, | ||
| patch("truthound.api.profile") as mock_profile, | ||
| ): | ||
| mock_sql.return_value = _mock_sql_source() | ||
| mock_report = MagicMock() | ||
| mock_profile.return_value = mock_report | ||
| mock_report.print = MagicMock() | ||
| result = runner.invoke(app, [ | ||
| "--connection", "postgresql://user:pass@host/db", | ||
| "--table", "users", | ||
| ]) | ||
| assert result.exit_code == 0 | ||
| mock_profile.assert_called_once() | ||
| assert mock_profile.call_args.kwargs.get("source") is not None | ||
| # ============================================================================= | ||
| # Learn with DataSource | ||
| # ============================================================================= | ||
| class TestLearnWithDatasource: | ||
| """Test learn command accepts datasource options.""" | ||
| def test_learn_with_connection(self, runner, tmp_path): | ||
| """--connection passes source= to learn API.""" | ||
| app = _make_app(learn_cmd, "learn") | ||
| out = tmp_path / "schema.yaml" | ||
| with ( | ||
| patch("truthound.datasources.factory.get_sql_datasource") as mock_sql, | ||
| patch("truthound.schema.learn") as mock_learn, | ||
| ): | ||
| mock_sql.return_value = _mock_sql_source() | ||
| mock_schema = MagicMock() | ||
| mock_schema.columns = ["id", "name", "age"] | ||
| mock_schema.row_count = 2 | ||
| mock_learn.return_value = mock_schema | ||
| result = runner.invoke(app, [ | ||
| "--connection", "postgresql://user:pass@host/db", | ||
| "--table", "users", | ||
| "--output", str(out), | ||
| ]) | ||
| assert result.exit_code == 0 | ||
| mock_learn.assert_called_once() | ||
| assert mock_learn.call_args.kwargs.get("source") is not None | ||
| # ============================================================================= | ||
| # Compare with DataSource Config | ||
| # ============================================================================= | ||
| class TestCompareWithDatasource: | ||
| """Test compare command accepts --source-config for dual sources.""" | ||
| def test_compare_with_source_config(self, runner, tmp_path): | ||
| """--source-config with baseline/current sections works.""" | ||
| app = _make_app(compare_cmd, "compare") | ||
| cfg = tmp_path / "drift.yaml" | ||
| cfg.write_text( | ||
| "baseline:\n" | ||
| " connection: 'postgresql://host/db'\n" | ||
| " table: train\n" | ||
| "current:\n" | ||
| " connection: 'postgresql://host/db'\n" | ||
| " table: prod\n" | ||
| ) | ||
| with ( | ||
| patch("truthound.datasources.factory.get_sql_datasource") as mock_sql, | ||
| patch("truthound.drift.compare") as mock_compare, | ||
| ): | ||
| source_b = MagicMock() | ||
| source_c = MagicMock() | ||
| lf_b = pl.LazyFrame({"x": [1, 2, 3]}) | ||
| lf_c = pl.LazyFrame({"x": [4, 5, 6]}) | ||
| source_b.to_polars_lazyframe.return_value = lf_b | ||
| source_c.to_polars_lazyframe.return_value = lf_c | ||
| mock_sql.side_effect = [source_b, source_c] | ||
| mock_report = MagicMock() | ||
| mock_report.has_drift = False | ||
| mock_compare.return_value = mock_report | ||
| mock_report.print = MagicMock() | ||
| result = runner.invoke(app, [ | ||
| "--source-config", str(cfg), | ||
| ]) | ||
| assert result.exit_code == 0 | ||
| mock_compare.assert_called_once() | ||
| def test_compare_files_still_works(self, runner, tmp_path): | ||
| """Positional file arguments still work.""" | ||
| app = _make_app(compare_cmd, "compare") | ||
| f1 = tmp_path / "base.csv" | ||
| f2 = tmp_path / "curr.csv" | ||
| f1.write_text("x\n1\n2\n3\n") | ||
| f2.write_text("x\n4\n5\n6\n") | ||
| with patch("truthound.drift.compare") as mock_compare: | ||
| mock_report = MagicMock() | ||
| mock_report.has_drift = False | ||
| mock_compare.return_value = mock_report | ||
| mock_report.print = MagicMock() | ||
| result = runner.invoke(app, [str(f1), str(f2)]) | ||
| assert result.exit_code == 0 | ||
| mock_compare.assert_called_once() |
| """Tests for the shared DataSource resolution layer. | ||
| Tests cover: resolve_datasource(), resolve_compare_sources(), | ||
| parse_source_config(), create_datasource_from_config(), and | ||
| input validation logic. | ||
| """ | ||
| from __future__ import annotations | ||
| import json | ||
| import pytest | ||
| from unittest.mock import MagicMock, patch | ||
| from truthound.cli_modules.common.datasource import ( | ||
| create_datasource_from_config, | ||
| parse_source_config, | ||
| resolve_compare_sources, | ||
| resolve_datasource, | ||
| ) | ||
| from truthound.cli_modules.common.errors import DataSourceError | ||
| # ============================================================================= | ||
| # Fixtures | ||
| # ============================================================================= | ||
| @pytest.fixture | ||
| def sample_csv(tmp_path): | ||
| """Create a sample CSV file.""" | ||
| csv = tmp_path / "data.csv" | ||
| csv.write_text("id,name\n1,Alice\n2,Bob\n") | ||
| return csv | ||
| @pytest.fixture | ||
| def source_config_json(tmp_path): | ||
| """Create a JSON source config file.""" | ||
| cfg = tmp_path / "source.json" | ||
| cfg.write_text(json.dumps({ | ||
| "connection": "postgresql://user:pass@host:5432/db", | ||
| "table": "users", | ||
| })) | ||
| return cfg | ||
| @pytest.fixture | ||
| def source_config_yaml(tmp_path): | ||
| """Create a YAML source config file.""" | ||
| cfg = tmp_path / "source.yaml" | ||
| cfg.write_text( | ||
| "connection: 'postgresql://user:pass@host:5432/db'\n" | ||
| "table: users\n" | ||
| ) | ||
| return cfg | ||
| @pytest.fixture | ||
| def compare_config_yaml(tmp_path): | ||
| """Create a YAML compare config file with baseline/current sections.""" | ||
| cfg = tmp_path / "compare.yaml" | ||
| cfg.write_text( | ||
| "baseline:\n" | ||
| " connection: 'postgresql://user:pass@host/db'\n" | ||
| " table: train_data\n" | ||
| "current:\n" | ||
| " connection: 'postgresql://user:pass@host/db'\n" | ||
| " table: prod_data\n" | ||
| ) | ||
| return cfg | ||
| # ============================================================================= | ||
| # resolve_datasource | ||
| # ============================================================================= | ||
| class TestResolveDatasource: | ||
| """Tests for resolve_datasource().""" | ||
| def test_file_only_returns_path(self, sample_csv): | ||
| """File-only input returns (str_path, None).""" | ||
| data_path, source = resolve_datasource(file=sample_csv) | ||
| assert data_path == str(sample_csv) | ||
| assert source is None | ||
| def test_no_input_raises_error(self): | ||
| """No input raises DataSourceError.""" | ||
| with pytest.raises(DataSourceError, match="No data input specified"): | ||
| resolve_datasource() | ||
| def test_file_not_found_raises_error(self, tmp_path): | ||
| """Non-existent file raises error.""" | ||
| fake = tmp_path / "nonexistent.csv" | ||
| # May surface as FileNotFoundError or DataSourceError | ||
| with pytest.raises(Exception): | ||
| resolve_datasource(file=fake) | ||
| def test_file_and_connection_mutually_exclusive(self, sample_csv): | ||
| """Providing both file and connection raises error.""" | ||
| with pytest.raises(DataSourceError, match="Conflicting"): | ||
| resolve_datasource(file=sample_csv, connection="postgresql://host/db") | ||
| def test_file_and_source_config_mutually_exclusive(self, sample_csv, source_config_json): | ||
| """Providing both file and source_config raises error.""" | ||
| with pytest.raises(DataSourceError, match="Conflicting"): | ||
| resolve_datasource(file=sample_csv, source_config=source_config_json) | ||
| def test_connection_and_source_config_mutually_exclusive(self, source_config_json): | ||
| """Providing both connection and source_config raises error.""" | ||
| with pytest.raises(DataSourceError, match="Conflicting"): | ||
| resolve_datasource( | ||
| connection="postgresql://host/db", | ||
| source_config=source_config_json, | ||
| ) | ||
| def test_connection_without_table_raises_error(self): | ||
| """Connection without table or query raises error.""" | ||
| with pytest.raises(DataSourceError, match="--table or --query"): | ||
| resolve_datasource(connection="postgresql://user:pass@host/db") | ||
| @patch("truthound.datasources.factory.get_sql_datasource") | ||
| def test_connection_with_table_returns_source(self, mock_get_sql): | ||
| """Connection + table returns (None, source).""" | ||
| mock_source = MagicMock() | ||
| mock_get_sql.return_value = mock_source | ||
| data_path, source = resolve_datasource( | ||
| connection="postgresql://user:pass@host/db", | ||
| table="users", | ||
| ) | ||
| assert data_path is None | ||
| assert source is mock_source | ||
| mock_get_sql.assert_called_once_with( | ||
| "postgresql://user:pass@host/db", table="users", query=None | ||
| ) | ||
| @patch("truthound.datasources.factory.get_sql_datasource") | ||
| def test_connection_with_query_returns_source(self, mock_get_sql): | ||
| """Connection + query returns (None, source).""" | ||
| mock_source = MagicMock() | ||
| mock_get_sql.return_value = mock_source | ||
| data_path, source = resolve_datasource( | ||
| connection="postgresql://user:pass@host/db", | ||
| query="SELECT * FROM orders WHERE date > '2024-01-01'", | ||
| ) | ||
| assert data_path is None | ||
| assert source is mock_source | ||
| @patch("truthound.cli_modules.common.datasource.create_datasource_from_config") | ||
| @patch("truthound.cli_modules.common.datasource.parse_source_config") | ||
| def test_source_config_returns_source(self, mock_parse, mock_create, source_config_json): | ||
| """Source config file returns (None, source).""" | ||
| mock_config = {"connection": "postgresql://...", "table": "users"} | ||
| mock_parse.return_value = mock_config | ||
| mock_source = MagicMock() | ||
| mock_create.return_value = mock_source | ||
| data_path, source = resolve_datasource(source_config=source_config_json) | ||
| assert data_path is None | ||
| assert source is mock_source | ||
| mock_parse.assert_called_once_with(source_config_json) | ||
| mock_create.assert_called_once_with(mock_config) | ||
| @patch("truthound.datasources.factory.get_sql_datasource") | ||
| def test_source_name_applied(self, mock_get_sql): | ||
| """--source-name is applied to the data source.""" | ||
| mock_source = MagicMock() | ||
| mock_source.config = MagicMock() | ||
| mock_get_sql.return_value = mock_source | ||
| resolve_datasource( | ||
| connection="postgresql://host/db", | ||
| table="users", | ||
| source_name="my-label", | ||
| ) | ||
| # source_name should have been set | ||
| assert mock_source.config.name == "my-label" | ||
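The tests above pin down the validation contract of `resolve_datasource()`: exactly one of file / connection / source-config must be given, and a connection additionally needs `--table` or `--query`. A minimal sketch of that contract, using a simplified hypothetical `validate_inputs` (the real function also constructs and returns the data source):

```python
class DataSourceError(Exception):
    pass


def validate_inputs(file=None, connection=None, source_config=None,
                    table=None, query=None) -> str:
    """Sketch of the validation contract; returns which input was chosen."""
    provided = [name for name, value in (
        ("file", file),
        ("connection", connection),
        ("source-config", source_config),
    ) if value is not None]
    if len(provided) > 1:
        raise DataSourceError("Conflicting inputs: " + ", ".join(provided))
    if not provided:
        raise DataSourceError("No data input specified")
    if connection is not None and table is None and query is None:
        raise DataSourceError("--table or --query is required with --connection")
    return provided[0]


print(validate_inputs(connection="postgresql://host/db", table="users"))  # connection
```

Counting the provided inputs first keeps the conflict check symmetric, so every pairwise combination is rejected with the same error shape the tests match on.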
| # ============================================================================= | ||
| # resolve_compare_sources | ||
| # ============================================================================= | ||
| class TestResolveCompareSources: | ||
| """Tests for resolve_compare_sources().""" | ||
| def test_two_files_returns_paths(self, tmp_path): | ||
| """Two file paths return ((path1, None), (path2, None)).""" | ||
| f1 = tmp_path / "base.csv" | ||
| f2 = tmp_path / "curr.csv" | ||
| f1.write_text("a\n1\n") | ||
| f2.write_text("a\n2\n") | ||
| (bp, bs), (cp, cs) = resolve_compare_sources(baseline=f1, current=f2) | ||
| assert bp == str(f1) and bs is None | ||
| assert cp == str(f2) and cs is None | ||
| def test_missing_one_file_raises_error(self, tmp_path): | ||
| """Only one file provided raises error.""" | ||
| f1 = tmp_path / "base.csv" | ||
| f1.write_text("a\n1\n") | ||
| with pytest.raises(DataSourceError, match="Both baseline and current"): | ||
| resolve_compare_sources(baseline=f1) | ||
| def test_no_files_no_config_raises_error(self): | ||
| """No arguments raises error.""" | ||
| with pytest.raises(DataSourceError, match="Both baseline and current"): | ||
| resolve_compare_sources() | ||
| def test_files_and_config_raises_error(self, tmp_path, compare_config_yaml): | ||
| """Files + config raises error.""" | ||
| f1 = tmp_path / "base.csv" | ||
| f1.write_text("a\n1\n") | ||
| with pytest.raises(DataSourceError, match="Cannot specify both"): | ||
| resolve_compare_sources(baseline=f1, source_config=compare_config_yaml) | ||
| @patch("truthound.cli_modules.common.datasource.create_datasource_from_config") | ||
| @patch("truthound.cli_modules.common.datasource.parse_source_config") | ||
| def test_config_returns_dual_sources(self, mock_parse, mock_create, compare_config_yaml): | ||
| """Config file with baseline/current returns two sources.""" | ||
| mock_parse.return_value = { | ||
| "baseline": {"connection": "pg://...", "table": "train"}, | ||
| "current": {"connection": "pg://...", "table": "prod"}, | ||
| } | ||
| mock_source_b = MagicMock() | ||
| mock_source_c = MagicMock() | ||
| mock_create.side_effect = [mock_source_b, mock_source_c] | ||
| (bp, bs), (cp, cs) = resolve_compare_sources(source_config=compare_config_yaml) | ||
| assert bp is None and bs is mock_source_b | ||
| assert cp is None and cs is mock_source_c | ||
| @patch("truthound.cli_modules.common.datasource.parse_source_config") | ||
| def test_config_missing_baseline_raises_error(self, mock_parse, compare_config_yaml): | ||
| """Config missing baseline section raises error.""" | ||
| mock_parse.return_value = {"current": {"connection": "pg://...", "table": "t"}} | ||
| with pytest.raises(DataSourceError, match="baseline.*current"): | ||
| resolve_compare_sources(source_config=compare_config_yaml) | ||
| # ============================================================================= | ||
| # parse_source_config | ||
| # ============================================================================= | ||
| class TestParseSourceConfig: | ||
| """Tests for parse_source_config().""" | ||
| def test_parse_json(self, tmp_path): | ||
| """JSON config file is parsed correctly.""" | ||
| cfg = tmp_path / "cfg.json" | ||
| cfg.write_text(json.dumps({"connection": "pg://host/db", "table": "t"})) | ||
| result = parse_source_config(cfg) | ||
| assert result["connection"] == "pg://host/db" | ||
| assert result["table"] == "t" | ||
| def test_parse_yaml(self, tmp_path): | ||
| """YAML config file is parsed correctly.""" | ||
| cfg = tmp_path / "cfg.yaml" | ||
| cfg.write_text("connection: 'pg://host/db'\ntable: t\n") | ||
| result = parse_source_config(cfg) | ||
| assert result["connection"] == "pg://host/db" | ||
| assert result["table"] == "t" | ||
| def test_parse_yml(self, tmp_path): | ||
| """YML extension also works.""" | ||
| cfg = tmp_path / "cfg.yml" | ||
| cfg.write_text("connection: 'pg://host/db'\ntable: t\n") | ||
| result = parse_source_config(cfg) | ||
| assert result["table"] == "t" | ||
| def test_invalid_json_raises_error(self, tmp_path): | ||
| """Malformed JSON raises DataSourceError.""" | ||
| cfg = tmp_path / "bad.json" | ||
| cfg.write_text("{invalid json}") | ||
| with pytest.raises(DataSourceError, match="Invalid JSON"): | ||
| parse_source_config(cfg) | ||
| def test_non_dict_raises_error(self, tmp_path): | ||
| """Non-dict JSON content raises DataSourceError.""" | ||
| cfg = tmp_path / "arr.json" | ||
| cfg.write_text('["a", "b"]') | ||
| with pytest.raises(DataSourceError, match="must be a JSON/YAML object"): | ||
| parse_source_config(cfg) | ||
| def test_unsupported_extension_raises_error(self, tmp_path): | ||
| """Unsupported file extension raises DataSourceError.""" | ||
| cfg = tmp_path / "cfg.toml" | ||
| cfg.write_text("[table]\nname = 'x'") | ||
| with pytest.raises(DataSourceError, match="Unsupported config file format"): | ||
| parse_source_config(cfg) | ||
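These cases together describe `parse_source_config()`'s behavior: dispatch on extension (`.json`, `.yaml`, `.yml`), reject anything else, and require a mapping at the top level. A minimal hypothetical reimplementation of that contract (not truthound's actual code; the YAML branch assumes PyYAML):

```python
import json
from pathlib import Path


class DataSourceError(Exception):
    pass


def parse_source_config_sketch(path: Path) -> dict:
    """Sketch of the parsing contract the tests above describe."""
    suffix = path.suffix.lower()
    if suffix not in (".json", ".yaml", ".yml"):
        raise DataSourceError(f"Unsupported config file format: {suffix}")
    text = path.read_text()
    if suffix == ".json":
        try:
            data = json.loads(text)
        except json.JSONDecodeError as exc:
            raise DataSourceError(f"Invalid JSON in {path}: {exc}") from exc
    else:
        import yaml  # assumed dependency for YAML support
        data = yaml.safe_load(text)
    if not isinstance(data, dict):
        raise DataSourceError("Config must be a JSON/YAML object")
    return data
```

Checking the suffix before reading the file means an unsupported extension fails fast with a config error rather than an I/O error.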
| # ============================================================================= | ||
| # create_datasource_from_config | ||
| # ============================================================================= | ||
| class TestCreateDatasourceFromConfig: | ||
| """Tests for create_datasource_from_config().""" | ||
| @patch("truthound.datasources.factory.get_sql_datasource") | ||
| def test_connection_string_style(self, mock_get_sql): | ||
| """Config with 'connection' delegates to get_sql_datasource.""" | ||
| mock_source = MagicMock() | ||
| mock_get_sql.return_value = mock_source | ||
| result = create_datasource_from_config({ | ||
| "connection": "postgresql://host/db", | ||
| "table": "users", | ||
| }) | ||
| assert result is mock_source | ||
| def test_connection_without_table_raises_error(self): | ||
| """Config with 'connection' but no 'table' raises error.""" | ||
| with pytest.raises(DataSourceError, match="requires 'table' or 'query'"): | ||
| create_datasource_from_config({"connection": "postgresql://host/db"}) | ||
| def test_no_connection_no_type_raises_error(self): | ||
| """Config without 'connection' or 'type' raises error.""" | ||
| with pytest.raises(DataSourceError, match="must have either"): | ||
| create_datasource_from_config({"table": "users"}) | ||
| @patch("truthound.datasources.sql.get_available_sources") | ||
| def test_type_not_available_raises_error(self, mock_available): | ||
| """Unavailable type raises DataSourceError with available list.""" | ||
| mock_available.return_value = {"postgresql": MagicMock(), "mysql": MagicMock()} | ||
| with pytest.raises(DataSourceError, match="not available"): | ||
| create_datasource_from_config({ | ||
| "type": "oracle", | ||
| "table": "users", | ||
| "host": "localhost", | ||
| }) | ||
| @patch("truthound.datasources.sql.get_available_sources") | ||
| def test_type_style_creates_source(self, mock_available): | ||
| """Config with 'type' constructs from source class.""" | ||
| mock_cls = MagicMock() | ||
| mock_source = MagicMock() | ||
| mock_cls.return_value = mock_source | ||
| mock_available.return_value = {"postgresql": mock_cls} | ||
| result = create_datasource_from_config({ | ||
| "type": "postgresql", | ||
| "table": "users", | ||
| "host": "localhost", | ||
| "database": "mydb", | ||
| }) | ||
| assert result is mock_source | ||
| mock_cls.assert_called_once_with( | ||
| table="users", host="localhost", database="mydb" | ||
| ) |
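The final assertion is the interesting one: the source class is called with `table`, `host`, and `database` but *not* `type`, implying the `type`-style path pops the discriminator and forwards every remaining key as a constructor kwarg. A hedged sketch of that branch (hypothetical helper, not truthound's actual code):

```python
class DataSourceError(Exception):
    pass


def create_from_type_config(config: dict, available: dict):
    """Sketch of the 'type'-style branch: pop 'type', look up the source
    class, and pass every remaining key through as a constructor kwarg."""
    cfg = dict(config)  # copy so the caller's dict is not mutated
    source_type = cfg.pop("type")
    if source_type not in available:
        raise DataSourceError(
            f"Source type '{source_type}' not available; "
            f"available: {', '.join(sorted(available))}"
        )
    return available[source_type](**cfg)
```

Passing the remaining keys straight through keeps the config schema open-ended: each backend's constructor, not the resolver, decides which connection parameters it accepts.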
| """Tests for the ``truthound read`` CLI command.""" | ||
| from __future__ import annotations | ||
| import json | ||
| import pytest | ||
| from pathlib import Path | ||
| from unittest.mock import patch, MagicMock | ||
| import typer | ||
| from typer.testing import CliRunner | ||
| from truthound.cli_modules.core.read import read_cmd | ||
| @pytest.fixture | ||
| def runner(): | ||
| return CliRunner() | ||
| @pytest.fixture | ||
| def app(): | ||
| _app = typer.Typer() | ||
| _app.command(name="read")(read_cmd) | ||
| return _app | ||
| @pytest.fixture | ||
| def sample_csv(tmp_path): | ||
| csv = tmp_path / "data.csv" | ||
| csv.write_text( | ||
| "id,name,age,city\n" | ||
| "1,Alice,25,NYC\n" | ||
| "2,Bob,30,LA\n" | ||
| "3,Charlie,35,Chicago\n" | ||
| "4,Diana,40,Boston\n" | ||
| "5,Eve,28,Seattle\n" | ||
| ) | ||
| return csv | ||
| @pytest.fixture | ||
| def sample_json(tmp_path): | ||
| jf = tmp_path / "data.json" | ||
| data = [ | ||
| {"id": 1, "name": "Alice"}, | ||
| {"id": 2, "name": "Bob"}, | ||
| ] | ||
| jf.write_text(json.dumps(data)) | ||
| return jf | ||
| # ============================================================================= | ||
| # Basic Read | ||
| # ============================================================================= | ||
| class TestReadBasic: | ||
| """Basic file reading tests.""" | ||
| def test_read_csv(self, runner, app, sample_csv): | ||
| """Read CSV file outputs data.""" | ||
| result = runner.invoke(app, [str(sample_csv)]) | ||
| assert result.exit_code == 0 | ||
| assert "5 rows" in result.output or "Shape" in result.output | ||
| def test_read_no_input_error(self, runner, app): | ||
| """No input produces an error.""" | ||
| result = runner.invoke(app, []) | ||
| assert result.exit_code != 0 | ||
| def test_read_nonexistent_file_error(self, runner, app, tmp_path): | ||
| """Non-existent file produces an error.""" | ||
| fake = tmp_path / "missing.csv" | ||
| result = runner.invoke(app, [str(fake)]) | ||
| assert result.exit_code != 0 | ||
| # ============================================================================= | ||
| # Row/Column Selection | ||
| # ============================================================================= | ||
| class TestReadSelection: | ||
| """Row and column selection tests.""" | ||
| def test_head(self, runner, app, sample_csv): | ||
| """--head limits rows.""" | ||
| result = runner.invoke(app, [str(sample_csv), "--head", "2"]) | ||
| assert result.exit_code == 0 | ||
| assert "2 rows" in result.output or "Shape: 2" in result.output | ||
| def test_columns(self, runner, app, sample_csv): | ||
| """--columns selects specific columns.""" | ||
| result = runner.invoke(app, [str(sample_csv), "--columns", "id,name"]) | ||
| assert result.exit_code == 0 | ||
| assert "2 columns" in result.output or "x 2" in result.output | ||
| def test_columns_missing_warns(self, runner, app, sample_csv): | ||
| """Missing columns produce a warning.""" | ||
| result = runner.invoke(app, [str(sample_csv), "--columns", "id,nonexistent"]) | ||
| assert result.exit_code == 0 | ||
| assert "not found" in result.output | ||
| def test_head_and_columns(self, runner, app, sample_csv): | ||
| """--head and --columns together.""" | ||
| result = runner.invoke(app, [str(sample_csv), "--head", "3", "--columns", "name,age"]) | ||
| assert result.exit_code == 0 | ||
| def test_sample(self, runner, app, sample_csv): | ||
| """--sample returns subset.""" | ||
| result = runner.invoke(app, [str(sample_csv), "--sample", "2"]) | ||
| assert result.exit_code == 0 | ||
| assert "2 rows" in result.output or "Shape: 2" in result.output | ||
| # ============================================================================= | ||
| # Inspection Modes | ||
| # ============================================================================= | ||
| class TestReadInspection: | ||
| """Schema-only and count-only mode tests.""" | ||
| def test_schema_only(self, runner, app, sample_csv): | ||
| """--schema-only shows column names and types.""" | ||
| result = runner.invoke(app, [str(sample_csv), "--schema-only"]) | ||
| assert result.exit_code == 0 | ||
| assert "Column" in result.output | ||
| assert "Type" in result.output | ||
| assert "id" in result.output | ||
| assert "name" in result.output | ||
| def test_count_only(self, runner, app, sample_csv): | ||
| """--count-only shows just the row count.""" | ||
| result = runner.invoke(app, [str(sample_csv), "--count-only"]) | ||
| assert result.exit_code == 0 | ||
| assert "Rows:" in result.output | ||
| assert "5" in result.output | ||
| # ============================================================================= | ||
| # Output Formats | ||
| # ============================================================================= | ||
| class TestReadFormats: | ||
| """Output format tests.""" | ||
| def test_format_csv(self, runner, app, sample_csv): | ||
| """--format csv outputs CSV text.""" | ||
| result = runner.invoke(app, [str(sample_csv), "--format", "csv", "--head", "2"]) | ||
| assert result.exit_code == 0 | ||
| assert "id,name,age,city" in result.output | ||
| def test_format_json(self, runner, app, sample_csv): | ||
| """--format json outputs valid JSON.""" | ||
| result = runner.invoke(app, [str(sample_csv), "--format", "json", "--head", "2"]) | ||
| assert result.exit_code == 0 | ||
| data = json.loads(result.output) | ||
| # Polars write_json output is valid JSON (format may vary by version) | ||
| assert isinstance(data, (dict, list)) | ||
| def test_format_ndjson(self, runner, app, sample_csv): | ||
| """--format ndjson outputs newline-delimited JSON.""" | ||
| result = runner.invoke(app, [str(sample_csv), "--format", "ndjson", "--head", "2"]) | ||
| assert result.exit_code == 0 | ||
| lines = [line for line in result.output.strip().split("\n") if line.strip()] | ||
| assert len(lines) == 2 | ||
| def test_parquet_requires_output(self, runner, app, sample_csv): | ||
| """--format parquet without --output is an error.""" | ||
| result = runner.invoke(app, [str(sample_csv), "--format", "parquet"]) | ||
| assert result.exit_code == 1 | ||
| assert "required" in result.output.lower() | ||
| # ============================================================================= | ||
| # Output File | ||
| # ============================================================================= | ||
| class TestReadOutput: | ||
| """Output file tests.""" | ||
| def test_output_csv(self, runner, app, sample_csv, tmp_path): | ||
| """--output writes CSV file.""" | ||
| out = tmp_path / "out.csv" | ||
| result = runner.invoke(app, [str(sample_csv), "--output", str(out), "--head", "3"]) | ||
| assert result.exit_code == 0 | ||
| assert out.exists() | ||
| assert "written to" in result.output | ||
| content = out.read_text() | ||
| assert "id" in content | ||
| def test_output_json(self, runner, app, sample_csv, tmp_path): | ||
| """--output with json format writes JSON file.""" | ||
| out = tmp_path / "out.json" | ||
| result = runner.invoke(app, [ | ||
| str(sample_csv), "--output", str(out), "--format", "json", "--head", "2", | ||
| ]) | ||
| assert result.exit_code == 0 | ||
| assert out.exists() | ||
| def test_output_parquet(self, runner, app, sample_csv, tmp_path): | ||
| """--output with parquet format writes Parquet file.""" | ||
| out = tmp_path / "out.parquet" | ||
| result = runner.invoke(app, [ | ||
| str(sample_csv), "--output", str(out), "--format", "parquet", | ||
| ]) | ||
| assert result.exit_code == 0 | ||
| assert out.exists() | ||
| assert out.stat().st_size > 0 | ||
| # ============================================================================= | ||
| # DataSource Integration (mocked) | ||
| # ============================================================================= | ||
| class TestReadWithConnection: | ||
| """Test read command with mocked database connection.""" | ||
| @patch("truthound.datasources.factory.get_sql_datasource") | ||
| def test_read_with_connection(self, mock_get_sql, runner, app): | ||
| """--connection + --table uses DataSource.""" | ||
| import polars as pl | ||
| mock_source = MagicMock() | ||
| mock_source.name = "test_table" | ||
| mock_lf = pl.LazyFrame({"id": [1, 2], "name": ["a", "b"]}) | ||
| mock_source.to_polars_lazyframe.return_value = mock_lf | ||
| mock_get_sql.return_value = mock_source | ||
| result = runner.invoke(app, [ | ||
| "--connection", "postgresql://user:pass@host/db", | ||
| "--table", "users", | ||
| ]) | ||
| assert result.exit_code == 0 | ||
| assert "2 rows" in result.output or "Shape" in result.output | ||
| @patch("truthound.datasources.factory.get_sql_datasource") | ||
| def test_read_schema_only_with_connection(self, mock_get_sql, runner, app): | ||
| """--schema-only works with database source.""" | ||
| import polars as pl | ||
| mock_source = MagicMock() | ||
| mock_source.name = "test_table" | ||
| mock_lf = pl.LazyFrame({"id": [1], "name": ["a"]}) | ||
| mock_source.to_polars_lazyframe.return_value = mock_lf | ||
| mock_get_sql.return_value = mock_source | ||
| result = runner.invoke(app, [ | ||
| "--connection", "postgresql://user:pass@host/db", | ||
| "--table", "users", | ||
| "--schema-only", | ||
| ]) | ||
| assert result.exit_code == 0 | ||
| assert "id" in result.output | ||
| assert "name" in result.output |
@@ -8,3 +8,3 @@ # truthound check | ||
| ```bash | ||
| truthound check <file> [OPTIONS] | ||
| truthound check [FILE] [OPTIONS] | ||
| ``` | ||
@@ -16,4 +16,14 @@ | ||
| |----------|----------|-------------| | ||
| | `file` | Yes | Path to the data file (CSV, JSON, Parquet, NDJSON, JSONL) | | ||
| | `file` | No | Path to the data file (CSV, JSON, Parquet, NDJSON, JSONL) | | ||
| ## Data Source Options | ||
| | Option | Short | Default | Description | | ||
| |--------|-------|---------|-------------| | ||
| | `--connection` | `--conn` | None | Database connection string | | ||
| | `--table` | | None | Database table name | | ||
| | `--query` | | None | SQL query (alternative to `--table`) | | ||
| | `--source-config` | `--sc` | None | Path to data source config file (JSON/YAML) | | ||
| | `--source-name` | | None | Custom label for the data source | | ||
| ## Options | ||
@@ -42,2 +52,5 @@ | ||
| | `--max-unexpected-rows` | | `1000` | Maximum number of unexpected rows to include | | ||
| | `--partial-unexpected-count` | | `20` | Maximum number of unexpected values in partial list (BASIC+) | | ||
| | `--include-unexpected-index` | | `false` | Include row index for each unexpected value in results | | ||
| | `--return-debug-query` | | `false` | Include Polars debug query expression in results (COMPLETE level) | | ||
@@ -52,2 +65,11 @@ ### Exception Handling Options (VE-5) | ||
| ### Execution Options | ||
| | Option | Short | Default | Description | | ||
| |--------|-------|---------|-------------| | ||
| | `--parallel` / `--no-parallel` | | `false` | Enable DAG-based parallel execution with dependency-aware scheduling | | ||
| | `--max-workers` | | Auto | Maximum worker threads (only with `--parallel`). Defaults to `min(32, cpu_count + 4)` | | ||
| | `--pushdown` / `--no-pushdown` | | Auto | Enable query pushdown for SQL data sources. Auto-detects by default | | ||
| | `--use-engine` / `--no-use-engine` | | `false` | Use execution engine for validation (experimental) | | ||
| ## Description | ||
@@ -149,2 +171,73 @@ | ||
| ### Parallel Execution | ||
| Enable DAG-based parallel execution for large validator sets: | ||
| ```bash | ||
| # Enable parallel execution with automatic worker count | ||
| truthound check data.csv --parallel | ||
| # Control the number of worker threads | ||
| truthound check data.csv --parallel --max-workers 8 | ||
| # Combine with other options | ||
| truthound check data.csv --parallel --max-workers 4 --rf summary --strict | ||
| ``` | ||
| Validators are organized into dependency levels (Schema → Completeness → Uniqueness → Distribution → Referential) and executed concurrently within each level. | ||
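The level-by-level scheduling described above can be sketched as follows. This is an illustrative example only, not truthound's actual internals: the level grouping, validator names, and `run_validator` stub are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical dependency levels; each level starts only after the previous finishes.
levels = [
    ["schema"],                 # level 0: schema
    ["null", "not_empty"],      # level 1: completeness
    ["unique"],                 # level 2: uniqueness
    ["range", "distribution"],  # level 3: distribution
    ["foreign_key"],            # level 4: referential
]

def run_validator(name: str) -> str:
    # Placeholder for real validation work.
    return f"{name}: ok"

results: list[str] = []
with ThreadPoolExecutor(max_workers=8) as pool:
    for level in levels:
        # Validators within one level have no mutual dependencies,
        # so they can run concurrently; pool.map preserves input order.
        results.extend(pool.map(run_validator, level))

print(results)
```

Within each level the worker pool is bounded by `--max-workers`; between levels there is a hard barrier, which is what makes the schedule dependency-aware.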
| ### Advanced Result Format Control | ||
| Fine-tune the detail level of validation results: | ||
| ```bash | ||
| # Control partial unexpected list size | ||
| truthound check data.csv --rf basic --partial-unexpected-count 50 | ||
| # Include row indices for unexpected values | ||
| truthound check data.csv --rf summary --include-unexpected-index | ||
| # Include Polars debug query in results (for troubleshooting) | ||
| truthound check data.csv --rf complete --return-debug-query | ||
| # All fine-grained options combined | ||
| truthound check data.csv --rf complete \ | ||
| --include-unexpected-rows \ | ||
| --max-unexpected-rows 500 \ | ||
| --partial-unexpected-count 100 \ | ||
| --include-unexpected-index \ | ||
| --return-debug-query | ||
| ``` | ||
| ### Database Validation | ||
| Validate data directly from a database connection: | ||
| ```bash | ||
| # Validate a PostgreSQL table | ||
| truthound check --connection "postgresql://user:pass@host/db" --table users | ||
| # Validate with a SQL query | ||
| truthound check --connection "sqlite:///data.db" --query "SELECT * FROM orders WHERE status = 'active'" | ||
| # Validate using a source config file | ||
| truthound check --source-config db_config.yaml --strict | ||
| # Combine with other options | ||
| truthound check --connection "postgresql://user:pass@host/db" --table users \ | ||
| -v null,unique --rf summary --strict | ||
| ``` | ||
| ### Query Pushdown | ||
| For SQL data sources, enable server-side validation: | ||
| ```bash | ||
| # Auto-detect pushdown capability | ||
| truthound check data.csv --pushdown | ||
| # Explicitly disable pushdown | ||
| truthound check data.csv --no-pushdown | ||
| ``` | ||
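To make "server-side validation" concrete: instead of fetching every row and checking it client-side, a pushdown-capable validator can be expressed as a single aggregate query that the database executes. The SQL below is a conceptual illustration, not the query truthound actually generates.

```python
# Hypothetical pushdown translation for a null-count check on users.email.
table, column = "users", "email"

# Without pushdown: fetch all rows, then count nulls in Polars client-side.
# With pushdown: ship one aggregate to the server and fetch a single number.
pushed = f"SELECT COUNT(*) AS null_count FROM {table} WHERE {column} IS NULL"
print(pushed)
```

The payoff is that only the aggregate result crosses the wire, which is why pushdown is auto-detected and enabled for SQL sources by default.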
| ### Exception Handling (VE-5) | ||
@@ -151,0 +244,0 @@ |
@@ -18,2 +18,9 @@ # truthound compare | ||
| ## Data Source Options | ||
| | Option | Short | Default | Description | | ||
| |--------|-------|---------|-------------| | ||
| | `--source-config` | `--sc` | None | Path to data source config file (JSON/YAML) for dual-source comparison | | ||
| | `--source-name` | | None | Custom label for the data source | | ||
| ## Options | ||
@@ -26,2 +33,3 @@ | ||
| | `--threshold` | `-t` | Auto | Custom drift threshold | | ||
| | `--sample-size` | `--sample` | None | Sample size for large datasets (random sampling) | | ||
| | `--format` | `-f` | `console` | Output format (console, json) | | ||
@@ -133,2 +141,15 @@ | `--output` | `-o` | None | Output file path | | ||
| ### Large Dataset Sampling | ||
| For large datasets, use sampling for faster comparison: | ||
| ```bash | ||
| # Sample 10,000 rows from each dataset | ||
| truthound compare big_train.csv big_prod.csv --sample-size 10000 | ||
| # Combine with method and threshold | ||
| truthound compare large_baseline.parquet large_current.parquet \ | ||
| --sample-size 50000 --method psi --threshold 0.15 | ||
| ``` | ||
| ### Custom Threshold | ||
@@ -135,0 +156,0 @@ |
@@ -14,2 +14,3 @@ # Core Commands | ||
| | [`profile`](profile.md) | Generate data profile | Data exploration | | ||
| | [`read`](read.md) | Read and preview data | Data inspection | | ||
| | [`compare`](compare.md) | Detect data drift | Model monitoring | | ||
@@ -114,2 +115,27 @@ | ||
| ## Data Source Options | ||
| All core commands accept data source options for reading directly from databases instead of files. When using these options, the file argument becomes optional. | ||
| | Option | Short | Description | | ||
| |--------|-------|-------------| | ||
| | `--connection` | `--conn` | Database connection string (e.g., `postgresql://user:pass@host/db`) | | ||
| | `--table` | | Database table name | | ||
| | `--query` | | SQL query (alternative to `--table`) | | ||
| | `--source-config` | `--sc` | Path to a data source config file (JSON/YAML) | | ||
| | `--source-name` | | Custom label for the data source | | ||
| ```bash | ||
| # Validate a database table directly | ||
| truthound check --connection "postgresql://user:pass@host/db" --table users --strict | ||
| # Profile from a source config file | ||
| truthound profile --source-config prod_db.yaml | ||
| # Read and preview database data | ||
| truthound read --connection "sqlite:///data.db" --table orders --head 20 | ||
| ``` | ||
| For full details on connection string formats, config files, and security best practices, see the [CLI Data Source Guide](../../guides/datasources/cli-datasource-guide.md). | ||
| ## CI/CD Integration | ||
@@ -130,2 +156,3 @@ | ||
| - [read](read.md) - Read and preview data | ||
| - [learn](learn.md) - Learn schema from data | ||
@@ -132,0 +159,0 @@ - [check](check.md) - Validate data quality |
@@ -8,3 +8,3 @@ # truthound learn | ||
| ```bash | ||
| truthound learn <file> [OPTIONS] | ||
| truthound learn [FILE] [OPTIONS] | ||
| ``` | ||
@@ -16,4 +16,14 @@ | ||
| |----------|----------|-------------| | ||
| | `file` | Yes | Path to the data file (CSV, JSON, Parquet, NDJSON, JSONL) | | ||
| | `file` | No | Path to the data file (CSV, JSON, Parquet, NDJSON, JSONL) | | ||
| ## Data Source Options | ||
| | Option | Short | Default | Description | | ||
| |--------|-------|---------|-------------| | ||
| | `--connection` | `--conn` | None | Database connection string | | ||
| | `--table` | | None | Database table name | | ||
| | `--query` | | None | SQL query (alternative to `--table`) | | ||
| | `--source-config` | `--sc` | None | Path to data source config file (JSON/YAML) | | ||
| | `--source-name` | | None | Custom label for the data source | | ||
| ## Options | ||
@@ -25,2 +35,3 @@ | ||
| | `--no-constraints` | | `false` | Don't infer constraints from data | | ||
| | `--categorical-threshold` | | `20` | Maximum unique values to treat a column as categorical | | ||
@@ -73,2 +84,19 @@ ## Description | ||
| ### Categorical Threshold | ||
| Control when columns are treated as categorical: | ||
| ```bash | ||
| # Default: columns with <= 20 unique values are categorical | ||
| truthound learn data.csv | ||
| # Higher threshold: treat columns with up to 50 unique values as categorical | ||
| truthound learn data.csv --categorical-threshold 50 | ||
| # Lower threshold: only truly low-cardinality columns | ||
| truthound learn data.csv --categorical-threshold 5 | ||
| ``` | ||
| Columns classified as categorical will have `allowed_values` in the generated schema, enabling strict enum validation. | ||
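The threshold rule itself is simple: a column qualifies as categorical when its distinct non-null value count does not exceed the threshold. A rough stdlib-only sketch, independent of truthound's actual inference code (the function name and input shape are made up for illustration):

```python
def infer_allowed_values(
    columns: dict[str, list], threshold: int = 20
) -> dict[str, list]:
    """Collect allowed_values for columns whose unique count is <= threshold."""
    allowed: dict[str, list] = {}
    for name, values in columns.items():
        uniques = {v for v in values if v is not None}  # nulls don't count
        if len(uniques) <= threshold:
            allowed[name] = sorted(uniques)
    return allowed

data = {
    "status": ["active", "inactive", "active", "active"],
    "user_id": [101, 102, 103, 104],
}
# With threshold=3: "status" (2 uniques) qualifies, "user_id" (4 uniques) does not.
print(infer_allowed_values(data, threshold=3))
```

Raising `--categorical-threshold` therefore trades stricter enum validation for the risk of freezing high-cardinality columns to the values seen at learn time.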
| ### From Different File Formats | ||
@@ -75,0 +103,0 @@ |
@@ -8,3 +8,3 @@ # truthound mask | ||
| ```bash | ||
| truthound mask <file> -o <output> [OPTIONS] | ||
| truthound mask [FILE] -o <output> [OPTIONS] | ||
| ``` | ||
@@ -16,4 +16,14 @@ | ||
| |----------|----------|-------------| | ||
| | `file` | Yes | Path to the data file (CSV, JSON, Parquet, NDJSON, JSONL) | | ||
| | `file` | No | Path to the data file (CSV, JSON, Parquet, NDJSON, JSONL) | | ||
| ## Data Source Options | ||
| | Option | Short | Default | Description | | ||
| |--------|-------|---------|-------------| | ||
| | `--connection` | `--conn` | None | Database connection string | | ||
| | `--table` | | None | Database table name | | ||
| | `--query` | | None | SQL query (alternative to `--table`) | | ||
| | `--source-config` | `--sc` | None | Path to data source config file (JSON/YAML) | | ||
| | `--source-name` | | None | Custom label for the data source | | ||
| ## Options | ||
@@ -20,0 +30,0 @@ |
@@ -8,3 +8,3 @@ # truthound profile | ||
| ```bash | ||
| truthound profile <file> [OPTIONS] | ||
| truthound profile [FILE] [OPTIONS] | ||
| ``` | ||
@@ -16,4 +16,14 @@ | ||
| |----------|----------|-------------| | ||
| | `file` | Yes | Path to the data file (CSV, JSON, Parquet, NDJSON, JSONL) | | ||
| | `file` | No | Path to the data file (CSV, JSON, Parquet, NDJSON, JSONL) | | ||
| ## Data Source Options | ||
| | Option | Short | Default | Description | | ||
| |--------|-------|---------|-------------| | ||
| | `--connection` | `--conn` | None | Database connection string | | ||
| | `--table` | | None | Database table name | | ||
| | `--query` | | None | SQL query (alternative to `--table`) | | ||
| | `--source-config` | `--sc` | None | Path to data source config file (JSON/YAML) | | ||
| | `--source-name` | | None | Custom label for the data source | | ||
| ## Options | ||
@@ -20,0 +30,0 @@ |
@@ -8,3 +8,3 @@ # truthound scan | ||
| ```bash | ||
| truthound scan <file> [OPTIONS] | ||
| truthound scan [FILE] [OPTIONS] | ||
| ``` | ||
@@ -16,4 +16,14 @@ | ||
| |----------|----------|-------------| | ||
| | `file` | Yes | Path to the data file (CSV, JSON, Parquet, NDJSON, JSONL) | | ||
| | `file` | No | Path to the data file (CSV, JSON, Parquet, NDJSON, JSONL) | | ||
| ## Data Source Options | ||
| | Option | Short | Default | Description | | ||
| |--------|-------|---------|-------------| | ||
| | `--connection` | `--conn` | None | Database connection string | | ||
| | `--table` | | None | Database table name | | ||
| | `--query` | | None | SQL query (alternative to `--table`) | | ||
| | `--source-config` | `--sc` | None | Path to data source config file (JSON/YAML) | | ||
| | `--source-name` | | None | Custom label for the data source | | ||
| ## Options | ||
@@ -20,0 +30,0 @@ |
| Metadata-Version: 2.4 | ||
| Name: truthound | ||
| Version: 1.3.2 | ||
| Version: 1.5.0 | ||
| Summary: Zero-Configuration Data Quality Framework Powered by Polars | ||
@@ -145,3 +145,3 @@ Project-URL: Homepage, https://github.com/seadonggyun4/Truthound | ||
| | Test Cases | 8,585+ | | ||
| | Validators | 264 | | ||
| | Validators | 289 | | ||
| | Validator Categories | 28 | | ||
@@ -208,6 +208,17 @@ | VE Test Cases | 316 (Validation Engine Enhancement) | | ||
| truthound check data.csv --catch-exceptions --max-retries 2 # Resilient mode | ||
| truthound check data.csv --parallel --max-workers 8 # DAG parallel execution | ||
| truthound check data.csv --return-debug-query --rf complete # Debug query output | ||
| truthound compare baseline.csv current.csv # Drift detection | ||
| truthound compare big.csv new.csv --sample-size 10000 # Sampled comparison | ||
| truthound learn data.csv --categorical-threshold 50 # Custom threshold | ||
| truthound scan data.csv # PII scanning | ||
| truthound auto-profile data.csv # Profiling | ||
| truthound new validator my_validator # Code scaffolding | ||
| # Database connections (all core commands support --connection/--table) | ||
| truthound check --connection "postgresql://user:pass@host/db" --table users | ||
| truthound scan --connection "sqlite:///data.db" --table orders | ||
| truthound read --connection "postgresql://host/db" --table users --head 20 | ||
| truthound read data.csv --schema-only # Inspect schema | ||
| truthound compare --source-config drift.yaml # Dual-source drift detection | ||
| ``` | ||
@@ -223,9 +234,12 @@ | ||
| |---------|-------------|-------------| | ||
| | `learn` | Learn schema from data | `--output`, `--no-constraints` | | ||
| | `check` | Validate data quality | `--validators`, `--exclude-columns`, `--validator-config`, `--min-severity`, `--schema`, `--strict`, `--format`, `--rf`, `--catch-exceptions`, `--max-retries` | | ||
| | `read` | Read and preview data | `--head`, `--sample`, `--columns`, `--schema-only`, `--count-only`, `--format` | | ||
| | `learn` | Learn schema from data | `--output`, `--no-constraints`, `--categorical-threshold` | | ||
| | `check` | Validate data quality | `--validators`, `--exclude-columns`, `--validator-config`, `--min-severity`, `--schema`, `--strict`, `--format`, `--rf`, `--catch-exceptions`, `--max-retries`, `--parallel`, `--max-workers`, `--pushdown`, `--partial-unexpected-count`, `--return-debug-query`, `--include-unexpected-index` | | ||
| | `scan` | Scan for PII | `--format`, `--output` | | ||
| | `mask` | Mask sensitive data | `--columns`, `--strategy` (redact/hash/fake), `--strict` | | ||
| | `profile` | Generate data profile | `--format`, `--output` | | ||
| | `compare` | Detect data drift | `--method` (auto/ks/psi/chi2/js), `--threshold`, `--strict` | | ||
| | `compare` | Detect data drift | `--method` (14 methods), `--threshold`, `--sample-size`, `--strict` | | ||
| All core commands accept **Data Source Options**: `--connection`/`--conn`, `--table`, `--query`, `--source-config`/`--sc`, `--source-name` for database connectivity (PostgreSQL, MySQL, SQLite, DuckDB, SQL Server, etc.). | ||
| ### Profiler Commands | ||
@@ -232,0 +246,0 @@ |
@@ -7,3 +7,3 @@ [build-system] | ||
| name = "truthound" | ||
| version = "1.3.2" | ||
| version = "1.5.0" | ||
| description = "Zero-Configuration Data Quality Framework Powered by Polars" | ||
@@ -10,0 +10,0 @@ readme = "README.md" |
@@ -48,3 +48,3 @@ <div align="center"> | ||
| | Test Cases | 8,585+ | | ||
| | Validators | 264 | | ||
| | Validators | 289 | | ||
| | Validator Categories | 28 | | ||
@@ -111,6 +111,17 @@ | VE Test Cases | 316 (Validation Engine Enhancement) | | ||
| truthound check data.csv --catch-exceptions --max-retries 2 # Resilient mode | ||
| truthound check data.csv --parallel --max-workers 8 # DAG parallel execution | ||
| truthound check data.csv --return-debug-query --rf complete # Debug query output | ||
| truthound compare baseline.csv current.csv # Drift detection | ||
| truthound compare big.csv new.csv --sample-size 10000 # Sampled comparison | ||
| truthound learn data.csv --categorical-threshold 50 # Custom threshold | ||
| truthound scan data.csv # PII scanning | ||
| truthound auto-profile data.csv # Profiling | ||
| truthound new validator my_validator # Code scaffolding | ||
| # Database connections (all core commands support --connection/--table) | ||
| truthound check --connection "postgresql://user:pass@host/db" --table users | ||
| truthound scan --connection "sqlite:///data.db" --table orders | ||
| truthound read --connection "postgresql://host/db" --table users --head 20 | ||
| truthound read data.csv --schema-only # Inspect schema | ||
| truthound compare --source-config drift.yaml # Dual-source drift detection | ||
| ``` | ||
@@ -126,9 +137,12 @@ | ||
| |---------|-------------|-------------| | ||
| | `learn` | Learn schema from data | `--output`, `--no-constraints` | | ||
| | `check` | Validate data quality | `--validators`, `--exclude-columns`, `--validator-config`, `--min-severity`, `--schema`, `--strict`, `--format`, `--rf`, `--catch-exceptions`, `--max-retries` | | ||
| | `read` | Read and preview data | `--head`, `--sample`, `--columns`, `--schema-only`, `--count-only`, `--format` | | ||
| | `learn` | Learn schema from data | `--output`, `--no-constraints`, `--categorical-threshold` | | ||
| | `check` | Validate data quality | `--validators`, `--exclude-columns`, `--validator-config`, `--min-severity`, `--schema`, `--strict`, `--format`, `--rf`, `--catch-exceptions`, `--max-retries`, `--parallel`, `--max-workers`, `--pushdown`, `--partial-unexpected-count`, `--return-debug-query`, `--include-unexpected-index` | | ||
| | `scan` | Scan for PII | `--format`, `--output` | | ||
| | `mask` | Mask sensitive data | `--columns`, `--strategy` (redact/hash/fake), `--strict` | | ||
| | `profile` | Generate data profile | `--format`, `--output` | | ||
| | `compare` | Detect data drift | `--method` (auto/ks/psi/chi2/js), `--threshold`, `--strict` | | ||
| | `compare` | Detect data drift | `--method` (14 methods), `--threshold`, `--sample-size`, `--strict` | | ||
| All core commands accept **Data Source Options**: `--connection`/`--conn`, `--table`, `--query`, `--source-config`/`--sc`, `--source-name` for database connectivity (PostgreSQL, MySQL, SQLite, DuckDB, SQL Server, etc.). | ||
| ### Profiler Commands | ||
@@ -135,0 +149,0 @@ |
@@ -59,3 +59,7 @@ """CLI error handling utilities. | ||
| # DataSource errors (55-59) | ||
| DATASOURCE_ERROR = 55 | ||
| DATASOURCE_CONNECTION_ERROR = 56 | ||
| # ============================================================================= | ||
@@ -226,2 +230,26 @@ # Exception Classes | ||
| class DataSourceError(CLIError): | ||
| """Error with data source connection or configuration.""" | ||
| def __init__( | ||
| self, | ||
| message: str, | ||
| source_type: str | None = None, | ||
| hint: str | None = None, | ||
| ) -> None: | ||
| """Initialize data source error. | ||
| Args: | ||
| message: Error message | ||
| source_type: Type of data source (e.g., "postgresql", "mysql") | ||
| hint: Resolution hint | ||
| """ | ||
| super().__init__( | ||
| message=message, | ||
| code=ErrorCode.DATASOURCE_ERROR, | ||
| details={"source_type": source_type} if source_type else {}, | ||
| hint=hint or "Check the connection string, credentials, and table name.", | ||
| ) | ||
| # ============================================================================= | ||
@@ -228,0 +256,0 @@ # Error Handler |
@@ -340,3 +340,92 @@ """Reusable CLI options and arguments. | ||
| # Parallel execution (DAG-based) | ||
| ParallelOpt = Annotated[ | ||
| bool, | ||
| typer.Option( | ||
| "--parallel/--no-parallel", | ||
| help=( | ||
| "Enable DAG-based parallel execution. " | ||
| "Validators are grouped by dependency level and executed concurrently." | ||
| ), | ||
| ), | ||
| ] | ||
| # Max workers for parallel execution | ||
| MaxWorkersOpt = Annotated[ | ||
| int | None, | ||
| typer.Option( | ||
| "--max-workers", | ||
| help=( | ||
| "Maximum worker threads for parallel execution. " | ||
| "Only effective with --parallel. " | ||
| "Defaults to min(32, cpu_count + 4)." | ||
| ), | ||
| min=1, | ||
| ), | ||
| ] | ||
| # Query pushdown for SQL data sources | ||
| PushdownOpt = Annotated[ | ||
| bool | None, | ||
| typer.Option( | ||
| "--pushdown/--no-pushdown", | ||
| help=( | ||
| "Enable query pushdown for SQL data sources. " | ||
| "Validation logic is executed server-side when possible. " | ||
| "Default: auto-detect based on data source type." | ||
| ), | ||
| ), | ||
| ] | ||
| # Execution engine (experimental) | ||
| UseEngineOpt = Annotated[ | ||
| bool, | ||
| typer.Option( | ||
| "--use-engine/--no-use-engine", | ||
| help="Use execution engine for validation (experimental).", | ||
| ), | ||
| ] | ||
| # Partial unexpected count | ||
| PartialUnexpectedCountOpt = Annotated[ | ||
| int, | ||
| typer.Option( | ||
| "--partial-unexpected-count", | ||
| help="Maximum number of unexpected values in partial list (BASIC+).", | ||
| min=0, | ||
| ), | ||
| ] | ||
| # Include unexpected index | ||
| IncludeUnexpectedIndexOpt = Annotated[ | ||
| bool, | ||
| typer.Option( | ||
| "--include-unexpected-index", | ||
| help="Include row index for each unexpected value in results.", | ||
| ), | ||
| ] | ||
| # Return debug query | ||
| ReturnDebugQueryOpt = Annotated[ | ||
| bool, | ||
| typer.Option( | ||
| "--return-debug-query", | ||
| help="Include Polars debug query expression in results (COMPLETE level).", | ||
| ), | ||
| ] | ||
| # Categorical threshold for schema learning | ||
| CategoricalThresholdOpt = Annotated[ | ||
| int, | ||
| typer.Option( | ||
| "--categorical-threshold", | ||
| help=( | ||
| "Maximum unique values to treat a column as categorical " | ||
| "during schema inference." | ||
| ), | ||
| min=1, | ||
| ), | ||
| ] | ||
| # ============================================================================= | ||
@@ -343,0 +432,0 @@ # Option Groups (for related options) |
| """Core CLI commands for Truthound. | ||
| This package contains the fundamental CLI commands: | ||
| - read: Read and preview data | ||
| - learn: Learn schema from data files | ||
@@ -14,2 +15,3 @@ - check: Validate data quality | ||
| from truthound.cli_modules.core.read import read_cmd | ||
| from truthound.cli_modules.core.learn import learn_cmd | ||
@@ -32,2 +34,3 @@ from truthound.cli_modules.core.check import check_cmd | ||
| """ | ||
| parent_app.command(name="read")(read_cmd) | ||
| parent_app.command(name="learn")(learn_cmd) | ||
@@ -44,2 +47,3 @@ parent_app.command(name="check")(check_cmd) | ||
| "register_commands", | ||
| "read_cmd", | ||
| "learn_cmd", | ||
@@ -46,0 +50,0 @@ "check_cmd", |
| """Check command - Validate data quality. | ||
| This module implements the `truthound check` command for validating | ||
| data quality in files. | ||
| This module implements the ``truthound check`` command for validating | ||
| data quality in files and database tables. | ||
| """ | ||
@@ -14,2 +14,10 @@ | ||
| from truthound.cli_modules.common.datasource import ( | ||
| ConnectionOpt, | ||
| QueryOpt, | ||
| SourceConfigOpt, | ||
| SourceNameOpt, | ||
| TableOpt, | ||
| resolve_datasource, | ||
| ) | ||
| from truthound.cli_modules.common.errors import error_boundary, require_file | ||
@@ -22,5 +30,14 @@ from truthound.cli_modules.common.options import parse_list_callback | ||
| file: Annotated[ | ||
| Path, | ||
| typer.Argument(help="Path to the data file"), | ||
| ], | ||
| Optional[Path], | ||
| typer.Argument( | ||
| help="Path to the data file (CSV, JSON, Parquet, NDJSON)", | ||
| ), | ||
| ] = None, | ||
| # -- DataSource Options -- | ||
| connection: ConnectionOpt = None, | ||
| table: TableOpt = None, | ||
| query: QueryOpt = None, | ||
| source_config: SourceConfigOpt = None, | ||
| source_name: SourceNameOpt = None, | ||
| # -- Validator Options -- | ||
| validators: Annotated[ | ||
@@ -74,2 +91,23 @@ Optional[list[str]], | ||
| ] = 1000, | ||
| partial_unexpected_count: Annotated[ | ||
| int, | ||
| typer.Option( | ||
| "--partial-unexpected-count", | ||
| help="Maximum number of unexpected values in partial list (BASIC+)", | ||
| ), | ||
| ] = 20, | ||
| include_unexpected_index: Annotated[ | ||
| bool, | ||
| typer.Option( | ||
| "--include-unexpected-index", | ||
| help="Include row index for each unexpected value in results", | ||
| ), | ||
| ] = False, | ||
| return_debug_query: Annotated[ | ||
| bool, | ||
| typer.Option( | ||
| "--return-debug-query", | ||
| help="Include Polars debug query expression in results (COMPLETE level)", | ||
| ), | ||
| ] = False, | ||
| catch_exceptions: Annotated[ | ||
@@ -109,7 +147,48 @@ bool, | ||
| ] = None, | ||
| # -- Execution Options -- | ||
| parallel: Annotated[ | ||
| bool, | ||
| typer.Option( | ||
| "--parallel/--no-parallel", | ||
| help=( | ||
| "Enable DAG-based parallel execution. " | ||
| "Validators are grouped by dependency level and executed concurrently." | ||
| ), | ||
| ), | ||
| ] = False, | ||
| max_workers: Annotated[ | ||
| Optional[int], | ||
| typer.Option( | ||
| "--max-workers", | ||
| help=( | ||
| "Maximum worker threads for parallel execution. " | ||
| "Only effective with --parallel. Defaults to min(32, cpu_count + 4)." | ||
| ), | ||
| min=1, | ||
| ), | ||
| ] = None, | ||
| pushdown: Annotated[ | ||
| Optional[bool], | ||
| typer.Option( | ||
| "--pushdown/--no-pushdown", | ||
| help=( | ||
| "Enable query pushdown for SQL data sources. " | ||
| "Validation logic is executed server-side when possible. " | ||
| "Default: auto-detect based on data source type." | ||
| ), | ||
| ), | ||
| ] = None, | ||
| use_engine: Annotated[ | ||
| bool, | ||
| typer.Option( | ||
| "--use-engine/--no-use-engine", | ||
| help="Use execution engine for validation (experimental).", | ||
| ), | ||
| ] = False, | ||
| ) -> None: | ||
| """Validate data quality in a file. | ||
| """Validate data quality in a file or database table. | ||
| This command runs data quality validators on the specified file and | ||
| reports any issues found. | ||
| This command runs data quality validators on the specified data | ||
| and reports any issues found. Supports file paths, database | ||
| connections, and source config files. | ||
@@ -123,9 +202,7 @@ Examples: | ||
| truthound check data.csv --result-format complete | ||
| truthound check data.csv --rf boolean_only | ||
| truthound check data.csv --no-catch-exceptions | ||
| truthound check data.csv --max-retries 3 | ||
| truthound check data.csv --show-exceptions --format json | ||
| truthound check --connection "postgresql://user:pass@host/db" --table users | ||
| truthound check --conn "sqlite:///data.db" --table orders --pushdown | ||
| truthound check --source-config db.yaml --strict | ||
| truthound check data.csv --parallel --max-workers 8 | ||
| truthound check data.csv --exclude-columns first_name,last_name | ||
| truthound check data.csv --validator-config '{"unique": {"exclude_columns": ["first_name"]}}' | ||
| truthound check data.csv --validator-config config.json | ||
| """ | ||
@@ -135,4 +212,12 @@ from truthound.api import check | ||
| # Validate files exist | ||
| require_file(file) | ||
| # Resolve data source | ||
| data_path, source = resolve_datasource( | ||
| file=file, | ||
| connection=connection, | ||
| table=table, | ||
| query=query, | ||
| source_config=source_config, | ||
| source_name=source_name, | ||
| ) | ||
| if schema_file: | ||
@@ -192,24 +277,45 @@ require_file(schema_file, "Schema file") | ||
| # Build result_format config | ||
| rf_config: str | ResultFormatConfig = result_format | ||
| if include_unexpected_rows or max_unexpected_rows != 1000: | ||
| # Build result_format config — include all fine-grained parameters | ||
| has_custom_rf = ( | ||
| include_unexpected_rows | ||
| or max_unexpected_rows != 1000 | ||
| or partial_unexpected_count != 20 | ||
| or include_unexpected_index | ||
| or return_debug_query | ||
| ) | ||
| rf_config: str | ResultFormatConfig | ||
| if has_custom_rf: | ||
| rf_config = ResultFormatConfig( | ||
| format=ResultFormat.from_string(result_format), | ||
| partial_unexpected_count=partial_unexpected_count, | ||
| include_unexpected_rows=include_unexpected_rows, | ||
| max_unexpected_rows=max_unexpected_rows, | ||
| include_unexpected_index=include_unexpected_index, | ||
| return_debug_query=return_debug_query, | ||
| ) | ||
| else: | ||
| rf_config = result_format | ||
| # Build API call kwargs | ||
| check_kwargs: dict[str, Any] = { | ||
| "validators": validator_list, | ||
| "validator_config": v_config, | ||
| "min_severity": min_severity, | ||
| "schema": schema_file, | ||
| "auto_schema": auto_schema, | ||
| "result_format": rf_config, | ||
| "catch_exceptions": catch_exceptions, | ||
| "max_retries": max_retries, | ||
| "exclude_columns": exclude_cols, | ||
| "parallel": parallel, | ||
| "max_workers": max_workers, | ||
| "pushdown": pushdown, | ||
| "use_engine": use_engine, | ||
| } | ||
| try: | ||
| report = check( | ||
| str(file), | ||
| validators=validator_list, | ||
| validator_config=v_config, | ||
| min_severity=min_severity, | ||
| schema=schema_file, | ||
| auto_schema=auto_schema, | ||
| result_format=rf_config, | ||
| catch_exceptions=catch_exceptions, | ||
| max_retries=max_retries, | ||
| exclude_columns=exclude_cols, | ||
| ) | ||
| if source is not None: | ||
| report = check(source=source, **check_kwargs) | ||
| else: | ||
| report = check(data_path, **check_kwargs) | ||
| except Exception as e: | ||
@@ -219,2 +325,5 @@ typer.echo(f"Error: {e}", err=True) | ||
| # Determine label for HTML report title | ||
| report_label = source_name or (source.name if source else str(file)) | ||
| # Output the report | ||
@@ -236,3 +345,3 @@ if format == "json": | ||
| html = generate_html_report(report, title=f"Validation Report: {file.name}") | ||
| html = generate_html_report(report, title=f"Validation Report: {report_label}") | ||
| output.write_text(html, encoding="utf-8") | ||
@@ -239,0 +348,0 @@ typer.echo(f"HTML report written to {output}") |
| """Compare command - Compare datasets for drift. | ||
| This module implements the `truthound compare` command for detecting | ||
| data drift between two datasets. | ||
| This module implements the ``truthound compare`` command for detecting | ||
| data drift between two datasets from files or database tables. | ||
| """ | ||
@@ -14,3 +14,7 @@ | ||
| from truthound.cli_modules.common.errors import error_boundary, require_file | ||
| from truthound.cli_modules.common.datasource import ( | ||
| SourceConfigOpt, | ||
| resolve_compare_sources, | ||
| ) | ||
| from truthound.cli_modules.common.errors import error_boundary | ||
| from truthound.cli_modules.common.options import parse_list_callback | ||
@@ -22,9 +26,16 @@ | ||
| baseline: Annotated[ | ||
| Path, | ||
| typer.Argument(help="Baseline (reference) data file"), | ||
| ], | ||
| Optional[Path], | ||
| typer.Argument( | ||
| help="Baseline (reference) data file", | ||
| ), | ||
| ] = None, | ||
| current: Annotated[ | ||
| Path, | ||
| typer.Argument(help="Current data file to compare"), | ||
| ], | ||
| Optional[Path], | ||
| typer.Argument( | ||
| help="Current data file to compare", | ||
| ), | ||
| ] = None, | ||
| # -- DataSource Config (for database-to-database comparison) -- | ||
| source_config: SourceConfigOpt = None, | ||
| # -- Compare Options -- | ||
| columns: Annotated[ | ||
@@ -36,3 +47,10 @@ Optional[list[str]], | ||
| str, | ||
| typer.Option("--method", "-m", help="Detection method (auto, ks, psi, chi2, js)"), | ||
| typer.Option( | ||
| "--method", | ||
| "-m", | ||
| help=( | ||
| "Detection method: auto, ks, psi, chi2, js, kl, wasserstein, " | ||
| "cvm, anderson, hellinger, bhattacharyya, tv, energy, mmd" | ||
| ), | ||
| ), | ||
| ] = "auto", | ||
@@ -43,2 +61,11 @@ threshold: Annotated[ | ||
| ] = None, | ||
| sample_size: Annotated[ | ||
| Optional[int], | ||
| typer.Option( | ||
| "--sample-size", | ||
| "--sample", | ||
| help="Sample size for large datasets. Uses random sampling for faster comparison.", | ||
| min=1, | ||
| ), | ||
| ] = None, | ||
| format: Annotated[ | ||
@@ -60,11 +87,29 @@ str, | ||
| This command compares a baseline dataset with a current dataset and | ||
| detects statistical drift in column distributions. | ||
| detects statistical drift in column distributions. Supports file | ||
| paths or a --source-config for database-to-database comparison. | ||
| Detection Methods: | ||
| - auto: Automatically select best method per column | ||
| - auto: Automatically select best method per column (recommended) | ||
| - ks: Kolmogorov-Smirnov test (numeric) | ||
| - psi: Population Stability Index | ||
| - psi: Population Stability Index (ML monitoring) | ||
| - chi2: Chi-squared test (categorical) | ||
| - js: Jensen-Shannon divergence | ||
| - js: Jensen-Shannon divergence (any type) | ||
| - kl: Kullback-Leibler divergence (numeric) | ||
| - wasserstein: Earth Mover's distance (numeric) | ||
| - cvm: Cramer-von Mises test (numeric, tail-sensitive) | ||
| - anderson: Anderson-Darling test (numeric, extreme values) | ||
| - hellinger: Hellinger distance (bounded metric) | ||
| - bhattacharyya: Bhattacharyya distance (classification bounds) | ||
| - tv: Total Variation distance (max probability diff) | ||
| - energy: Energy distance (location/scale) | ||
| - mmd: Maximum Mean Discrepancy (high-dimensional) | ||
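Among the methods listed above, PSI is simple enough to sketch standalone. The following is an illustrative implementation using only NumPy, not truthound's actual `compare` internals; note that in this sketch, current values falling outside the baseline's range are dropped by `np.histogram`:

```python
import numpy as np

def psi(baseline: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between two numeric samples (illustrative sketch)."""
    # Bin edges are derived from the baseline distribution
    edges = np.histogram_bin_edges(baseline, bins=bins)
    b_frac = np.histogram(baseline, bins=edges)[0] / len(baseline)
    c_frac = np.histogram(current, bins=edges)[0] / len(current)
    # Floor the fractions to avoid log(0) and division by zero
    b_frac = np.clip(b_frac, 1e-6, None)
    c_frac = np.clip(c_frac, 1e-6, None)
    return float(np.sum((c_frac - b_frac) * np.log(c_frac / b_frac)))
```

A common rule of thumb treats PSI below 0.1 as stable and above 0.25 as significant drift.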
| Source Config Format (YAML): | ||
| baseline: | ||
| connection: "postgresql://user:pass@host/db" | ||
| table: train_data | ||
| current: | ||
| connection: "postgresql://user:pass@host/db" | ||
| table: production_data | ||
| Examples: | ||
@@ -75,8 +120,15 @@ truthound compare baseline.csv current.csv | ||
| truthound compare old.csv new.csv --columns price,quantity | ||
| truthound compare --source-config drift_config.yaml --method ks | ||
| truthound compare big_train.csv big_prod.csv --sample-size 10000 | ||
| """ | ||
| from truthound.drift import compare | ||
| # Validate files exist | ||
| require_file(baseline, "Baseline file") | ||
| require_file(current, "Current file") | ||
| # Resolve both data sources | ||
| (baseline_path, baseline_source), (current_path, current_source) = ( | ||
| resolve_compare_sources( | ||
| baseline=baseline, | ||
| current=current, | ||
| source_config=source_config, | ||
| ) | ||
| ) | ||
@@ -86,9 +138,22 @@ # Parse columns if provided | ||
| # Determine inputs for the compare API | ||
| baseline_input = ( | ||
| baseline_source.to_polars_lazyframe().collect() | ||
| if baseline_source | ||
| else baseline_path | ||
| ) | ||
| current_input = ( | ||
| current_source.to_polars_lazyframe().collect() | ||
| if current_source | ||
| else current_path | ||
| ) | ||
| try: | ||
| drift_report = compare( | ||
| str(baseline), | ||
| str(current), | ||
| baseline_input, | ||
| current_input, | ||
| columns=column_list, | ||
| method=method, | ||
| threshold=threshold, | ||
| sample_size=sample_size, | ||
| ) | ||
@@ -95,0 +160,0 @@ except Exception as e: |
@@ -1,5 +0,5 @@ | ||
| """Learn command - Learn schema from data files. | ||
| """Learn command - Learn schema from data. | ||
| This module implements the `truthound learn` command for inferring | ||
| schema from data files. | ||
| This module implements the ``truthound learn`` command for inferring | ||
| schema from data files and database tables. | ||
| """ | ||
@@ -10,7 +10,15 @@ | ||
| from pathlib import Path | ||
| from typing import Annotated | ||
| from typing import Annotated, Optional | ||
| import typer | ||
| from truthound.cli_modules.common.errors import error_boundary, require_file | ||
| from truthound.cli_modules.common.datasource import ( | ||
| ConnectionOpt, | ||
| QueryOpt, | ||
| SourceConfigOpt, | ||
| SourceNameOpt, | ||
| TableOpt, | ||
| resolve_datasource, | ||
| ) | ||
| from truthound.cli_modules.common.errors import error_boundary | ||
@@ -21,5 +29,14 @@ | ||
| file: Annotated[ | ||
| Path, | ||
| typer.Argument(help="Path to the data file to learn from"), | ||
| ], | ||
| Optional[Path], | ||
| typer.Argument( | ||
| help="Path to the data file to learn from", | ||
| ), | ||
| ] = None, | ||
| # -- DataSource Options -- | ||
| connection: ConnectionOpt = None, | ||
| table: TableOpt = None, | ||
| query: QueryOpt = None, | ||
| source_config: SourceConfigOpt = None, | ||
| source_name: SourceNameOpt = None, | ||
| # -- Schema Options -- | ||
| output: Annotated[ | ||
@@ -33,6 +50,17 @@ Path, | ||
| ] = False, | ||
| categorical_threshold: Annotated[ | ||
| int, | ||
| typer.Option( | ||
| "--categorical-threshold", | ||
| help=( | ||
| "Maximum unique values to treat a column as categorical " | ||
| "during schema inference (default: 20)" | ||
| ), | ||
| min=1, | ||
| ), | ||
| ] = 20, | ||
| ) -> None: | ||
| """Learn schema from a data file. | ||
| """Learn schema from a data file or database table. | ||
| This command analyzes the data file and generates a schema definition | ||
| This command analyzes the data and generates a schema definition | ||
| that captures column types, constraints, and patterns. | ||
@@ -44,10 +72,31 @@ | ||
| truthound learn data.csv --no-constraints | ||
| truthound learn data.csv --categorical-threshold 50 | ||
| truthound learn --connection "postgresql://user:pass@host/db" --table users | ||
| truthound learn --source-config db.yaml -o db_schema.yaml | ||
| """ | ||
| from truthound.schema import learn | ||
| # Validate file exists | ||
| require_file(file) | ||
| # Resolve data source | ||
| data_path, source = resolve_datasource( | ||
| file=file, | ||
| connection=connection, | ||
| table=table, | ||
| query=query, | ||
| source_config=source_config, | ||
| source_name=source_name, | ||
| ) | ||
| try: | ||
| schema = learn(str(file), infer_constraints=not no_constraints) | ||
| if source is not None: | ||
| schema = learn( | ||
| source=source, | ||
| infer_constraints=not no_constraints, | ||
| categorical_threshold=categorical_threshold, | ||
| ) | ||
| else: | ||
| schema = learn( | ||
| data_path, | ||
| infer_constraints=not no_constraints, | ||
| categorical_threshold=categorical_threshold, | ||
| ) | ||
| schema.save(output) | ||
@@ -54,0 +103,0 @@ |
| """Mask command - Mask sensitive data. | ||
| This module implements the `truthound mask` command for masking | ||
| sensitive data in files. | ||
| This module implements the ``truthound mask`` command for masking | ||
| sensitive data in files and database tables. | ||
| """ | ||
@@ -14,3 +14,11 @@ | ||
| from truthound.cli_modules.common.errors import error_boundary, require_file | ||
| from truthound.cli_modules.common.datasource import ( | ||
| ConnectionOpt, | ||
| QueryOpt, | ||
| SourceConfigOpt, | ||
| SourceNameOpt, | ||
| TableOpt, | ||
| resolve_datasource, | ||
| ) | ||
| from truthound.cli_modules.common.errors import error_boundary | ||
| from truthound.cli_modules.common.options import parse_list_callback | ||
@@ -22,9 +30,18 @@ | ||
| file: Annotated[ | ||
| Path, | ||
| typer.Argument(help="Path to the data file"), | ||
| ], | ||
| Optional[Path], | ||
| typer.Argument( | ||
| help="Path to the data file (CSV, JSON, Parquet, NDJSON)", | ||
| ), | ||
| ] = None, | ||
| # -- DataSource Options -- | ||
| connection: ConnectionOpt = None, | ||
| table: TableOpt = None, | ||
| query: QueryOpt = None, | ||
| source_config: SourceConfigOpt = None, | ||
| source_name: SourceNameOpt = None, | ||
| # -- Mask Options -- | ||
| output: Annotated[ | ||
| Path, | ||
| typer.Option("--output", "-o", help="Output file path"), | ||
| ], | ||
| ] = ..., | ||
| columns: Annotated[ | ||
@@ -46,5 +63,5 @@ Optional[list[str]], | ||
| ) -> None: | ||
| """Mask sensitive data in a file. | ||
| """Mask sensitive data in a file or database table. | ||
| This command creates a copy of the data file with sensitive columns | ||
| This command creates a copy of the data with sensitive columns | ||
| masked using the specified strategy. | ||
@@ -61,3 +78,4 @@ | ||
| truthound mask data.csv -o masked.csv --strategy hash | ||
| truthound mask data.csv -o masked.csv --columns email --strict | ||
| truthound mask --connection "postgresql://user:pass@host/db" --table users -o masked.csv | ||
| truthound mask --source-config db.yaml -o masked.parquet | ||
| """ | ||
@@ -68,4 +86,11 @@ import warnings | ||
| # Validate file exists | ||
| require_file(file) | ||
| # Resolve data source | ||
| data_path, source = resolve_datasource( | ||
| file=file, | ||
| connection=connection, | ||
| table=table, | ||
| query=query, | ||
| source_config=source_config, | ||
| source_name=source_name, | ||
| ) | ||
@@ -79,3 +104,6 @@ # Parse columns if provided | ||
| warnings.simplefilter("always", MaskingWarning) | ||
| masked_df = mask(str(file), columns=column_list, strategy=strategy, strict=strict) | ||
| if source is not None: | ||
| masked_df = mask(source=source, columns=column_list, strategy=strategy, strict=strict) | ||
| else: | ||
| masked_df = mask(data_path, columns=column_list, strategy=strategy, strict=strict) | ||
@@ -82,0 +110,0 @@ # Display any warnings |
| """Profile command - Generate data profiles. | ||
| This module implements the `truthound profile` command for generating | ||
| statistical profiles of data files. | ||
| This module implements the ``truthound profile`` command for generating | ||
| statistical profiles of data files and database tables. | ||
| """ | ||
@@ -14,3 +14,11 @@ | ||
| from truthound.cli_modules.common.errors import error_boundary, require_file | ||
| from truthound.cli_modules.common.datasource import ( | ||
| ConnectionOpt, | ||
| QueryOpt, | ||
| SourceConfigOpt, | ||
| SourceNameOpt, | ||
| TableOpt, | ||
| resolve_datasource, | ||
| ) | ||
| from truthound.cli_modules.common.errors import error_boundary | ||
@@ -21,5 +29,14 @@ | ||
| file: Annotated[ | ||
| Path, | ||
| typer.Argument(help="Path to the data file"), | ||
| ], | ||
| Optional[Path], | ||
| typer.Argument( | ||
| help="Path to the data file (CSV, JSON, Parquet, NDJSON)", | ||
| ), | ||
| ] = None, | ||
| # -- DataSource Options -- | ||
| connection: ConnectionOpt = None, | ||
| table: TableOpt = None, | ||
| query: QueryOpt = None, | ||
| source_config: SourceConfigOpt = None, | ||
| source_name: SourceNameOpt = None, | ||
| # -- Output Options -- | ||
| format: Annotated[ | ||
@@ -36,3 +53,3 @@ str, | ||
| This command analyzes the data file and generates statistics including: | ||
| This command analyzes the data and generates statistics including: | ||
| - Row and column counts | ||
@@ -48,10 +65,22 @@ - Null ratios per column | ||
| truthound profile data.csv -o profile.json | ||
| truthound profile --connection "postgresql://user:pass@host/db" --table users | ||
| truthound profile --source-config db.yaml --format json | ||
| """ | ||
| from truthound.api import profile | ||
| # Validate file exists | ||
| require_file(file) | ||
| # Resolve data source | ||
| data_path, source = resolve_datasource( | ||
| file=file, | ||
| connection=connection, | ||
| table=table, | ||
| query=query, | ||
| source_config=source_config, | ||
| source_name=source_name, | ||
| ) | ||
| try: | ||
| profile_report = profile(str(file)) | ||
| if source is not None: | ||
| profile_report = profile(source=source) | ||
| else: | ||
| profile_report = profile(data_path) | ||
| except Exception as e: | ||
@@ -58,0 +87,0 @@ typer.echo(f"Error: {e}", err=True) |
| """Scan command - Scan for PII. | ||
| This module implements the `truthound scan` command for detecting | ||
| personally identifiable information in data files. | ||
| This module implements the ``truthound scan`` command for detecting | ||
| personally identifiable information in data files and database tables. | ||
| """ | ||
@@ -14,3 +14,11 @@ | ||
| from truthound.cli_modules.common.errors import error_boundary, require_file | ||
| from truthound.cli_modules.common.datasource import ( | ||
| ConnectionOpt, | ||
| QueryOpt, | ||
| SourceConfigOpt, | ||
| SourceNameOpt, | ||
| TableOpt, | ||
| resolve_datasource, | ||
| ) | ||
| from truthound.cli_modules.common.errors import error_boundary | ||
@@ -21,5 +29,14 @@ | ||
| file: Annotated[ | ||
| Path, | ||
| typer.Argument(help="Path to the data file"), | ||
| ], | ||
| Optional[Path], | ||
| typer.Argument( | ||
| help="Path to the data file (CSV, JSON, Parquet, NDJSON)", | ||
| ), | ||
| ] = None, | ||
| # -- DataSource Options -- | ||
| connection: ConnectionOpt = None, | ||
| table: TableOpt = None, | ||
| query: QueryOpt = None, | ||
| source_config: SourceConfigOpt = None, | ||
| source_name: SourceNameOpt = None, | ||
| # -- Output Options -- | ||
| format: Annotated[ | ||
@@ -36,3 +53,3 @@ str, | ||
| This command analyzes data files to detect columns that may contain | ||
| This command analyzes data to detect columns that may contain | ||
| PII such as names, emails, phone numbers, SSNs, etc. | ||
@@ -44,11 +61,22 @@ | ||
| truthound scan data.csv -o pii_report.json | ||
| truthound scan data.csv --format html -o pii_report.html | ||
| truthound scan --connection "postgresql://user:pass@host/db" --table users | ||
| truthound scan --source-config db.yaml --format json | ||
| """ | ||
| from truthound.api import scan | ||
| # Validate file exists | ||
| require_file(file) | ||
| # Resolve data source | ||
| data_path, source = resolve_datasource( | ||
| file=file, | ||
| connection=connection, | ||
| table=table, | ||
| query=query, | ||
| source_config=source_config, | ||
| source_name=source_name, | ||
| ) | ||
| try: | ||
| pii_report = scan(str(file)) | ||
| if source is not None: | ||
| pii_report = scan(source=source) | ||
| else: | ||
| pii_report = scan(data_path) | ||
| except Exception as e: | ||
@@ -73,4 +101,5 @@ typer.echo(f"Error: {e}", err=True) | ||
| report_label = source_name or (source.name if source else str(file)) | ||
| html = generate_pii_html_report( | ||
| pii_report, title=f"PII Scan Report: {file.name}" | ||
| pii_report, title=f"PII Scan Report: {report_label}" | ||
| ) | ||
@@ -77,0 +106,0 @@ output.write_text(html, encoding="utf-8") |
@@ -190,3 +190,13 @@ """Factory functions for creating data sources. | ||
| if isinstance(data, str): | ||
| if data.startswith(("postgresql://", "postgres://")): | ||
| sql_prefixes = ( | ||
| "postgresql://", "postgres://", "mysql://", | ||
| "sqlite:", "duckdb:", "mssql://", "sqlserver://", | ||
| ) | ||
| sql_suffixes = (".db", ".duckdb") | ||
| is_sql = ( | ||
| data.startswith(sql_prefixes) | ||
| or data.endswith(sql_suffixes) | ||
| or "redshift.amazonaws.com" in data | ||
| ) | ||
| if is_sql: | ||
| table = kwargs.pop("table", None) | ||
@@ -197,18 +207,4 @@ if not table: | ||
| ) | ||
| from truthound.datasources.sql import PostgreSQLDataSource | ||
| return PostgreSQLDataSource.from_connection_string( | ||
| data, table=table, **kwargs | ||
| ) | ||
| return get_sql_datasource(data, table=table, **kwargs) | ||
| if data.startswith("mysql://"): | ||
| table = kwargs.pop("table", None) | ||
| if not table: | ||
| raise DataSourceError( | ||
| "SQL connection string requires 'table' parameter" | ||
| ) | ||
| from truthound.datasources.sql import MySQLDataSource | ||
| return MySQLDataSource.from_connection_string( | ||
| data, table=table, **kwargs | ||
| ) | ||
| # File doesn't exist | ||
@@ -262,2 +258,11 @@ if not path.exists(): | ||
| # SQLite: URI format (sqlite:///path) or file path (.db) | ||
| if connection_string.startswith("sqlite:"): | ||
| # sqlite:///path/to/db or sqlite:///:memory: | ||
| db_path = connection_string.replace("sqlite:///", "").replace("sqlite://", "") | ||
| if not db_path: | ||
| db_path = ":memory:" | ||
| from truthound.datasources.sql import SQLiteDataSource | ||
| return SQLiteDataSource(table=table, database=db_path, **kwargs) | ||
| if connection_string.endswith(".db") or connection_string == ":memory:": | ||
@@ -267,2 +272,24 @@ from truthound.datasources.sql import SQLiteDataSource | ||
| # DuckDB: URI format (duckdb:///path) or file suffix (.duckdb) | ||
| if connection_string.startswith("duckdb:") or connection_string.endswith(".duckdb"): | ||
| try: | ||
| from truthound.datasources.sql import DuckDBDataSource | ||
| except ImportError as e: | ||
| raise DataSourceError( | ||
| "DuckDB support requires duckdb. " | ||
| "Install with: pip install duckdb" | ||
| ) from e | ||
| if DuckDBDataSource is None: | ||
| raise DataSourceError( | ||
| "DuckDB support requires duckdb. " | ||
| "Install with: pip install duckdb" | ||
| ) | ||
| if connection_string.startswith("duckdb:"): | ||
| db_path = connection_string.replace("duckdb:///", "").replace("duckdb://", "") | ||
| if not db_path: | ||
| db_path = ":memory:" | ||
| else: | ||
| db_path = connection_string | ||
| return DuckDBDataSource(table=table, database=db_path, **kwargs) | ||
| # Oracle | ||
@@ -312,3 +339,4 @@ if connection_string.startswith("oracle://") or "oracle" in connection_string.lower(): | ||
| f"Unsupported SQL connection string format: {connection_string}. " | ||
| "Supported: postgresql://, mysql://, mssql://, SQLite file path. " | ||
| "Supported: postgresql://, mysql://, sqlite:///path, duckdb:///path, " | ||
| "mssql://, sqlserver://, *.db, *.duckdb. " | ||
| "For BigQuery, Snowflake, Redshift, Databricks, use their specific classes." | ||
@@ -351,4 +379,8 @@ ) | ||
| return "mysql" | ||
| if data.endswith(".db") or data == ":memory:": | ||
| if data.startswith("sqlite:") or data.endswith(".db") or data == ":memory:": | ||
| return "sqlite" | ||
| if data.startswith("duckdb:") or data.endswith(".duckdb"): | ||
| return "duckdb" | ||
| if data.startswith(("mssql://", "sqlserver://")): | ||
| return "sqlserver" | ||
| return "unknown" | ||
@@ -440,1 +472,78 @@ | ||
| return DictDataSource(data) | ||
| def get_datasource_from_config(config: dict[str, Any]) -> DataSourceProtocol: | ||
| """Create a DataSource from a configuration dictionary. | ||
| Convenience function for creating data sources from parsed | ||
| configuration files (JSON/YAML). Delegates to ``get_sql_datasource()`` | ||
| for connection-string-based configs or constructs backend-specific | ||
| classes from individual parameters. | ||
| Config styles supported: | ||
| Connection string:: | ||
| {"connection": "postgresql://user:pass@host/db", "table": "users"} | ||
| Individual parameters:: | ||
| {"type": "postgresql", "host": "localhost", "database": "mydb", | ||
| "user": "postgres", "password": "...", "table": "users"} | ||
| Args: | ||
| config: Configuration dictionary with connection details. | ||
| Returns: | ||
| Configured DataSource instance. | ||
| Raises: | ||
| DataSourceError: If the config is invalid or backend unavailable. | ||
| """ | ||
| connection = config.get("connection") | ||
| table = config.get("table") | ||
| query = config.get("query") | ||
| source_type = config.get("type") | ||
| # Style 1: Connection string | ||
| if connection: | ||
| if not table and not query: | ||
| raise DataSourceError( | ||
| "Config with 'connection' requires 'table' or 'query'." | ||
| ) | ||
| return get_sql_datasource( | ||
| connection, table=table or "__query__", query=query | ||
| ) | ||
| # Style 2: Individual parameters | ||
| if not source_type: | ||
| raise DataSourceError( | ||
| "Config must have either 'connection' or 'type' field." | ||
| ) | ||
| if not table and not query: | ||
| raise DataSourceError( | ||
| f"Config for type '{source_type}' requires 'table' or 'query'." | ||
| ) | ||
| from truthound.datasources.sql import get_available_sources | ||
| available = get_available_sources() | ||
| source_cls = available.get(source_type) | ||
| if source_cls is None: | ||
| available_names = [k for k, v in available.items() if v is not None] | ||
| raise DataSourceError( | ||
| f"Data source type '{source_type}' is not available. " | ||
| f"Available: {', '.join(available_names)}." | ||
| ) | ||
| # Build kwargs (exclude meta keys) | ||
| meta_keys = {"type", "table", "query", "name"} | ||
| kwargs: dict[str, Any] = {"table": table} if table else {} | ||
| if query: | ||
| kwargs["query"] = query | ||
| for key, value in config.items(): | ||
| if key not in meta_keys: | ||
| kwargs[key] = value | ||
| return source_cls(**kwargs) |
@@ -183,2 +183,49 @@ """Tests for check command --exclude-columns and --validator-config options.""" | ||
| class TestCheckDatasourceOptions: | ||
| """Tests for --connection, --table, and --source-config on check.""" | ||
| def test_check_with_connection_string(self, runner, app): | ||
| """--connection + --table passes source to API.""" | ||
| with ( | ||
| patch("truthound.datasources.factory.get_sql_datasource") as mock_sql, | ||
| patch("truthound.api.check") as mock_check, | ||
| ): | ||
| import polars as pl | ||
| mock_source = MagicMock() | ||
| mock_source.name = "users" | ||
| mock_source.to_polars_lazyframe.return_value = pl.LazyFrame({"id": [1]}) | ||
| mock_sql.return_value = mock_source | ||
| mock_report = MagicMock() | ||
| mock_report.has_issues = False | ||
| mock_report.exception_summary = None | ||
| mock_check.return_value = mock_report | ||
| result = runner.invoke(app, [ | ||
| "--connection", "postgresql://user:pass@host/db", | ||
| "--table", "users", | ||
| ]) | ||
| assert result.exit_code == 0 | ||
| # check() should be called with source= keyword | ||
| call_kwargs = mock_check.call_args | ||
| assert call_kwargs.kwargs.get("source") is not None | ||
| def test_check_file_and_connection_mutually_exclusive(self, runner, app, sample_csv): | ||
| """Both file and --connection raises error.""" | ||
| result = runner.invoke(app, [ | ||
| str(sample_csv), | ||
| "--connection", "postgresql://host/db", | ||
| "--table", "t", | ||
| ]) | ||
| assert result.exit_code != 0 | ||
| def test_check_connection_without_table_error(self, runner, app): | ||
| """--connection without --table raises error.""" | ||
| result = runner.invoke(app, [ | ||
| "--connection", "postgresql://host/db", | ||
| ]) | ||
| assert result.exit_code != 0 | ||
| class TestCombinedOptions: | ||
@@ -185,0 +232,0 @@ """Tests for combined --exclude-columns and --validator-config.""" |
| [build-system] | ||
| requires = ["hatchling"] | ||
| build-backend = "hatchling.build" | ||
| [project] | ||
| name = "truthound" | ||
| version = "1.3.2" | ||
| description = "Zero-Configuration Data Quality Framework Powered by Polars" | ||
| readme = "README.md" | ||
| license = "Apache-2.0" | ||
| requires-python = ">=3.11" | ||
| authors = [ | ||
| { name = "seadonggyun4", email = "seadonggyun4@gmail.com" } | ||
| ] | ||
| keywords = [ | ||
| "data-quality", | ||
| "data-validation", | ||
| "polars", | ||
| "pii-detection", | ||
| "data-masking", | ||
| ] | ||
| classifiers = [ | ||
| "Development Status :: 3 - Alpha", | ||
| "Intended Audience :: Developers", | ||
| "Intended Audience :: Science/Research", | ||
| "License :: OSI Approved :: Apache Software License", | ||
| "Operating System :: OS Independent", | ||
| "Programming Language :: Python :: 3", | ||
| "Programming Language :: Python :: 3.11", | ||
| "Programming Language :: Python :: 3.12", | ||
| "Topic :: Scientific/Engineering", | ||
| "Topic :: Software Development :: Libraries :: Python Modules", | ||
| "Typing :: Typed", | ||
| ] | ||
| dependencies = [ | ||
| "polars>=1.0.0", | ||
| "pyyaml>=6.0.0", | ||
| "rich>=13.0.0", | ||
| "typer>=0.12.0", | ||
| ] | ||
| [project.optional-dependencies] | ||
| # Report generation | ||
| reports = [ | ||
| "jinja2>=3.0.0", | ||
| ] | ||
| # Statistical drift detection | ||
| drift = [ | ||
| "scipy>=1.10.0", | ||
| ] | ||
| # Anomaly detection with ML | ||
| anomaly = [ | ||
| "scipy>=1.10.0", | ||
| "scikit-learn>=1.3.0", | ||
| ] | ||
| # Cloud storage backends | ||
| s3 = [ | ||
| "boto3>=1.26.0", | ||
| ] | ||
| gcs = [ | ||
| "google-cloud-storage>=2.0.0", | ||
| ] | ||
| azure = [ | ||
| "azure-storage-blob>=12.0.0", | ||
| ] | ||
| # Database storage backend | ||
| database = [ | ||
| "sqlalchemy>=2.0.0", | ||
| ] | ||
| # All storage backends | ||
| stores = [ | ||
| "boto3>=1.26.0", | ||
| "google-cloud-storage>=2.0.0", | ||
| "azure-storage-blob>=12.0.0", | ||
| "sqlalchemy>=2.0.0", | ||
| ] | ||
| # DuckDB database | ||
| duckdb = [ | ||
| "duckdb>=1.0.0", | ||
| ] | ||
| # NoSQL databases | ||
| mongodb = [ | ||
| "motor>=3.0.0", | ||
| ] | ||
| elasticsearch = [ | ||
| "elasticsearch[async]>=8.0.0", | ||
| ] | ||
| nosql = [ | ||
| "motor>=3.0.0", | ||
| "elasticsearch[async]>=8.0.0", | ||
| ] | ||
| # Streaming platforms | ||
| kafka = [ | ||
| "aiokafka>=0.9.0", | ||
| ] | ||
| streaming = [ | ||
| "aiokafka>=0.9.0", | ||
| ] | ||
| # All async datasources | ||
| async-datasources = [ | ||
| "motor>=3.0.0", | ||
| "elasticsearch[async]>=8.0.0", | ||
| "aiokafka>=0.9.0", | ||
| ] | ||
| # Interactive dashboard (Phase 8) | ||
| dashboard = [ | ||
| "reflex>=0.4.0", | ||
| ] | ||
| # PDF export support | ||
| pdf = [ | ||
| "weasyprint>=60.0", | ||
| ] | ||
| # Performance optimization | ||
| perf = [ | ||
| "xxhash>=3.4.0", | ||
| ] | ||
| # Full installation with all optional dependencies | ||
| all = [ | ||
| "jinja2>=3.0.0", | ||
| "pandas>=2.0.0", | ||
| "scipy>=1.10.0", | ||
| "scikit-learn>=1.3.0", | ||
| "boto3>=1.26.0", | ||
| "google-cloud-storage>=2.0.0", | ||
| "azure-storage-blob>=12.0.0", | ||
| "sqlalchemy>=2.0.0", | ||
| "duckdb>=1.0.0", | ||
| "reflex>=0.4.0", | ||
| "weasyprint>=60.0", | ||
| "motor>=3.0.0", | ||
| "elasticsearch[async]>=8.0.0", | ||
| "aiokafka>=0.9.0", | ||
| "xxhash>=3.4.0", | ||
| ] | ||
| # Development dependencies | ||
| dev = [ | ||
| "pytest>=8.0.0", | ||
| "pytest-cov>=4.0.0", | ||
| "pytest-asyncio>=0.23.0", | ||
| "ruff>=0.4.0", | ||
| "mypy>=1.10.0", | ||
| "pandas>=2.0.0", | ||
| "scipy>=1.10.0", | ||
| "scikit-learn>=1.3.0", | ||
| ] | ||
| [project.scripts] | ||
| truthound = "truthound.cli:app" | ||
| [project.urls] | ||
| Homepage = "https://github.com/seadonggyun4/Truthound" | ||
| Repository = "https://github.com/seadonggyun4/Truthound" | ||
| Issues = "https://github.com/seadonggyun4/Truthound/issues" | ||
| [tool.hatch.build.targets.wheel] | ||
| packages = ["src/truthound"] | ||
| [tool.hatch.envs.default] | ||
| dependencies = [ | ||
| "pytest>=8.0.0", | ||
| "pytest-cov>=4.0.0", | ||
| "ruff>=0.4.0", | ||
| "mypy>=1.10.0", | ||
| "pandas>=2.0.0", | ||
| ] | ||
| [tool.hatch.envs.default.scripts] | ||
| test = "pytest {args:tests}" | ||
| test-cov = "pytest --cov=truthound --cov-report=term-missing {args:tests}" | ||
| lint = "ruff check src tests" | ||
| format = "ruff format src tests" | ||
| typecheck = "mypy src" | ||
| [tool.ruff] | ||
| target-version = "py311" | ||
| line-length = 100 | ||
| src = ["src", "tests"] | ||
| [tool.ruff.lint] | ||
| select = [ | ||
| "E", # pycodestyle errors | ||
| "W", # pycodestyle warnings | ||
| "F", # Pyflakes | ||
| "I", # isort | ||
| "UP", # pyupgrade | ||
| "B", # flake8-bugbear | ||
| "SIM", # flake8-simplify | ||
| "TCH", # flake8-type-checking | ||
| ] | ||
| ignore = [ | ||
| "E501", # line too long (handled by formatter) | ||
| ] | ||
| [tool.ruff.lint.isort] | ||
| known-first-party = ["truthound"] | ||
| [tool.mypy] | ||
| python_version = "3.11" | ||
| strict = true | ||
| warn_return_any = true | ||
| warn_unused_ignores = true | ||
| [tool.pytest.ini_options] | ||
| testpaths = ["tests"] | ||
| pythonpath = ["src"] | ||
| markers = [ | ||
| "slow: marks tests as slow (deselect with '-m \"not slow\"')", | ||
| "e2e: marks tests as end-to-end tests", | ||
| "scale_100m: marks tests as 100M+ scale tests (run with '-m scale_100m')", | ||
| ] |