
duckalog
Duckalog is a Python library and CLI that builds DuckDB catalogs from declarative YAML/JSON configs.
Ready to try the examples? See the examples/ directory for hands-on learning.
Requirements: Python 3.12 or newer
pip install duckalog
This installs the Python package and provides the duckalog CLI command.
For the web UI dashboard, install with optional UI dependencies:
pip install duckalog[ui]
The duckalog[ui] extra includes these core dependencies:
- starlette (>=0.27.0): ASGI web framework
- datastar-python (>=0.1.0): Reactive web framework
- uvicorn[standard] (>=0.20.0): ASGI server

The web UI uses Datastar for reactive, real-time updates.
The bundled Datastar JavaScript is served from /static/datastar.js and works offline without external network access.
For better YAML formatting preservation, install the optional dependency:
pip install duckalog[ui,yaml]
# or
pip install ruamel.yaml>=0.17.0
This keeps comments and key ordering intact when duckalog rewrites YAML configs.
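A quick illustration of the difference, using ruamel.yaml's round-trip mode directly (plain pyyaml would drop the comment):

import io
from ruamel.yaml import YAML

doc = """\
version: 1  # config schema version
duckdb:
  database: catalog.duckdb
"""

yaml = YAML()  # round-trip mode by default: preserves comments and ordering
data = yaml.load(doc)
data["duckdb"]["database"] = "analytics.duckdb"

out = io.StringIO()
yaml.dump(data, out)
print(out.getvalue())  # the '# config schema version' comment survives the rewrite

To verify the installation: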
duckalog --help
duckalog --version
Development installation:
git clone https://github.com/legout/duckalog.git
cd duckalog
pip install -e .
Using uv (recommended for development):
uv pip install -e .
For loading configuration files from remote storage systems and exporting catalogs to cloud storage (S3, GCS, Azure, SFTP), install with remote dependencies:
pip install duckalog[remote]
The duckalog[remote] extra includes support for:
- fsspec (>=2023.6.0): Unified filesystem interface for remote storage
- requests (>=2.28.0): HTTP/HTTPS support for remote configs

For specific cloud storage backends, install additional extras:
# AWS S3 support
pip install duckalog[remote-s3]
# Google Cloud Storage support
pip install duckalog[remote-gcs]
# Azure Blob Storage support
pip install duckalog[remote-azure]
# SFTP/SSH support
pip install duckalog[remote-sftp]
Remote configurations use standard authentication methods for each backend:
- Amazon S3: environment variables, ~/.aws/credentials, or IAM roles
- Google Cloud Storage: GOOGLE_APPLICATION_CREDENTIALS or Application Default Credentials

Note: Credentials are not embedded in URIs for security. Use standard authentication methods for each cloud provider.
Create a file catalog.yaml:
version: 1
duckdb:
database: catalog.duckdb
pragmas:
- "SET memory_limit='1GB'"
views:
- name: users
source: parquet
uri: "s3://my-bucket/data/users/*.parquet"
Duckalog makes it easy to get started with the init command, which generates a basic configuration template with educational examples:
# Create a basic YAML config (default)
duckalog init
# Create a JSON config with custom filename
duckalog init --format json --output my_config.json
# Create with custom database and project names
duckalog init --database sales.db --project sales_analytics
# Force overwrite existing file
duckalog init --force
The generated config includes educational examples to get you started. Build the catalog from your config:
duckalog build catalog.yaml
This will:
- Load and validate catalog.yaml.
- Connect to catalog.duckdb (creating it if necessary).
- Create or replace the users view.

To generate the SQL without executing it:

duckalog generate-sql catalog.yaml --output create_views.sql
create_views.sql will contain CREATE OR REPLACE VIEW statements for all
views defined in the config.
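For the single-view config above, the generated SQL would look roughly like this (illustrative; the exact statements duckalog emits may differ):

CREATE OR REPLACE VIEW users AS
SELECT * FROM read_parquet('s3://my-bucket/data/users/*.parquet');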
Export built DuckDB catalogs directly to cloud storage:
# Export to Amazon S3
duckalog build catalog.yaml --db-path s3://my-bucket/catalogs/analytics.duckdb
# Export to Google Cloud Storage
duckalog build catalog.yaml --db-path gs://my-project-bucket/catalogs/analytics.duckdb
# Export to Azure Blob Storage
duckalog build catalog.yaml --db-path abfs://account@container/catalogs/analytics.duckdb \
--azure-connection-string "DefaultEndpointsProtocol=https;AccountName=..."
# Export to SFTP server
duckalog build catalog.yaml --db-path sftp://server/path/catalogs/analytics.duckdb \
--sftp-host server.com --sftp-key-file ~/.ssh/id_rsa
# Export with custom authentication
duckalog build catalog.yaml --db-path s3://secure-bucket/catalogs/analytics.duckdb \
--fs-key AKIAIOSFODNN7EXAMPLE \
--fs-secret wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Remote export uses the same authentication patterns as remote configuration loading:
# AWS profiles (recommended)
duckalog build catalog.yaml --db-path s3://bucket/catalog.duckdb --aws-profile production
# Google Cloud service accounts
duckalog build catalog.yaml --db-path gs://bucket/catalog.duckdb \
--gcs-credentials-file /path/to/service-account.json
# Azure managed identities or connection strings
duckalog build catalog.yaml --db-path abfs://account@container/catalog.duckdb \
--azure-connection-string "DefaultEndpointsProtocol=https;..."
# SFTP with SSH keys
duckalog build catalog.yaml --db-path sftp://server/path/catalog.duckdb \
--sftp-host server.com --sftp-key-file ~/.ssh/id_rsa
duckalog validate catalog.yaml
This parses and validates the config (including environment variable interpolation) without connecting to DuckDB.
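Because validation never touches DuckDB, it is cheap to run as a guard before every build:

duckalog validate catalog.yaml && duckalog build catalog.yaml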
# Try multi-source analytics
cd examples/data-integration/multi-source-analytics
python data/generate.py
duckalog build catalog.yaml
# Try environment variables security
cd examples/production-operations/environment-variables-security
python generate-test-data.py
python validate-configs.py dev
# Try DuckDB performance tuning
cd examples/production-operations/duckdb-performance-settings
python generate-datasets.py --size small
duckalog build catalog-limited.yaml
Duckalog supports loading configuration files directly from remote storage systems:
# Load config from S3
duckalog build s3://my-bucket/configs/catalog.yaml
# Load config from Google Cloud Storage
duckalog validate gs://my-project/configs/catalog.yaml
# Load config from Azure Blob Storage
duckalog generate-sql abfs://my-account@my-container/configs/catalog.yaml
# Load config from HTTPS URL
duckalog build https://raw.githubusercontent.com/user/repo/main/catalog.yaml
# Load config from SFTP server
duckalog validate sftp://user@server/path/configs/catalog.yaml
Remote Configuration Features:
- Environment variable interpolation: ${env:VAR} patterns work with remote configs

Limitations:
- The web UI currently supports only local configuration files (see the web UI section below).
For advanced authentication scenarios, you can pass pre-configured fsspec filesystem objects directly to the Python API or use CLI options for dynamic filesystem creation.
Duckalog supports authentication for all major cloud storage providers through custom filesystems:
| Provider | Protocol | Authentication Methods | CLI Options |
|---|---|---|---|
| Amazon S3 | s3:// | AWS credentials, profiles, IAM roles | --fs-key/--fs-secret, --aws-profile |
| Google Cloud Storage | gs:// | Service accounts, ADC | --gcs-credentials-file |
| Azure Blob Storage | abfs:// | Connection strings, account keys | --azure-connection-string |
| GitHub | github:// | Personal access tokens, username/password | --fs-token, --fs-key/--fs-secret |
| SFTP | sftp:// | SSH keys, passwords, key files | --sftp-host, --sftp-key-file |
import fsspec
from duckalog import load_config
# S3 with direct credentials
fs = fsspec.filesystem("s3",
key="AKIAIOSFODNN7EXAMPLE",
secret="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
)
config = load_config("s3://my-bucket/config.yaml", filesystem=fs)
# GitHub with personal access token
fs = fsspec.filesystem("github", token="ghp_xxxxxxxxxxxxxxxxxxxx")
config = load_config("github://user/repo/config.yaml", filesystem=fs)
# Azure with connection string
fs = fsspec.filesystem("abfs",
connection_string="DefaultEndpointsProtocol=https;AccountName=account;AccountKey=key;EndpointSuffix=core.windows.net"
)
config = load_config("abfs://account@container/config.yaml", filesystem=fs)
# SFTP with SSH key
fs = fsspec.filesystem("sftp",
host="sftp.example.com",
username="user",
private_key="~/.ssh/id_rsa"
)
config = load_config("sftp://user@sftp.example.com/path/config.yaml", filesystem=fs)
# Google Cloud with service account
fs = fsspec.filesystem("gcs", token="/path/to/service-account.json")
config = load_config("gs://my-bucket/config.yaml", filesystem=fs)
# S3 with direct credentials
duckalog build s3://bucket/config.yaml \
--fs-key AKIAIOSFODNN7EXAMPLE \
--fs-secret wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY \
--fs-timeout 60
# S3 with AWS profile (recommended for production)
duckalog build s3://bucket/config.yaml \
--aws-profile myprofile
# GitHub with personal access token
duckalog validate github://user/repo/config.yaml \
--fs-token ghp_xxxxxxxxxxxxxxxxxxxx
# Azure with connection string
duckalog generate-sql abfs://account@container/config.yaml \
--azure-connection-string "DefaultEndpointsProtocol=https;AccountName=account;AccountKey=key"
# Azure with account key authentication
duckalog build abfs://account@container/config.yaml \
--fs-key myaccountname \
--fs-secret myaccountkey
# SFTP with SSH key file
duckalog build sftp://user@server/path/config.yaml \
--sftp-host sftp.example.com \
--sftp-key-file ~/.ssh/id_rsa \
--sftp-port 22
# SFTP with password authentication
duckalog build sftp://user@server/path/config.yaml \
--sftp-host sftp.example.com \
--fs-key username \
--fs-secret password
# Google Cloud with service account file
duckalog validate gs://my-project/config.yaml \
--gcs-credentials-file /path/to/service-account.json
# Anonymous access (public S3 buckets)
duckalog build s3://public-bucket/config.yaml \
--fs-anon true
# HTTP/HTTPS (no authentication needed)
duckalog build https://raw.githubusercontent.com/user/repo/main/config.yaml
The CLI can automatically infer the filesystem protocol from the provided options:
# These commands will automatically use the correct protocol:
duckalog build s3://bucket/config.yaml --aws-profile myprofile # → S3
duckalog build gs://bucket/config.yaml --gcs-credentials-file file.json # → GCS
duckalog build github://user/repo/config.yaml --fs-token token # → GitHub
duckalog build sftp://server/config.yaml --sftp-host server.com # → SFTP
Duckalog provides comprehensive validation for filesystem options:
# Examples of helpful error messages
duckalog build s3://bucket/config.yaml --aws-profile myprofile --fs-key key
# Error: Cannot specify both --aws-profile and --fs-key
duckalog build sftp://server/config.yaml --sftp-key-file missing.txt
# Error: SFTP key file not found: missing.txt
duckalog build s3://bucket/config.yaml --fs-anon true --fs-key key
# Error: S3 with anonymous access doesn't require credentials
duckalog build gs://bucket/config.yaml --gcs-credentials-file invalid.json
# Error: GCS credentials file not found: invalid.json
| Use Case | Recommended Method | Reason |
|---|---|---|
| Production deployments | Environment variables | Most secure, no credentials in code |
| CI/CD pipelines | Custom filesystems | Secure credential injection |
| Local development | Environment variables or profiles | Easy and secure |
| Testing | Custom filesystems | Easy to mock and test |
| One-off commands | CLI options | Convenient for ad-hoc usage |
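For the production row above, the usual pattern is to keep credentials out of the command line entirely and let the storage backend read them from the environment; s3fs (the fsspec S3 backend) honors the standard AWS variables. A sketch, assuming your own profile and bucket names:

# Credentials come from the environment, not from CLI flags
export AWS_PROFILE=production        # or AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
duckalog build s3://my-bucket/configs/catalog.yaml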
Common Issues:
"fsspec is required" error
pip install duckalog[remote] # Install with remote dependencies
Authentication failures
# Check credentials are correct
# Verify cloud provider permissions
# Test connectivity with cloud provider tools
Timeout issues
# Increase timeout for slow connections
duckalog build s3://bucket/config.yaml --fs-timeout 120
Protocol inference not working
# Explicitly specify protocol
duckalog build s3://bucket/config.yaml --fs-protocol s3 --fs-key key --fs-secret secret
duckalog ui catalog.yaml
Note: The web UI currently only supports local configuration files. For remote configs, download them locally first:
# Download remote config locally
curl -o catalog.yaml https://raw.githubusercontent.com/user/repo/main/catalog.yaml
# Then use with UI
duckalog ui catalog.yaml
This starts a secure, reactive web-based dashboard at http://127.0.0.1:8000.
# Set admin token for production security
export DUCKALOG_ADMIN_TOKEN="your-secure-random-token"
duckalog ui catalog.yaml --host 0.0.0.0 --port 8000
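The token can be any hard-to-guess string; one convenient way to generate one uses Python's standard secrets module:

export DUCKALOG_ADMIN_TOKEN="$(python -c 'import secrets; print(secrets.token_urlsafe(32))')"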
Dependencies: Requires duckalog[ui] installation for Datastar and Starlette dependencies.
Security: See docs/SECURITY.md for comprehensive security documentation.
The duckalog package exposes the same functionality as the CLI with
convenience functions:
from duckalog import build_catalog, generate_sql, validate_config
from duckalog.config_init import create_config_template
# Generate a basic configuration template
content = create_config_template(format="yaml")
print(content)
# Save a template to a file
create_config_template(
format="yaml",
output_path="my_config.yaml",
database_name="analytics.db",
project_name="my_project"
)
# Build or update a catalog file in place
build_catalog("catalog.yaml")
# Generate SQL without executing it
sql = generate_sql("catalog.yaml")
print(sql)
# Validate config (raises ConfigError on failure)
validate_config("catalog.yaml")
You can also work directly with the Pydantic model:
from duckalog import load_config
config = load_config("catalog.yaml")
for view in config.views:
print(view.name, view.source)
At a high level, configs follow this structure:
version: 1
duckdb:
database: catalog.duckdb
install_extensions: []
load_extensions: []
pragmas: []
attachments:
duckdb:
- alias: refdata
path: ./refdata.duckdb
read_only: true
sqlite:
- alias: legacy
path: ./legacy.db
postgres:
- alias: dw
host: "${env:PG_HOST}"
port: 5432
database: dw
user: "${env:PG_USER}"
password: "${env:PG_PASSWORD}"
iceberg_catalogs:
- name: main_ic
catalog_type: rest
uri: "https://iceberg-catalog.internal"
warehouse: "s3://my-warehouse/"
options:
token: "${env:ICEBERG_TOKEN}"
views:
# Parquet view
- name: users
source: parquet
uri: "s3://my-bucket/data/users/*.parquet"
# Delta view
- name: events_delta
source: delta
uri: "s3://my-bucket/delta/events"
# Iceberg catalog-based view
- name: ic_orders
source: iceberg
catalog: main_ic
table: analytics.orders
# Attached DuckDB view
- name: ref_countries
source: duckdb
database: refdata
table: reference.countries
# Raw SQL view
- name: vip_users
sql: |
SELECT *
FROM users
WHERE is_vip = TRUE
semantic_models:
# Business-friendly semantic model on top of existing view
- name: sales_analytics
base_view: sales_data
label: "Sales Analytics"
description: "Business metrics for sales analysis"
tags: ["sales", "revenue"]
dimensions:
- name: order_date
expression: "created_at::date"
label: "Order Date"
type: "date"
- name: customer_region
expression: "UPPER(customer_region)"
label: "Customer Region"
type: "string"
measures:
- name: total_revenue
expression: "SUM(amount)"
label: "Total Revenue"
type: "number"
- name: order_count
expression: "COUNT(*)"
label: "Order Count"
type: "number"
Semantic models provide business-friendly metadata on top of existing views. v1 is metadata-only: no new DuckDB views are created, and no automatic query generation is performed. Use semantic models to attach labels, descriptions, typed dimensions, and measures to existing views so that downstream tools and readers share one set of business definitions.
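Because v1 models are pure metadata, the natural way to consume them is to read them off the loaded config. A sketch, assuming the config exposes the model list as config.semantic_models with field names mirroring the YAML above (hypothetical attribute names):

from duckalog import load_config

config = load_config("catalog.yaml")
for model in config.semantic_models:   # hypothetical attribute mirroring the YAML key
    print(model.name, model.label)
    for measure in model.measures:     # field names follow the YAML structure above
        print(" ", measure.name, "=", measure.expression)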
Semantic layer v2 extends v1 with joins, time dimensions, and defaults while maintaining full backward compatibility.
New v2 features:
semantic_models:
- name: sales_analytics
base_view: sales_data
label: "Sales Analytics"
# v2: Joins to dimension views
joins:
- to_view: customers
type: left
on_condition: "sales.customer_id = customers.id"
- to_view: products
type: left
on_condition: "sales.product_id = products.id"
dimensions:
# v2: Time dimension with time grains
- name: order_date
expression: "created_at"
type: "time"
time_grains: ["year", "quarter", "month", "day"]
label: "Order Date"
- name: customer_region
expression: "customers.region"
type: "string"
label: "Customer Region"
measures:
- name: total_revenue
expression: "SUM(sales.amount)"
label: "Total Revenue"
type: "number"
# v2: Default configuration
defaults:
time_dimension: order_date
primary_measure: total_revenue
default_filters:
- dimension: customer_region
operator: "="
value: "NORTH AMERICA"
Backward compatibility: existing v1 semantic models continue to work unchanged.
See the examples/semantic_layer_v2 directory for a complete example demonstrating all v2 features.
Any string value may contain ${env:VAR_NAME} placeholders. During
load_config, these are resolved using os.environ["VAR_NAME"]. Missing
variables cause a ConfigError.
Examples:
duckdb:
pragmas:
- "SET s3_access_key_id='${env:AWS_ACCESS_KEY_ID}'"
- "SET s3_secret_access_key='${env:AWS_SECRET_ACCESS_KEY}'"
Duckalog automatically resolves relative paths to absolute paths, ensuring consistent behavior regardless of where Duckalog is executed from.
"data/file.parquet" are automatically resolved relative to the configuration file's directory"/absolute/path/file.parquet" or "C:\path\file.parquet") are preserved unchangeds3://, gs://, http://) and database connections are not modified"../../../etc/passwd")# Relative paths (recommended)
views:
- name: users
source: parquet
uri: "data/users.parquet" # Resolved to: /path/to/config/data/users.parquet
description: "User data relative to config location"
- name: events
source: parquet
uri: "../shared/events.parquet" # Resolved to: /path/to/../shared/events.parquet
description: "Shared data from parent directory"
# Absolute paths (still supported)
views:
- name: fixed_data
source: parquet
uri: "/absolute/path/data.parquet" # Used as-is
description: "Absolute path preserved unchanged"
# Remote URIs (not modified)
views:
- name: s3_data
source: parquet
uri: "s3://my-bucket/data/file.parquet" # Used as-is
description: "S3 paths unchanged"
Duckalog automatically preserves your configuration file format when making updates through the web UI:
- YAML files: uses ruamel.yaml when available for best results, falling back to pyyaml if needed
- Supported extensions: .yaml, .yml, .json
- Format detection: content starting with { or [ is treated as JSON, YAML otherwise

All configuration updates use atomic file operations.
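"Atomic" here has the usual meaning: write the new content to a temporary file in the same directory, then rename it over the original, so a reader never sees a half-written config. A generic sketch of the pattern (not duckalog's exact code):

import os
import tempfile

def atomic_write(path: str, content: str) -> None:
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for os.replace to be atomic
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(content)
        os.replace(tmp, path)  # atomic rename on POSIX and Windows
    except BaseException:
        os.unlink(tmp)
        raise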
We welcome contributions to duckalog! This section provides guidelines and instructions for contributing to the project.
Requirements: Python 3.12 or newer
This project uses automated version tagging to streamline releases. When you update the version in pyproject.toml and push to the main branch, the system automatically:
- reads the new version from pyproject.toml
- creates a Git tag v{version} (e.g., v0.1.0)
- triggers the publish.yml workflow to publish to PyPI

Simple Release Process:
# 1. Update version in pyproject.toml
sed -i 's/version = "0.1.0"/version = "0.1.1"/' pyproject.toml
# 2. Commit and push
git add pyproject.toml
git commit -m "bump: Update version to 0.1.1"
git push origin main
# 3. Automated tagging creates tag and triggers publishing
# Tag v0.1.1 is created automatically
# publish.yml workflow runs and publishes to PyPI
For detailed examples and troubleshooting, see the project documentation.
Duckalog uses a streamlined GitHub Actions setup to keep CI predictable:
- The publish workflow runs twine check, smoke-tests the wheel, and then reuses the artifacts for Test PyPI, PyPI, or dry-run scenarios. Release jobs rely on the Tests workflow's status rather than re-running the full test matrix.

For local development, we recommend:
- uv run ruff check src/ tests/ to run lint checks (CI treats these as required).
- uv run ruff format src/ tests/ to auto-format code (CI runs ruff format --check in advisory mode).
- uv run mypy src/duckalog to run type checks.

# Clone the repository
git clone https://github.com/legout/duckalog.git
cd duckalog
# Install in development mode
uv pip install -e .
# Clone the repository
git clone https://github.com/legout/duckalog.git
cd duckalog
# Install in development mode
pip install -e .
# Using uv
uv pip install -e ".[dev]"
# Using pip
pip install -e ".[dev]"
We follow the conventions documented in openspec/project.md:
- naming conventions for config models (e.g., AttachmentConfig, ViewConfig)

We use pytest for testing. The test suite includes both unit and integration tests:
# Run all tests
pytest
# Run with coverage
pytest --cov=duckalog
# Run specific test file
pytest tests/test_config.py
Testing Strategy:
For significant changes, we use OpenSpec to manage proposals and specifications:
Create a change proposal: Use the OpenSpec CLI to create a new change
openspec new "your-change-description"
Define requirements: Write specs with clear requirements and scenarios in changes/<id>/specs/
Plan implementation: Break down the work into tasks in changes/<id>/tasks.md
Validate your proposal: Ensure it meets project standards
openspec validate <change-id> --strict
Implement and test: Work through the tasks sequentially
See openspec/project.md for detailed project conventions and the OpenSpec workflow.
When submitting pull requests:
Branch naming: Use small, focused branches with the OpenSpec change-id (e.g., add-s3-parquet-support)
Commit messages:
Keep the split between spec changes (openspec/, docs/) and implementation changes (src/, tests/) clear.

PR description: Include a clear description of the change and link to relevant OpenSpec proposals
Testing: Ensure all tests pass and add new tests for new functionality
Review process: Be responsive to review feedback and address all comments
We prefer incremental, reviewable PRs over large multi-feature changes.
See also:
- plan/PRD_Spec.md for the full product and technical specification
- openspec/project.md for detailed development guidelines

Thank you for contributing to duckalog! 🚀