dataprof

High-performance data profiling with ISO 8000/25012 quality metrics

Crates.io docs.rs PyPI License: MIT OR Apache-2.0

dataprof is a Rust library and CLI for profiling tabular data. It computes column-level statistics, detects data types and patterns, and evaluates data quality against the ISO 8000/25012 standard -- all with bounded memory usage that lets you profile datasets far larger than your available RAM.

> [!NOTE]
> This project is a work in progress. Although it is now over six months old, the API is not yet stable and may change in future versions. Please report any issues or suggestions.

Highlights

  • Rust core -- fast columnar and streaming engines
  • ISO 8000/25012 quality assessment -- five dimensions: Completeness, Consistency, Uniqueness, Accuracy, Timeliness
  • Multi-format -- CSV (auto-delimiter detection), JSON, JSONL, Parquet, databases, DataFrames, Arrow
  • True streaming -- bounded-memory profiling with online algorithms (Incremental engine)
  • Three interfaces -- CLI binary, Rust library, Python package
  • Async-ready -- async/await API for embedding in web services and stream pipelines
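The "true streaming" highlight rests on online algorithms: statistics updated one record at a time in constant memory, so dataset size never dictates RAM usage. As an illustration of the idea (not dataprof's actual Rust implementation), here is Welford's classic online algorithm for mean and variance in Python:

```python
class OnlineStats:
    """Welford's online algorithm: mean and variance in one pass, O(1) memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # population variance; 0.0 until at least two values are seen
        return self.m2 / self.n if self.n > 1 else 0.0


stats = OnlineStats()
for value in [2.0, 4.0, 6.0, 8.0]:
    stats.update(value)
print(stats.mean, stats.variance)  # 5.0 5.0
```

The same pattern generalizes to null counts, min/max, and approximate cardinality, which is what makes bounded-memory profiling of larger-than-RAM files possible.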

Quick Start

CLI

```sh
cargo install dataprof

dataprof analyze data.csv --detailed
dataprof schema data.csv
dataprof count data.parquet
```

Rust

```rust
use dataprof::Profiler;

let report = Profiler::new().analyze_file("data.csv")?;
println!("Rows: {}", report.execution.rows_processed);
println!("Quality: {:.1}%", report.quality_score().unwrap_or(0.0));

for col in &report.column_profiles {
    println!("  {} ({:?}): {} nulls", col.name, col.data_type, col.null_count);
}
```

Python

```python
import dataprof

report = dataprof.profile("data.csv")
print(f"{report.rows} rows, {report.columns} columns")
print(f"Quality score: {report.quality_score}")

for col in report.column_profiles.values():
    print(f"  {col.name} ({col.data_type}): {col.null_percentage:.1f}% null")
```

Installation

CLI binary

```sh
cargo install dataprof                        # default (CLI only)
cargo install dataprof --features full-cli    # CLI + all formats + databases
```

Rust library

```toml
[dependencies]
dataprof = "0.6"    # core library (no CLI deps)

# or, to enable async streaming:
# dataprof = { version = "0.6", features = ["async-streaming"] }
```

Python package

```sh
uv pip install dataprof
# or
pip install dataprof
```

Feature Flags

| Feature | Description |
| --- | --- |
| `cli` (default) | CLI binary with clap, colored output, progress bars |
| `minimal` | CSV-only, no CLI -- fastest compile |
| `async-streaming` | Async profiling engine with tokio |
| `parquet-async` | Profile Parquet files over HTTP |
| `database` | Database profiling (connection handling, retry, SSL) |
| `postgres` | PostgreSQL connector (includes `database`) |
| `mysql` | MySQL/MariaDB connector (includes `database`) |
| `sqlite` | SQLite connector (includes `database`) |
| `all-db` | All three database connectors |
| `datafusion` | DataFusion SQL engine integration |
| `python` | Python bindings via PyO3 |
| `python-async` | Async Python API (includes `python` + `async-streaming`) |
| `full-cli` | CLI + Parquet + all databases |
| `production` | PostgreSQL + MySQL (common deployment) |
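Flags compose in the usual Cargo way. As a sketch (feature names from the table above; `default-features = false` drops the default `cli` feature when embedding the library):

```toml
[dependencies]
# Embed the library without the CLI, with Postgres and Parquet-over-HTTP profiling
dataprof = { version = "0.6", default-features = false, features = ["postgres", "parquet-async"] }
```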

Supported Formats

| Format | Engine | Notes |
| --- | --- | --- |
| CSV | Incremental, Columnar | Auto-detects `,` `;` `\|` `\t` delimiters |
| JSON | Incremental | Array-of-objects |
| JSONL / NDJSON | Incremental | One object per line |
| Parquet | Columnar | Reads metadata for schema/count without scanning rows |
| Database query | Async | PostgreSQL, MySQL, SQLite via connection string |
| pandas / polars DataFrame | Columnar | Python API only |
| Arrow RecordBatch | Columnar | Via PyCapsule (zero-copy) or Rust API |
| Async byte stream | Incremental | Any `AsyncRead` source (HTTP, WebSocket, etc.) |
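Delimiter auto-detection typically works by sampling the first few rows and scoring candidate separators for consistent per-row frequency. A minimal sketch of the technique using Python's standard-library `csv.Sniffer` (not dataprof's Rust detector):

```python
import csv


def detect_delimiter(sample: str) -> str:
    """Guess the delimiter from a text sample, restricted to common candidates."""
    dialect = csv.Sniffer().sniff(sample, delimiters=",;|\t")
    return dialect.delimiter


print(detect_delimiter("a;b;c\n1;2;3\n"))  # -> ;
print(detect_delimiter("a,b,c\n1,2,3\n"))  # -> ,
```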

Quality Metrics

dataprof evaluates data quality against the five dimensions defined in ISO 8000-8 and ISO/IEC 25012:

| Dimension | What it measures |
| --- | --- |
| Completeness | Missing values ratio, complete records ratio, fully-null columns |
| Consistency | Data type consistency, format violations, encoding issues |
| Uniqueness | Duplicate rows, key uniqueness, high-cardinality warnings |
| Accuracy | Outlier ratio, range violations, negative values in positive-only columns |
| Timeliness | Future dates, stale data ratio, temporal ordering violations |

An overall quality score (0 -- 100) is computed as a weighted average of dimension scores.
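As a concrete illustration of the weighted average, here is a short Python sketch. The weights below are hypothetical, chosen only to make the arithmetic visible; dataprof's actual weights may differ:

```python
# Hypothetical weights for the five ISO dimensions (must sum to 1.0).
WEIGHTS = {
    "completeness": 0.30,
    "consistency": 0.25,
    "uniqueness": 0.15,
    "accuracy": 0.20,
    "timeliness": 0.10,
}


def overall_score(dimension_scores: dict[str, float]) -> float:
    """Weighted average of per-dimension scores, each on a 0-100 scale."""
    return sum(WEIGHTS[d] * dimension_scores[d] for d in WEIGHTS)


scores = {
    "completeness": 90.0,
    "consistency": 80.0,
    "uniqueness": 100.0,
    "accuracy": 70.0,
    "timeliness": 60.0,
}
print(overall_score(scores))  # 27 + 20 + 15 + 14 + 6 = 82.0
```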

Documentation

Academic Work

dataprof is the subject of a peer-reviewed paper submitted to IEEE ScalCom 2026:

A. Bozzo, "A Compiled Paradigm for Scalable and Sustainable Edge AI: Out-of-Core Execution and SIMD Acceleration in Telemetry Profiling," IEEE ScalCom 2026 (under review). [Repository & reproducible benchmarks]

The paper benchmarks dataprof against YData Profiling, Polars, and pandas across execution efficiency, memory scalability, energy consumption, and zero-copy interoperability in constrained Edge AI environments.

BibTeX

```bibtex
@inproceedings{bozzo2026compiled,
  author={Bozzo, Andrea},
  title={A Compiled Paradigm for Scalable and Sustainable Edge AI: Out-of-Core Execution and SIMD Acceleration in Telemetry Profiling},
  booktitle={2026 IEEE International Conference on Scalable Computing and Communications (ScalCom)},
  year={2026},
  note={Under review}
}
```

License

Dual-licensed under either the MIT License or the Apache License, Version 2.0, at your option.
