dataprof

High-performance data profiling with ISO 8000/25012 quality metrics

Crates.io docs.rs PyPI License: MIT OR Apache-2.0

dataprof is a Rust library and CLI for profiling tabular data. It computes column-level statistics, detects data types and patterns, and evaluates data quality against the ISO 8000/25012 standard -- all with bounded memory usage that lets you profile datasets far larger than your available RAM.

> [!NOTE]
> This project is a work in progress. Although it is now over six months old, the API is not yet stable and may change in future versions. Please report any issues or suggestions.

Highlights

  • Rust core -- fast columnar and streaming engines
  • ISO 8000/25012 quality assessment -- five dimensions: Completeness, Consistency, Uniqueness, Accuracy, Timeliness
  • Multi-format -- CSV (auto-delimiter detection), JSON, JSONL, Parquet, databases, DataFrames, Arrow
  • True streaming -- bounded-memory profiling with online algorithms (Incremental engine)
  • Three interfaces -- CLI binary, Rust library, Python package
  • Async-ready -- async/await API for embedding in web services and stream pipelines
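The "true streaming" highlight rests on online algorithms: statistics updated one record at a time in constant memory, so dataset size never dictates RAM usage. As an illustration of the idea (not dataprof's actual Rust implementation), here is Welford's classic online algorithm for mean and variance in Python:

```python
class OnlineStats:
    """Welford's online algorithm: mean and variance in one pass, O(1) memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # population variance; 0.0 until at least two values are seen
        return self.m2 / self.n if self.n > 1 else 0.0


stats = OnlineStats()
for value in [2.0, 4.0, 6.0, 8.0]:
    stats.update(value)
print(stats.mean, stats.variance)  # 5.0 5.0
```

The same pattern generalizes to null counts, min/max, and approximate cardinality, which is what makes bounded-memory profiling of larger-than-RAM files possible.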

Quick Start

CLI

```sh
cargo install dataprof

dataprof analyze data.csv --detailed
dataprof schema data.csv
dataprof count data.parquet
```

Rust

```rust
use dataprof::Profiler;

let report = Profiler::new().analyze_file("data.csv")?;
println!("Rows: {}", report.execution.rows_processed);
println!("Quality: {:.1}%", report.quality_score().unwrap_or(0.0));

for col in &report.column_profiles {
    println!("  {} ({:?}): {} nulls", col.name, col.data_type, col.null_count);
}
```

Python

```python
import dataprof

report = dataprof.profile("data.csv")
print(f"{report.rows} rows, {report.columns} columns")
print(f"Quality score: {report.quality_score}")

for col in report.column_profiles.values():
    print(f"  {col.name} ({col.data_type}): {col.null_percentage:.1f}% null")
```

Installation

CLI binary

```sh
cargo install dataprof                        # default (CLI only)
cargo install dataprof --features full-cli    # CLI + all formats + databases
```

Rust library

```toml
[dependencies]
dataprof = "0.6"    # core library (no CLI deps)

# or, to enable async streaming:
# dataprof = { version = "0.6", features = ["async-streaming"] }
```

Python package

```sh
uv pip install dataprof
# or
pip install dataprof
```

Feature Flags

| Feature | Description |
| --- | --- |
| `cli` (default) | CLI binary with clap, colored output, progress bars |
| `minimal` | CSV-only, no CLI -- fastest compile |
| `async-streaming` | Async profiling engine with tokio |
| `parquet-async` | Profile Parquet files over HTTP |
| `database` | Database profiling (connection handling, retry, SSL) |
| `postgres` | PostgreSQL connector (includes `database`) |
| `mysql` | MySQL/MariaDB connector (includes `database`) |
| `sqlite` | SQLite connector (includes `database`) |
| `all-db` | All three database connectors |
| `datafusion` | DataFusion SQL engine integration |
| `python` | Python bindings via PyO3 |
| `python-async` | Async Python API (includes `python` + `async-streaming`) |
| `full-cli` | CLI + Parquet + all databases |
| `production` | PostgreSQL + MySQL (common deployment) |
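Flags compose in the usual Cargo way. As a sketch (feature names from the table above; `default-features = false` drops the default `cli` feature when embedding the library):

```toml
[dependencies]
# Embed the library without the CLI, with Postgres and Parquet-over-HTTP profiling
dataprof = { version = "0.6", default-features = false, features = ["postgres", "parquet-async"] }
```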

Supported Formats

| Format | Engine | Notes |
| --- | --- | --- |
| CSV | Incremental, Columnar | Auto-detects `,` `;` `\|` `\t` delimiters |
| JSON | Incremental | Array-of-objects |
| JSONL / NDJSON | Incremental | One object per line |
| Parquet | Columnar | Reads metadata for schema/count without scanning rows |
| Database query | Async | PostgreSQL, MySQL, SQLite via connection string |
| pandas / polars DataFrame | Columnar | Python API only |
| Arrow RecordBatch | Columnar | Via PyCapsule (zero-copy) or Rust API |
| Async byte stream | Incremental | Any `AsyncRead` source (HTTP, WebSocket, etc.) |
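Delimiter auto-detection typically works by sampling the first few rows and scoring candidate separators for consistent per-row frequency. A minimal sketch of the technique using Python's standard-library `csv.Sniffer` (not dataprof's Rust detector):

```python
import csv


def detect_delimiter(sample: str) -> str:
    """Guess the delimiter from a text sample, restricted to common candidates."""
    dialect = csv.Sniffer().sniff(sample, delimiters=",;|\t")
    return dialect.delimiter


print(detect_delimiter("a;b;c\n1;2;3\n"))  # -> ;
print(detect_delimiter("a,b,c\n1,2,3\n"))  # -> ,
```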

Quality Metrics

dataprof evaluates data quality against the five dimensions defined in ISO 8000-8 and ISO/IEC 25012:

| Dimension | What it measures |
| --- | --- |
| Completeness | Missing values ratio, complete records ratio, fully-null columns |
| Consistency | Data type consistency, format violations, encoding issues |
| Uniqueness | Duplicate rows, key uniqueness, high-cardinality warnings |
| Accuracy | Outlier ratio, range violations, negative values in positive-only columns |
| Timeliness | Future dates, stale data ratio, temporal ordering violations |

An overall quality score (0 -- 100) is computed as a weighted average of dimension scores.
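As a concrete illustration of the weighted average, here is a short Python sketch. The weights below are hypothetical, chosen only to make the arithmetic visible; dataprof's actual weights may differ:

```python
# Hypothetical weights for the five ISO dimensions (must sum to 1.0).
WEIGHTS = {
    "completeness": 0.30,
    "consistency": 0.25,
    "uniqueness": 0.15,
    "accuracy": 0.20,
    "timeliness": 0.10,
}


def overall_score(dimension_scores: dict[str, float]) -> float:
    """Weighted average of per-dimension scores, each on a 0-100 scale."""
    return sum(WEIGHTS[d] * dimension_scores[d] for d in WEIGHTS)


scores = {
    "completeness": 90.0,
    "consistency": 80.0,
    "uniqueness": 100.0,
    "accuracy": 70.0,
    "timeliness": 60.0,
}
print(overall_score(scores))  # 27 + 20 + 15 + 14 + 6 = 82.0
```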

Documentation

Academic Work

dataprof is the subject of a peer-reviewed paper submitted to IEEE ScalCom 2026:

A. Bozzo, "A Compiled Paradigm for Scalable and Sustainable Edge AI: Out-of-Core Execution and SIMD Acceleration in Telemetry Profiling," IEEE ScalCom 2026 (under review). [Repository & reproducible benchmarks]

The paper benchmarks dataprof against YData Profiling, Polars, and pandas across execution efficiency, memory scalability, energy consumption, and zero-copy interoperability in constrained Edge AI environments.

BibTeX

```bibtex
@inproceedings{bozzo2026compiled,
  author={Bozzo, Andrea},
  title={A Compiled Paradigm for Scalable and Sustainable Edge AI: Out-of-Core Execution and SIMD Acceleration in Telemetry Profiling},
  booktitle={2026 IEEE International Conference on Scalable Computing and Communications (ScalCom)},
  year={2026},
  note={Under review}
}
```

License

Dual-licensed under either the MIT License or the Apache License, Version 2.0, at your option.
