
dataprof
dataprof is a Rust library and CLI for profiling tabular data. It computes column-level statistics, detects data types and patterns, and evaluates data quality against ISO 8000-8 and ISO/IEC 25012, all with bounded memory usage that lets you profile datasets far larger than your available RAM.
[!NOTE] This project is a work in progress. It is over 6 months old, but the API is not yet stable and may change in future versions. Please report any issues or suggestions you may have.
async/await API for embedding in web services and stream pipelines

```bash
cargo install dataprof
dataprof analyze data.csv --detailed
dataprof schema data.csv
dataprof count data.parquet
```
```rust
use dataprof::Profiler;

let report = Profiler::new().analyze_file("data.csv")?;
println!("Rows: {}", report.execution.rows_processed);
println!("Quality: {:.1}%", report.quality_score().unwrap_or(0.0));
for col in &report.column_profiles {
    println!("  {} ({:?}): {} nulls", col.name, col.data_type, col.null_count);
}
```
```python
import dataprof

report = dataprof.profile("data.csv")
print(f"{report.rows} rows, {report.columns} columns")
print(f"Quality score: {report.quality_score}")
for col in report.column_profiles.values():
    print(f"  {col.name} ({col.data_type}): {col.null_percentage:.1f}% null")
```
```bash
cargo install dataprof                      # default (CLI only)
cargo install dataprof --features full-cli  # CLI + all formats + databases
```
```toml
[dependencies]
dataprof = "0.6"  # core library (no CLI deps)
# or, with the async streaming engine (use one of the two lines, not both):
dataprof = { version = "0.6", features = ["async-streaming"] }
```
```bash
uv pip install dataprof
# or
pip install dataprof
```
| Feature | Description |
|---|---|
| cli (default) | CLI binary with clap, colored output, progress bars |
| minimal | CSV-only, no CLI -- fastest compile |
| async-streaming | Async profiling engine with tokio |
| parquet-async | Profile Parquet files over HTTP |
| database | Database profiling (connection handling, retry, SSL) |
| postgres | PostgreSQL connector (includes database) |
| mysql | MySQL/MariaDB connector (includes database) |
| sqlite | SQLite connector (includes database) |
| all-db | All three database connectors |
| datafusion | DataFusion SQL engine integration |
| python | Python bindings via PyO3 |
| python-async | Async Python API (includes python + async-streaming) |
| full-cli | CLI + Parquet + all databases |
| production | PostgreSQL + MySQL (common deployment) |
| Format | Engine | Notes |
|---|---|---|
| CSV | Incremental, Columnar | Auto-detects `,` `;` `\|` `\t` delimiters |
| JSON | Incremental | Array-of-objects |
| JSONL / NDJSON | Incremental | One object per line |
| Parquet | Columnar | Reads metadata for schema/count without scanning rows |
| Database query | Async | PostgreSQL, MySQL, SQLite via connection string |
| pandas / polars DataFrame | Columnar | Python API only |
| Arrow RecordBatch | Columnar | Via PyCapsule (zero-copy) or Rust API |
| Async byte stream | Incremental | Any AsyncRead source (HTTP, WebSocket, etc.) |
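
The "Incremental" engine label in the table above suggests that rows are consumed one at a time instead of being loaded whole. As a minimal sketch of that idea (not dataprof's actual implementation), the following uses Welford's online algorithm to keep a running count, null count, mean, and variance for one CSV column in constant memory. The `RunningStats` class, the `profile_column` function, and the sample data are hypothetical names introduced for illustration, and the sketch assumes every non-empty cell is numeric.

```python
import csv
import io
import math


class RunningStats:
    """Constant-memory statistics for one numeric column (Welford's algorithm)."""

    def __init__(self):
        self.count = 0    # non-empty numeric values seen so far
        self.nulls = 0    # empty cells seen so far
        self.mean = 0.0
        self._m2 = 0.0    # running sum of squared deviations from the mean

    def update(self, cell):
        # Treat an empty string as a null; everything else must parse as a float.
        if cell == "":
            self.nulls += 1
            return
        x = float(cell)
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Sample variance; undefined for fewer than two values.
        return self._m2 / (self.count - 1) if self.count > 1 else math.nan


def profile_column(stream, column):
    """Read CSV rows one at a time and fold each cell into RunningStats."""
    stats = RunningStats()
    for row in csv.DictReader(stream):
        stats.update(row[column])
    return stats


# Hypothetical sample data; any text stream (file, socket wrapper) would work.
sample = io.StringIO("id,temp\n1,20.0\n2,\n3,22.0\n4,24.0\n")
stats = profile_column(sample, "temp")
print(stats.count, stats.nulls, stats.mean, stats.variance)
```

Because only four numbers are kept per column, memory use does not grow with the number of rows, which is the property that makes profiling larger-than-RAM inputs possible in principle.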
dataprof evaluates data quality against the five dimensions defined in ISO 8000-8 and ISO/IEC 25012:
| Dimension | What it measures |
|---|---|
| Completeness | Missing values ratio, complete records ratio, fully-null columns |
| Consistency | Data type consistency, format violations, encoding issues |
| Uniqueness | Duplicate rows, key uniqueness, high-cardinality warnings |
| Accuracy | Outlier ratio, range violations, negative values in positive-only columns |
| Timeliness | Future dates, stale data ratio, temporal ordering violations |
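
To make the Completeness row concrete, here is a small sketch of how its three metrics could be computed over rows held as dictionaries. This is an illustration under stated assumptions, not dataprof's code: the `completeness` function and its result keys are hypothetical, and a value is treated as missing when it is `None` or an empty string.

```python
def completeness(rows, columns):
    """Return the three completeness metrics for a list of dict rows.

    Assumption for this sketch: a value is missing when it is None or "".
    """

    def is_missing(value):
        return value is None or value == ""

    total_cells = len(rows) * len(columns)
    missing_cells = sum(is_missing(row.get(c)) for row in rows for c in columns)
    # A record is complete when none of the requested columns is missing.
    complete_rows = sum(
        all(not is_missing(row.get(c)) for c in columns) for row in rows
    )
    # A column is fully null when every row is missing a value for it.
    null_columns = [
        c for c in columns if rows and all(is_missing(row.get(c)) for row in rows)
    ]

    return {
        "missing_values_ratio": missing_cells / total_cells if total_cells else 0.0,
        "complete_records_ratio": complete_rows / len(rows) if rows else 0.0,
        "fully_null_columns": null_columns,
    }


# Hypothetical sample rows: one empty name and a column that is always null.
rows = [
    {"id": 1, "name": "a", "note": None},
    {"id": 2, "name": "", "note": None},
    {"id": 3, "name": "c", "note": None},
    {"id": 4, "name": "d", "note": None},
]
metrics = completeness(rows, ["id", "name", "note"])
print(metrics)
```

A fully-null column drags the complete-records ratio to zero on its own, which is one reason to report it as a separate metric.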
An overall quality score (0 -- 100) is computed as a weighted average of dimension scores.
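
The weighted average can be sketched as follows. The weights and dimension scores here are hypothetical values chosen only for illustration; the source does not state the weights dataprof uses, and `overall_quality` is not part of its API.

```python
def overall_quality(scores, weights):
    """Weighted average of per-dimension scores, each on a 0-100 scale."""
    total_weight = sum(weights[d] for d in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    # Dividing by the total weight keeps the result in 0-100 even if the
    # weights do not sum exactly to 1.
    return sum(scores[d] * weights[d] for d in scores) / total_weight


# Hypothetical weights and scores (assumptions, not dataprof defaults).
weights = {
    "completeness": 0.30,
    "consistency": 0.25,
    "uniqueness": 0.20,
    "accuracy": 0.15,
    "timeliness": 0.10,
}
scores = {
    "completeness": 90.0,
    "consistency": 80.0,
    "uniqueness": 100.0,
    "accuracy": 70.0,
    "timeliness": 60.0,
}
score = overall_quality(scores, weights)
print(f"Overall quality: {score:.1f}")
```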
profile(), report types, async, databases

dataprof is the subject of a paper submitted for peer review to IEEE ScalCom 2026:
A. Bozzo, "A Compiled Paradigm for Scalable and Sustainable Edge AI: Out-of-Core Execution and SIMD Acceleration in Telemetry Profiling," IEEE ScalCom 2026 (under review). [Repository & reproducible benchmarks]
The paper benchmarks dataprof against YData Profiling, Polars, and pandas across execution efficiency, memory scalability, energy consumption, and zero-copy interoperability in constrained Edge AI environments.
```bibtex
@inproceedings{bozzo2026compiled,
  author={Bozzo, Andrea},
  title={A Compiled Paradigm for Scalable and Sustainable Edge AI: Out-of-Core Execution and SIMD Acceleration in Telemetry Profiling},
  booktitle={2026 IEEE International Conference on Scalable Computing and Communications (ScalCom)},
  year={2026},
  note={Under review}
}
```
Dual-licensed under either the MIT License or the Apache License, Version 2.0, at your option.