datadock


Datadock is a PySpark-based data interoperability library. It automatically detects schemas from heterogeneous files (CSV, JSON, Parquet), groups them by structural similarity, and performs standardized batch reads. It is designed for pipelines that handle non-uniform, large-scale data, and it supports robust integration and reuse in distributed environments.

0.1.2

PyPI

Maintainers: 1

Datadock

Datadock is a Python library built on top of PySpark, designed to simplify data interoperability between files of different formats and schemas in modern data engineering pipelines.

It automatically detects schemas from CSV, JSON and Parquet files, groups structurally similar files, and allows standardized reading of all grouped files into a single Spark DataFrame — even in highly heterogeneous datasets.

✨ Key Features

🚀 Automatic parsing of multiple file formats: .csv, .json, .parquet
🧠 Schema-based file grouping by structural similarity
📊 Auto-selection of dominant schemas
🛠️ Unified read across similar files into a single PySpark DataFrame
🔍 Schema insight for diagnostics and inspection
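
To make the idea of schema-based grouping concrete, here is a minimal pure-Python sketch. It assumes that "structural similarity" can be approximated by files having the same set of column names regardless of order. That is an assumption for illustration, not Datadock's documented criterion. The function `group_by_columns` and the sample column listings are hypothetical.

```python
from collections import defaultdict


def group_by_columns(file_columns):
    """Group file names by their order-insensitive set of column names.

    file_columns: dict mapping file name -> list of column names.
    Returns a dict mapping a 1-based group id -> sorted list of file names.
    """
    buckets = defaultdict(list)
    for name, columns in file_columns.items():
        # frozenset ignores column order, so reordered files land together
        buckets[frozenset(columns)].append(name)

    # Number groups deterministically: most columns first, then by file name
    ordered = sorted(
        buckets.items(),
        key=lambda item: (-len(item[0]), sorted(item[1])[0]),
    )
    return {i: sorted(files) for i, (_, files) in enumerate(ordered, start=1)}


# Hypothetical column listings, for illustration only
example = {
    "sales_2020.csv": ["id", "amount", "date"],
    "sales_2021.csv": ["date", "id", "amount"],
    "products.json": ["sku", "name"],
}
groups = group_by_columns(example)
print(groups)
```

In this sketch the two sales files share a group because their columns differ only in order, while `products.json` gets a group of its own.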

🔧 Installation

pip install datadock

🗂️ Expected Input Structure

Place your data files (CSV, JSON or Parquet) inside a single folder. The library will automatically detect supported files and organize them by schema similarity.

/data/input/
├── sales_2020.csv
├── sales_2021.csv
├── products.json
├── archive.parquet
├── log.parquet
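
As a rough illustration of the file-discovery step, the sketch below lists the files in a folder whose extension is one of the three supported formats. The helper `list_supported_files` is hypothetical, and the choices to match extensions case-insensitively and to skip subfolders are assumptions. The source does not say how Datadock handles either case.

```python
import tempfile
from pathlib import Path

SUPPORTED_EXTENSIONS = {".csv", ".json", ".parquet"}


def list_supported_files(folder):
    """Return the sorted names of files in `folder` with a supported extension."""
    return sorted(
        entry.name
        for entry in Path(folder).iterdir()
        # Only top-level regular files are considered (no recursion)
        if entry.is_file() and entry.suffix.lower() in SUPPORTED_EXTENSIONS
    )


# Build a throwaway folder with empty placeholder files to try it out
with tempfile.TemporaryDirectory() as tmp:
    for name in ["sales_2020.csv", "products.json", "log.parquet", "notes.txt"]:
        (Path(tmp) / name).touch()
    found = list_supported_files(tmp)

print(found)
```

The unsupported `notes.txt` is filtered out; everything else is returned in alphabetical order.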

🧪 Usage Example

from datadock import scan_schema, get_schema_info, read_data

path = "/path/to/your/data"

# Logs schema groups detected
scan_schema(path)

# Retrieves schema metadata
info = get_schema_info(path)
print(info)

# Loads all files from schema group 1
df = read_data(path, schema_id=1, logs=True)
df.show()
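
If you want to load every schema group rather than just one, a loop over the metadata is one possible approach. The sketch below is an assumption-laden illustration: `read_every_group` is a hypothetical helper, and the two `fake_*` functions are stand-ins that only mimic the documented shapes so the loop can run without Spark or Datadock installed.

```python
def read_every_group(path, get_info, read):
    """Call `read` once per schema group reported by `get_info`.

    `get_info` and `read` are passed in so the loop can be exercised without
    Spark; in real use they would be get_schema_info and read_data.
    """
    frames = {}
    for group in get_info(path):
        schema_id = group["schema_id"]
        frames[schema_id] = read(path, schema_id=schema_id)
    return frames


# Stand-ins for illustration only; they are not part of datadock
def fake_get_info(path):
    return [
        {"schema_id": 1, "file_count": 2, "column_count": 3,
         "files": ["sales_2020.csv", "sales_2021.csv"]},
        {"schema_id": 2, "file_count": 1, "column_count": 2,
         "files": ["products.json"]},
    ]


def fake_read(path, schema_id):
    return f"DataFrame for group {schema_id}"


frames = read_every_group("/path/to/your/data", fake_get_info, fake_read)
print(frames)
```

With Datadock installed, the same helper could presumably be called as `read_every_group(path, get_schema_info, read_data)`, giving one Spark DataFrame per schema group.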

📌 Public API

`scan_schema`

Logs the schema groups detected in the specified folder.

`get_schema_info`

Returns a list of dictionaries containing:

schema_id: ID of the schema group
file_count: number of files in the group
column_count: number of columns in the schema
files: list of file names in the group
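
Because the result is a plain list of dictionaries, it can be inspected with ordinary Python. The sketch below uses a hypothetical sample shaped like the fields listed above; the values are made up, and `schema_id_for` and `summarize` are illustrative helpers, not part of Datadock.

```python
# Hypothetical sample shaped like the documented return value of
# get_schema_info; the file names and counts are made up for illustration.
info = [
    {"schema_id": 1, "file_count": 2, "column_count": 3,
     "files": ["sales_2020.csv", "sales_2021.csv"]},
    {"schema_id": 2, "file_count": 1, "column_count": 2,
     "files": ["products.json"]},
]


def schema_id_for(info, file_name):
    """Return the schema_id of the group containing file_name, or None."""
    for group in info:
        if file_name in group["files"]:
            return group["schema_id"]
    return None


def summarize(info):
    """Build one human-readable line per schema group."""
    return [
        f"schema {g['schema_id']}: {g['file_count']} file(s), "
        f"{g['column_count']} column(s)"
        for g in info
    ]


print(schema_id_for(info, "products.json"))
for line in summarize(info):
    print(line)
```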

`read_data`

Reads and merges all files that share the same schema.
If schema_id is not specified, the group with the most columns will be selected.
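
The default rule described above (pick the group with the most columns) can be sketched against the metadata format returned by `get_schema_info`. The function `pick_default_schema_id` is hypothetical and the sample data is made up. How Datadock breaks ties between groups with equal column counts is not stated in this document, so the tie-breaking here (first group wins) is an assumption.

```python
def pick_default_schema_id(info):
    """Return the schema_id of the group with the most columns.

    On a tie, max() keeps the first group encountered; this tie-breaking
    is an assumption, not documented datadock behavior.
    """
    if not info:
        raise ValueError("no schema groups to choose from")
    widest = max(info, key=lambda group: group["column_count"])
    return widest["schema_id"]


# Hypothetical metadata, for illustration only
info = [
    {"schema_id": 1, "file_count": 2, "column_count": 3,
     "files": ["sales_2020.csv", "sales_2021.csv"]},
    {"schema_id": 2, "file_count": 3, "column_count": 5,
     "files": ["archive.parquet", "log.parquet", "events.parquet"]},
]
default_id = pick_default_schema_id(info)
print(default_id)
```

Note that the rule is based on column count, not file count, so a group with few files can still be the default if its schema is the widest.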

✅ Requirements

Python 3.10+
PySpark

📚 Motivation

In real-world data engineering workflows, it's common to deal with files that represent the same data domain but have slight structural variations — such as missing columns, different orders, or evolving schemas.
Datadock automates the process of grouping, inspecting, and reading these files reliably, allowing you to build pipelines that are schema-aware, scalable, and format-agnostic.

📄 License

This project is licensed under the MIT License.
