# Data Validator Framework
A Python-based data validation framework for CSV and JSON data. This project provides a unified approach to validating data formats and contents using specialised validators built on a common foundation. The framework is easily extensible and leverages industry-standard libraries such as Pandas, Polars, and Pydantic.
## Overview
The Data Validator Framework is designed to simplify and standardise data validation tasks. It consists of:
- CSV Validator: Validates CSV files using either the Pandas or Polars engine. It checks for issues such as missing data, incorrect data types, invalid date formats, fixed column values, and duplicate entries.
- JSON Validator: Validates JSON objects against Pydantic models, ensuring the data conforms to the expected schema while providing detailed error messages.
- Common Validator Base: An abstract `BaseValidator` class that defines a standard interface and error management for all validators.
- Custom Errors: A set of custom error classes that offer precise and informative error reporting, helping to identify and resolve data issues efficiently.
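As an illustration of this pattern, a custom error hierarchy might look like the sketch below. The class names and messages here are assumptions for illustration, not the framework's actual errors:

```python
# Hypothetical error classes sketching the custom-error pattern;
# the framework's real class names and messages may differ.
class ValidationError(Exception):
    """Base class for all validation errors."""


class MissingDataError(ValidationError):
    """Raised when a column that must be complete contains missing values."""

    def __init__(self, column: str) -> None:
        super().__init__(f"column '{column}' contains missing data")
        self.column = column


try:
    raise MissingDataError("name")
except ValidationError as exc:
    message = str(exc)
```

Grouping errors under a shared base class lets callers catch every validation failure with a single `except` clause while still reporting precise, column-level detail.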
## Features
- **CSV Validation**
  - Supports both Pandas and Polars engines.
  - Reads multiple CSV files concurrently.
  - Validates data types, missing data, date formats, fixed values, and uniqueness constraints.
- **JSON Validation**
  - Uses Pydantic for schema validation.
  - Automatically converts JSON keys to strings to ensure compatibility.
  - Aggregates and formats error messages for clarity.
- **Extensible Architecture**
  - A unified abstract base class (`BaseValidator`) that standardises validation methods.
  - Customisable error handling with detailed messages.
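The base-class pattern can be sketched as follows. The method names and error-collection details here are assumptions for illustration, not the framework's actual interface:

```python
from abc import ABC, abstractmethod


# Minimal sketch of the base-class pattern; the real BaseValidator's
# interface and error handling may differ.
class BaseValidator(ABC):
    def __init__(self) -> None:
        self.errors: list[str] = []

    @abstractmethod
    def validate(self) -> None:
        """Run all checks, collecting messages into self.errors."""


class RangeValidator(BaseValidator):
    """Hypothetical subclass: checks that every value lies in [low, high]."""

    def __init__(self, values: list[float], low: float, high: float) -> None:
        super().__init__()
        self.values, self.low, self.high = values, low, high

    def validate(self) -> None:
        for v in self.values:
            if not (self.low <= v <= self.high):
                self.errors.append(f"{v} outside [{self.low}, {self.high}]")


validator = RangeValidator([1, 5, 42], low=0, high=10)
validator.validate()
```

Because every validator shares the same `validate()` entry point and error list, callers can treat CSV, JSON, and any custom validators interchangeably.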
## Requirements
- Python: 3.10 or above.
- Dependencies:
  - For CSV validation: `pandas` or `polars`.
  - For JSON validation: `pydantic`.
## Installation

Install the project's dependencies from the bundled `requirements.txt` (with pip) or from `uv.lock` (with `uv sync`).

---
## Usage

### CSV Validation Example
```python
from processor import CSVValidator

validator = CSVValidator(
    csv_paths=["data/file1.csv", "data/file2.csv"],
    data_types=["str", "int", "float"],
    column_names=["id", "name", "value"],
    unique_value_columns=["id"],
    columns_with_no_missing_data=["name"],
    missing_data_column_mapping={"value": ["NaN", "None"]},
    valid_column_values={"name": ["Alice", "Bob", "Charlie"]},
    drop_columns=["unused_column"],
    strict_validation=True,
)
validator.validate()
```
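Under the hood, checks of this kind can be expressed with plain pandas. The snippet below is a simplified sketch of the validations listed above (uniqueness, missing data, data types), not the framework's actual implementation:

```python
import pandas as pd

# Toy data illustrating the checks the CSV validator automates.
df = pd.DataFrame({
    "id": [1, 2, 2],
    "name": ["Alice", None, "Bob"],
    "value": [1.5, 2.0, 3.5],
})

# Uniqueness constraint (cf. unique_value_columns=["id"]).
duplicate_ids = int(df["id"].duplicated().sum())

# Missing-data check (cf. columns_with_no_missing_data=["name"]).
missing_names = int(df["name"].isna().sum())

# Data-type check (cf. data_types).
value_is_float = pd.api.types.is_float_dtype(df["value"])
```

The validator wraps checks like these behind a single `validate()` call and collects the failures into descriptive error messages.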
### JSON Validation Example
```python
from pydantic import BaseModel

from processor import JSONValidator


class UserModel(BaseModel):
    id: int
    name: str
    email: str


json_data = {
    "id": 123,
    "name": "Alice",
    "email": "alice@example.com",
}

validator = JSONValidator(model=UserModel, input_=json_data)
validator.validate()
```
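The key-to-string normalisation mentioned under Features can be illustrated with a small recursive helper. This is an assumption about the behaviour, not the validator's actual code:

```python
# Hypothetical helper mirroring the JSON validator's key normalisation;
# the framework's actual implementation may differ.
def stringify_keys(obj):
    """Recursively convert all dictionary keys to strings."""
    if isinstance(obj, dict):
        return {str(k): stringify_keys(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [stringify_keys(v) for v in obj]
    return obj


data = {1: "one", "nested": {2: ["two", {3: "three"}]}}
cleaned = stringify_keys(data)
```

Normalising keys up front avoids spurious schema failures when JSON produced by other tools uses non-string keys.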
---
## Project Structure
```
validator/
├── .gitignore
├── .python-version
├── CHANGELOG.md
├── CONTRIBUTING.md
├── LICENSE
├── pyproject.toml
├── uv.lock
├── requirements.txt
├── README.md
├── src/
│   └── validator/
│       ├── config/
│       │   ├── __init__.py
│       │   └── csv_.py
│       ├── __init__.py
│       ├── base.py
│       ├── errors.py
│       ├── README.md
│       ├── csv/
│       │   ├── __init__.py
│       │   ├── README.md
│       │   └── main.py
│       └── json/
│           ├── __init__.py
│           ├── README.md
│           └── main.py
└── tests/
    ├── __init__.py
    ├── config.py
    ├── integration/
    │   ├── __init__.py
    │   ├── test_integration_json.py
    │   └── test_integration_csv.py
    ├── unit/
    │   ├── __init__.py
    │   ├── test_csv.py
    │   └── test_json.py
    ├── csvs/
    └── jsons/
```
## Contributing
Contributions are welcome! Please adhere to standard code review practices and ensure your contributions are well tested and documented.
## Licence
This project is licensed under the MIT License. See the LICENSE file for details.
## For developers

To generate `requirements.txt`:

```shell
uv export --format requirements.txt --no-emit-project --no-emit-workspace --no-annotate --no-header --no-hashes --no-editable -o requirements.txt
```

To generate `CHANGELOG.md`:

```shell
uv run git-cliff -o CHANGELOG.md
```

To inspect the available version bumps:

```shell
uv run bump-my-version show-bump
```