# DriftShield
DriftShield is a Python package for detecting and handling data drift in machine learning pipelines. It compares the distributions of numeric and non-numeric data between training and scoring datasets, flags drift, and can replace problematic values with predefined defaults. With built-in outlier handling and statistical tests, DriftShield keeps your data consistent and guards against performance degradation caused by unseen data changes.
## Features
- Detects data drift in non-numeric, numeric, and boolean columns.
- Handles outliers when calculating means for numeric data.
- Compares 25th, 50th, and 75th percentiles for numeric columns.
- Tracks changes in proportions for boolean columns.
- Provides mechanisms to replace drifted values with default values.
- Customizable exclusion of columns from drift detection.
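The percentile comparison for numeric columns can be illustrated with a small, self-contained sketch. This is not DriftShield's internal code; the function name and the 10% tolerance are assumptions for illustration:

```python
import pandas as pd

def percentile_drift(train: pd.Series, score: pd.Series, tol: float = 0.1) -> bool:
    """Return True when any of the 25th/50th/75th percentiles moves by
    more than `tol` relative to training (illustrative criterion only)."""
    drifted = False
    for q in (0.25, 0.50, 0.75):
        base, new = train.quantile(q), score.quantile(q)
        if base != 0 and abs(new - base) / abs(base) > tol:
            drifted = True
    return drifted

print(percentile_drift(pd.Series([1, 2, 3, 4, 5]),
                       pd.Series([10, 20, 30, 40, 50])))  # True: every quartile shifted 10x
```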
## Installation

To install DriftShield, clone the repository and install it with pip:
```shell
git clone <>
cd driftshield
pip install .
```
Alternatively, once the package is published, you can install it directly from PyPI:

```shell
pip install drift_shield
```
## Usage
DriftShield can be used to monitor and handle drift between training and scoring datasets. Here's a quick guide on how to use it:
### 1. Import the package

```python
from drift_shield import data_drift, handle_data_drift
```
### 2. Detect Data Drift

In training mode, `data_drift` stores distinct values and statistics for numeric/boolean columns:

```python
data_drift('my_dataset', 'training', training_df, './buffer_dir', exclusions=['column_to_exclude'])
```

In scoring mode, it compares the stored statistics against the new data to detect drift:

```python
data_drift('my_dataset', 'scoring', scoring_df, './buffer_dir', exclusions=['column_to_exclude'])
```
### 3. Handle Drift

If drift is detected, you can replace drifted values with values from a default DataFrame:

```python
updated_df = handle_data_drift('my_dataset', scoring_df, './buffer_dir', default_replacements_df, exclusions=['column_to_exclude'])
```
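The structure of `default_replacements_df` shown here is an assumption, since the README does not spell it out: a single-row DataFrame holding one fallback value per monitored column.

```python
import pandas as pd

# Hypothetical defaults: one fallback value per monitored column.
# The exact shape handle_data_drift expects may differ; this is an
# assumed single-row layout for illustration.
default_replacements_df = pd.DataFrame({
    "country": ["unknown"],
    "age": [0],
    "is_active": [False],
})
print(default_replacements_df.to_dict(orient="records"))  # one record of fallback values
```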
### 4. Delete Drift Dump

To remove a stored drift file if you need to reset or rerun:

```python
from drift_shield import delete_drift_dump

delete_drift_dump('my_dataset', './buffer_dir', type='data_drift')
```
### 5. Feature Importance Drift

This function tracks feature importance drift between the training and scoring phases to detect any changes in feature importance.

```python
from drift_shield import feature_importance_drift
```

In training mode, it stores feature importances and column names in a JSON dump:

```python
feature_importance_drift('my_dataset', 'training', model, df_training, './buffer_dir', target_column='target')
```

In scoring mode, it compares the stored training importances against those computed on the scoring data to detect any significant drift. It supports models such as RandomForest, XGBoost, and LinearRegression using SHAP values.

```python
feature_importance_drift('my_dataset', 'scoring', model, df_scoring, './buffer_dir')
```
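The scoring-time comparison can be sketched with plain NumPy: normalize the two importance vectors and flag features whose share shifts beyond a threshold. The function name, threshold, and criterion below are illustrative assumptions, not the package's actual logic:

```python
import numpy as np

def importance_drift(train_imp, score_imp, names, threshold=0.1):
    """Flag features whose normalized importance share moves by more than
    `threshold` between training and scoring (illustrative criterion)."""
    t = np.asarray(train_imp, dtype=float)
    s = np.asarray(score_imp, dtype=float)
    t, s = t / t.sum(), s / s.sum()  # compare shares, not raw magnitudes
    return [n for n, d in zip(names, np.abs(s - t)) if d > threshold]

print(importance_drift([0.5, 0.3, 0.2], [0.2, 0.3, 0.5], ["a", "b", "c"]))  # ['a', 'c']
```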
### 6. Monitor Data Volume Over Time

This function tracks the data volume over successive scoring phases and logs any significant changes based on a specified threshold (default is 20%):

```python
monitor_data_volume_over_time('my_dataset', df_scoring, './buffer_dir', threshold=0.2)
```
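The threshold check itself is simple relative arithmetic. A minimal sketch of the comparison (the function name is illustrative, not the package's internals):

```python
def volume_change_exceeds(prev_rows: int, curr_rows: int, threshold: float = 0.2) -> bool:
    """Return True when the row count changes by more than `threshold`
    relative to the previous scoring run (mirrors the 20% default)."""
    if prev_rows == 0:
        return curr_rows > 0
    return abs(curr_rows - prev_rows) / prev_rows > threshold

print(volume_change_exceeds(1000, 1300))  # True: 30% growth exceeds the 20% threshold
print(volume_change_exceeds(1000, 1100))  # False: 10% growth is within bounds
```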
## Example Workflow

1. Training phase: store distinct values and statistics.

   ```python
   data_drift('my_training_data', 'training', training_df, './buffer')
   ```

2. Scoring phase: compare scoring data against the training statistics.

   ```python
   data_drift('my_training_data', 'scoring', scoring_df, './buffer')
   ```

3. Handling drift: replace drifted values with defaults.

   ```python
   updated_df = handle_data_drift('my_training_data', scoring_df, './buffer', default_replacements_df)
   ```
## Contributing

To make changes to this package:

1. Clone the repository, make your changes, and update `requirements.txt` and `setup.py`; test and validate the result.
2. Increment the version number in `setup.py`.
3. From the root folder of the package, run `pip install .`.
4. Install the publishing tool: `pip install twine`.
5. Build the distributions: `python setup.py sdist bdist_wheel`.
6. Run `twine upload dist/*` and provide your PyPI credentials.
7. Push the changes to the git repo.