# DriftShield
DriftShield is a Python package for detecting and handling data drift in machine learning pipelines. It compares the distributions of numeric and non-numeric data between training and scoring datasets, flags drift, and can replace problematic values with predefined defaults. With built-in outlier handling and statistical tests, DriftShield keeps your data consistent and guards against performance degradation caused by unseen data changes.
## Features
- Detects data drift in non-numeric, numeric, and boolean columns.
- Handles outliers when calculating means for numeric data.
- Compares 25th, 50th, and 75th percentiles for numeric columns.
- Tracks changes in proportions for boolean columns.
- Provides mechanisms to replace drifted values with default values.
- Customizable exclusion of columns from drift detection.
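The percentile comparison for numeric columns can be illustrated with a small, self-contained sketch. This is not DriftShield's internal code; the function name and the 10% tolerance are assumptions for illustration:

```python
import pandas as pd

def percentile_drift(train: pd.Series, score: pd.Series, tol: float = 0.1) -> bool:
    """Return True when any of the 25th/50th/75th percentiles moves by
    more than `tol` relative to training (illustrative criterion only)."""
    drifted = False
    for q in (0.25, 0.50, 0.75):
        base, new = train.quantile(q), score.quantile(q)
        if base != 0 and abs(new - base) / abs(base) > tol:
            drifted = True
    return drifted

print(percentile_drift(pd.Series([1, 2, 3, 4, 5]),
                       pd.Series([10, 20, 30, 40, 50])))  # True: every quartile shifted 10x
```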
## Installation

To install DriftShield, clone the repository and install it with pip:
```shell
git clone <>
cd driftshield
pip install .
```
Alternatively, once the package is published, you can install it directly from PyPI:

```shell
pip install drift_shield
```
## Usage
DriftShield can be used to monitor and handle drift between training and scoring datasets. Here's a quick guide on how to use it:
### 1. Import the package

```python
from drift_shield import data_drift, handle_data_drift
```
### 2. Detect Data Drift

In training mode, `data_drift` stores distinct values and statistics for numeric/boolean columns:

```python
data_drift('my_dataset', 'training', training_df, './buffer_dir', exclusions=['column_to_exclude'])
```

In scoring mode, it compares the stored statistics against the new data to detect drift:

```python
data_drift('my_dataset', 'scoring', scoring_df, './buffer_dir', exclusions=['column_to_exclude'])
```
### 3. Handle Drift

If drift is detected, you can replace drifted values with values from a default DataFrame:

```python
updated_df = handle_data_drift('my_dataset', scoring_df, './buffer_dir', default_replacements_df, exclusions=['column_to_exclude'])
```
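The structure of `default_replacements_df` shown here is an assumption, since the README does not spell it out: a single-row DataFrame holding one fallback value per monitored column.

```python
import pandas as pd

# Hypothetical defaults: one fallback value per monitored column.
# The exact shape handle_data_drift expects may differ; this is an
# assumed single-row layout for illustration.
default_replacements_df = pd.DataFrame({
    "country": ["unknown"],
    "age": [0],
    "is_active": [False],
})
print(default_replacements_df.to_dict(orient="records"))  # one record of fallback values
```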
### 4. Delete Drift Dump

To remove a stored drift file if you need to reset or rerun:

```python
from drift_shield import delete_drift_dump

delete_drift_dump('my_dataset', './buffer_dir', type='data_drift')
```
### 5. Feature Importance Drift

This function tracks feature importance drift between the training and scoring phases to detect any changes in feature importance.

```python
from drift_shield import feature_importance_drift
```

In training mode, it stores feature importances and column names in a JSON dump:

```python
feature_importance_drift('my_dataset', 'training', model, df_training, './buffer_dir', target_column='target')
```

In scoring mode, it compares the stored training importances against those computed on the scoring data to detect any significant drift. It supports models such as RandomForest, XGBoost, and LinearRegression using SHAP values.

```python
feature_importance_drift('my_dataset', 'scoring', model, df_scoring, './buffer_dir')
```
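The scoring-time comparison can be sketched with plain NumPy: normalize the two importance vectors and flag features whose share shifts beyond a threshold. The function name, threshold, and criterion below are illustrative assumptions, not the package's actual logic:

```python
import numpy as np

def importance_drift(train_imp, score_imp, names, threshold=0.1):
    """Flag features whose normalized importance share moves by more than
    `threshold` between training and scoring (illustrative criterion)."""
    t = np.asarray(train_imp, dtype=float)
    s = np.asarray(score_imp, dtype=float)
    t, s = t / t.sum(), s / s.sum()  # compare shares, not raw magnitudes
    return [n for n, d in zip(names, np.abs(s - t)) if d > threshold]

print(importance_drift([0.5, 0.3, 0.2], [0.2, 0.3, 0.5], ["a", "b", "c"]))  # ['a', 'c']
```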
### 6. Monitor Data Volume Over Time

This function tracks the data volume over successive scoring phases and logs any significant changes based on a specified threshold (default is 20%):

```python
monitor_data_volume_over_time('my_dataset', df_scoring, './buffer_dir', threshold=0.2)
```
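The threshold check itself is simple relative arithmetic. A minimal sketch of the comparison (the function name is illustrative, not the package's internals):

```python
def volume_change_exceeds(prev_rows: int, curr_rows: int, threshold: float = 0.2) -> bool:
    """Return True when the row count changes by more than `threshold`
    relative to the previous scoring run (mirrors the 20% default)."""
    if prev_rows == 0:
        return curr_rows > 0
    return abs(curr_rows - prev_rows) / prev_rows > threshold

print(volume_change_exceeds(1000, 1300))  # True: 30% growth exceeds the 20% threshold
print(volume_change_exceeds(1000, 1100))  # False: 10% growth is within bounds
```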
## Example Workflow

1. Training phase: store distinct values and statistics.

   ```python
   data_drift('my_training_data', 'training', training_df, './buffer')
   ```

2. Scoring phase: compare scoring data against the training statistics.

   ```python
   data_drift('my_training_data', 'scoring', scoring_df, './buffer')
   ```

3. Handling drift: replace drifted values with defaults.

   ```python
   updated_df = handle_data_drift('my_training_data', scoring_df, './buffer', default_replacements_df)
   ```
## Contributing

To make changes to this package:

1. Clone the repository, make your changes, and update `requirements.txt` and `setup.py`; test and validate the result.
2. Increment the version number in `setup.py`.
3. From the root folder of the package, run `pip install .`.
4. Install the publishing tool: `pip install twine`.
5. Build the distributions: `python setup.py sdist bdist_wheel`.
6. Run `twine upload dist/*` and provide your PyPI credentials.
7. Push the changes to the git repo.