pandas-dq is the ultimate data quality toolkit for pandas dataframes.

pandas_dq is a new Python library for data quality analysis and improvement. It is fast, efficient, and scalable.
The new pandas_dq library in Python is a great addition to the pandas ecosystem. It provides a set of tools for data quality assessment, which can be used to identify and address potential problems with data sets. This can help improve the quality of data analysis and ensure that results are reliable.

The pandas_dq library is still under development, but it already includes a number of useful features.

The pandas_dq library is a valuable tool for anyone who works with data. It can help you improve the quality of your data analysis and ensure that your results are reliable.
Here are some of the benefits of using the pandas_dq library: among other things, it can be used directly in scikit-learn pipelines.

Alert: If you are using pandas version 2.0 ("the new pandas"), beware that weird errors are popping up in all kinds of libraries that use pandas underneath. Our pandas_dq library is no exception. So if you plan to use pandas_dq with pandas version 2.0, beware that you may see weird errors, and we can't and won't fix them!
pandas_dq is designed to provide you the cleanest features with the fewest steps. It has three main modules: `dq_report`, `Fix_DQ`, and now `DataSchemaChecker`.
`dq_report` displays a data quality report (inline or HTML) after analyzing your dataset for common data quality issues.
`dc_report` is a data comparison tool that accepts two pandas dataframes as input and returns a report highlighting any differences between them.
`Fix_DQ` is a great way to clean an entire train data set and apply the same steps in an MLOps pipeline to a test dataset. `Fix_DQ` can detect most issues in your data (similar to `dq_report` but without the `target`-related issues) in one step. It detects those issues during the `fit` method and fixes them in the `transform` method. The fitted transformer can then be saved (or "pickled") for applying the same steps on test data, either at the same time or later.
`Fix_DQ` performs a series of data quality cleaning steps on your data.
How can we use `Fix_DQ` in GridSearchCV to find the best model pipeline?
This is another way to find the best data cleaning steps for your train data and then use the cleaned data for hyperparameter tuning with GridSearchCV or RandomizedSearchCV, along with a LightGBM, XGBoost, or scikit-learn model. A minimal sketch follows.
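A minimal sketch of that approach (not from the library's docs), assuming a classification task where `X_train` (a pandas DataFrame) and `y_train` already exist, and using a plain scikit-learn model in place of LightGBM or XGBoost; the grid values are illustrative:

```python
from pandas_dq import Fix_DQ
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Put Fix_DQ first so each CV fold cleans the data using only that fold's training split.
pipe = Pipeline([
    ("dq", Fix_DQ(quantile=0.75, rare_threshold=0.05)),
    ("model", RandomForestClassifier(random_state=0)),
])

# Illustrative grid; prefix each parameter with the pipeline step name.
param_grid = {
    "model__n_estimators": [100, 300],
    "model__max_depth": [None, 10],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)   # X_train should be a pandas DataFrame (see the Fix_DQ caution below)
print(search.best_params_, search.best_score_)
```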
The DataSchemaChecker class has two methods: fit and transform. You initialize the class with a schema that you want to compare your data's dtypes against. A schema is a dictionary that maps column names to data types.
The fit method takes a dataframe as an argument and checks whether it matches the schema. It first checks if the number of columns in the dataframe and the schema are equal; if not, it records an exception. Finally, the fit method displays a table of the exceptions it found in your data against the given schema.
The transform method takes a dataframe as an argument and, based on the given schema and the exceptions, converts each flagged column to the data type in the schema. If it is not able to convert a column, it skips that column and displays an error message.
Prerequisites:
pip install pandas_dq
To install from source:
cd <pandas_dq_Destination>
git clone git@github.com:AutoViML/pandas_dq.git
or download and unzip https://github.com/AutoViML/pandas_dq/archive/master.zip
conda create -n <your_env_name> python=3.7 anaconda
conda activate <your_env_name> # ON WINDOWS: `source activate <your_env_name>`
cd pandas_dq
pip install -r requirements.txt
from pandas_dq import dq_report
dqr = dq_report(data, target=target, html=False, csv_engine="pandas", verbose=1)
It displays a data quality report inline or in HTML format (and it saves the HTML file to your machine).
`dc_report` is a data comparison tool that accepts two pandas dataframes as input and returns a report highlighting any differences between them. It can also provide the report in HTML format.
from pandas_dq import dc_report
dcr = dc_report(train, test, exclude=[], html=True, verbose=1)  # avoid naming the result "dc_report", which would shadow the imported function
Fix_DQ as a scikit-learn compatible transformer
from pandas_dq import Fix_DQ
# Create an instance of the fix_data_quality transformer with default parameters
fdq = Fix_DQ()
# Fit the transformer on X_train and transform it
X_train_transformed = fdq.fit_transform(X_train)
# Transform X_test using the fitted transformer
X_test_transformed = fdq.transform(X_test)
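Since the fitted transformer can be saved (or "pickled"), here is a minimal sketch of persisting the `fdq` instance fitted above and reusing it later; the file name is just an example:

```python
import pickle

# Save the fitted Fix_DQ transformer so the exact same cleaning steps can be replayed later
with open("fix_dq.pkl", "wb") as f:
    pickle.dump(fdq, f)

# ...later, e.g. in an MLOps scoring job: load it and transform new or test data
with open("fix_dq.pkl", "rb") as f:
    fdq_loaded = pickle.load(f)
X_test_clean = fdq_loaded.transform(X_test)
```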
Once you define the schema as below, you can use it as follows:
schema = {'name': 'string',
'age': 'float32',
'gender': 'object',
'income': 'float64',
'date': 'date',
'target': 'integer'}
from pandas_dq import DataSchemaChecker
ds = DataSchemaChecker(schema=schema)
ds.fit_transform(X_train)  # check X_train against the schema and convert its dtypes
ds.transform(X_test)       # apply the same schema conversions to X_test
pandas_dq has a very simple API with one major goal: find data quality issues in your data and fix them.
Arguments
`dq_report` has the following arguments.

Caution: For very large data sets, we randomly sample 100K rows from your CSV file to speed up reporting. If you want a larger sample, simply read your file offline into a pandas dataframe and send it in as input, and we will load it as it is. This is one way to work around our speed limitations:
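For example, a minimal sketch of that workaround (the file path and target column name are hypothetical):

```python
import pandas as pd
from pandas_dq import dq_report

# Read the full file yourself; a dataframe passed in is loaded as-is, with no 100K-row sampling.
df = pd.read_csv("my_large_file.csv")
dqr = dq_report(df, target="target", html=False, verbose=1)
```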
- `data`: You can provide any kind of file format (string) or even a pandas DataFrame (df). It reads parquet, csv, feather, arrow, all kinds of file formats straight from disk. You just have to tell it the path to the file and the name of the file.
- `target`: default `None`. Otherwise, it should be a string representing the name of a column in df. You can leave it as `None` if you don't want any target-related issues.
- `html`: default is `False`. If you want to display your report in HTML in a browser, set it to `True`. Otherwise, it defaults to inline in a notebook or prints on the terminal. It also saves the HTML file to your working directory on your machine.
- `csv_engine`: default is `pandas`. If you want to load your CSV file using any other backend engine such as `arrow` or `parquet`, please specify it here. This option only impacts CSV files.
- `verbose`: This has 2 possible states:
  - `0`: summary report. Displays only the summary-level data quality issues in the dataset. Great for managers.
  - `1`: detailed report. Displays all the gory details behind each DQ issue in your dataset and what to do about them. Great for engineers.

`dq_report` returns:
- `dataframe`: If `verbose=1`, it returns a dataframe with detailed data quality issues in your data. If `verbose=0`, it returns a dataframe containing only the highlights of the data quality issues.
`dc_report` returns a dataframe highlighting differences between two dataframes, typically train and test. It has the following inputs and outputs:

- `train`: a dataframe
- `test`: a dataframe
- `exclude`: an empty list or a list of columns that you want to exclude from comparison in both dataframes
- `html`: return an HTML file containing the differences between the two dataframes
- `verbose`: `0` will return just the highlights of differences. `1` will return a detailed description of differences between the two dataframes.
- `dataframe`: If `verbose=1`, it returns a dataframe with the following column names: Column Name, Data Type Train, Data Type Test, Missing Values% Train, Missing Values% Test, Unique Values% Train, Unique Values% Test, Minimum Value Train, Minimum Value Test, Maximum Value Train, Maximum Value Test, DQ Issue Train, DQ Issue Test, Distribution Difference. If `verbose=0`, it will return only the following columns: Column Name, DQ Issue Train, DQ Issue Test, Distribution Difference.
`Fix_DQ` is a scikit-learn transformer. It finds and fixes data quality issues in your data. A short sketch that sets these parameters follows the list.

Caution: X_train and X_test in Fix_DQ must be pandas DataFrames or pandas Series. I have not tested it on numpy arrays. You can try your luck.

- `X_train`: a pandas dataframe
- `X_test`: a pandas dataframe
- `quantile`: float (0.75): Define a threshold for IQR for outlier detection. Could be any float between 0 and 1. If quantile is set to `None`, then no outlier detection will take place.
- `cat_fill_value`: string ("missing") or a dictionary: Define a fill value for missing categories in your object or categorical variables. This is a global default for your entire dataset. You can also give a dictionary where you specify different fill values for different columns.
- `num_fill_value`: integer (99) or float value (999.0) or a dictionary: Define a fill value for missing numbers in your integer or float variables. This is a global default for your entire dataset. You can also give a dictionary where you specify different fill values for different columns.
- `rare_threshold`: float (0.05): Define a threshold for rare categories. If a certain category in a column makes up less than, say, 5% (0.05) of samples, then it will be considered rare. All rare categories in that column will be merged under a new category named "Rare".
- `correlation_threshold`: float (0.8): Define a correlation limit. If two variables are correlated above this limit, one of them will be dropped. The program will tell you which variable is being dropped. You can switch the order of variables in your dataset if you want one or the other dropped.
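As mentioned above, a short sketch that sets these parameters explicitly, including per-column fill dictionaries; the column names are borrowed from the schema example earlier and are illustrative, and the parameter values are not recommendations:

```python
from pandas_dq import Fix_DQ

fdq = Fix_DQ(
    quantile=0.9,                               # wider IQR band before values count as outliers
    cat_fill_value={"gender": "unknown"},       # per-column fill for missing categorical values
    num_fill_value={"age": -1, "income": 0.0},  # per-column fill for missing numeric values
    rare_threshold=0.01,                        # categories under 1% of rows get merged into "Rare"
    correlation_threshold=0.9,                  # drop one of any pair of features correlated above 0.9
)

X_train_clean = fdq.fit_transform(X_train)  # learn and apply the cleaning steps on train
X_test_clean = fdq.transform(X_test)        # replay the same steps on test
```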
`DataSchemaChecker` is a scikit-learn transformer. It checks your data against a given schema.

- `schema`: dictionary. A schema (dict) maps column names to data types. This schema determines the data types that the dataframe you pass in should comply with.

DataSchemaChecker has two methods:

- `fit` method: checks if the given dataframe matches the schema and displays a table of errors, if any.
- `transform` method: transforms the given dataframe's dtypes to the given schema and displays errors, if any.

PRs accepted.
Apache License 2.0 © 2020 Ram Seshadri
This library would not have been possible without the help of ChatGPT and Bard. This library is dedicated to the thousands of people who worked to create LLMs.
This project is not an official Google project. It is not supported by Google and Google specifically disclaims all warranties as to its quality, merchantability, or fitness for a particular purpose.