Categorical Encoding Methods
A set of scikit-learn-style transformers for encoding categorical
variables as numeric features using a range of techniques.
Important Links
Documentation: http://contrib.scikit-learn.org/category_encoders/
Encoding Methods
Unsupervised:
- Backward Difference Contrast [2][3]
- BaseN [6]
- Binary [5]
- Gray [14]
- Count [10]
- Hashing [1]
- Helmert Contrast [2][3]
- Ordinal [2][3]
- One-Hot [2][3]
- Rank Hot [15]
- Polynomial Contrast [2][3]
- Sum Contrast [2][3]
Supervised:
- CatBoost [11]
- Generalized Linear Mixed Model [12]
- James-Stein Estimator [9]
- LeaveOneOut [4]
- M-estimator [7]
- Target Encoding [7]
- Weight of Evidence [8]
- Quantile Encoder [13]
- Summary Encoder [13]
Installation
The package requires numpy, statsmodels, and scipy.
To install the package, execute:
python setup.py install
or
pip install category_encoders
or
conda install -c conda-forge category_encoders
To install the development version, you may use:
pip install --upgrade git+https://github.com/scikit-learn-contrib/category_encoders
Usage
All of the encoders are fully compatible sklearn transformers, so they can be used in pipelines or in your existing
scripts. Supported input formats include numpy arrays and pandas dataframes. If the cols parameter isn't passed, all
columns with object or pandas categorical data type will be encoded. Please see the docs for transformer-specific
configuration options.
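The default column selection described above can be approximated with plain pandas. This is a sketch of the rule only, not the library's own code, and the DataFrame and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical data: one object column, one pandas categorical, one numeric.
X = pd.DataFrame({
    "city": pd.Series(["Oslo", "Lima", "Oslo"], dtype=object),
    "size": pd.Categorical(["S", "L", "M"]),
    "price": [1.5, 2.0, 3.25],
})

# Columns that would be encoded by default when `cols` is not given,
# assuming the rule is "object or pandas categorical dtype".
default_cols = X.select_dtypes(include=["object", "category"]).columns.tolist()
print(default_cols)
```

Passing cols explicitly is the safer choice when categorical values are stored as integers, since such columns would not match this rule.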
Examples
There are two types of encoders: unsupervised and supervised. An unsupervised example:
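from category_encoders import BinaryEncoder
import pandas as pd
# Note: load_boston was removed in scikit-learn 1.2, so this example needs an older release.
from sklearn.datasets import load_boston
# Prepare some data.
bunch = load_boston()
y = bunch.target
X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
# Use binary encoding to encode two categorical features.
enc = BinaryEncoder(cols=['CHAS', 'RAD']).fit(X)
# Transform the dataset.
numeric_dataset = enc.transform(X)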
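To give a feel for what binary encoding produces, here is a minimal pure-Python sketch. It assumes categories are numbered from 1 in order of first appearance and then written out as binary digits; the function name binary_encode is hypothetical, and the library's actual column layout and handling of unknown values may differ.

```python
def binary_encode(values):
    """Map each category to the binary digits of its ordinal code (sketch)."""
    # Assign ordinal codes 1, 2, 3, ... in order of first appearance.
    codes = {}
    for v in values:
        codes.setdefault(v, len(codes) + 1)
    # Number of bits needed to represent the largest code.
    width = max(len(codes).bit_length(), 1)
    # One row of 0/1 digits per input value, most significant bit first.
    return [[int(bit) for bit in format(codes[v], f"0{width}b")] for v in values]


rows = binary_encode(["red", "green", "blue", "green"])
for row in rows:
    print(row)
```

Three categories fit in two columns here, whereas one-hot encoding would need three; the saving grows with the number of categories.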
And a supervised example:
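from category_encoders import TargetEncoder
import pandas as pd
# Note: load_boston was removed in scikit-learn 1.2, so this example needs an older release.
from sklearn.datasets import load_boston
# Prepare some data and split it into training and testing parts.
bunch = load_boston()
y_train = bunch.target[0:250]
y_test = bunch.target[250:506]
X_train = pd.DataFrame(bunch.data[0:250], columns=bunch.feature_names)
X_test = pd.DataFrame(bunch.data[250:506], columns=bunch.feature_names)
# Use target encoding to encode two categorical features.
enc = TargetEncoder(cols=['CHAS', 'RAD'])
# Fit on the training data and transform both datasets.
training_numeric_dataset = enc.fit_transform(X_train, y_train)
testing_numeric_dataset = enc.transform(X_test)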
When transforming the training data with the supervised methods, use the fit_transform() method rather than fit().transform(), because the two can produce different results. The difference can be observed with the LeaveOneOut encoder, which performs a nested cross-validation on the training data in fit_transform() (to reduce over-fitting of the downstream model) but uses all of the training data for scoring in transform() (to get estimates that are as accurate as possible).
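The difference between the two paths can be illustrated with a small pandas sketch. It is not the library's implementation: the function names are hypothetical, and falling back to the global mean for categories seen only once is an assumption made here to avoid dividing by zero.

```python
import pandas as pd


def loo_fit_transform(cat, y):
    """Training-time encoding: mean target of the *other* rows in the same category."""
    grp = y.groupby(cat)
    total = grp.transform("sum")
    count = grp.transform("count")
    loo = (total - y) / (count - 1)
    # Assumption: categories with a single row fall back to the global mean.
    return loo.where(count > 1, y.mean())


def plain_transform(cat_train, y_train, cat_new):
    """Scoring-time encoding: per-category mean target over all training rows."""
    means = y_train.groupby(cat_train).mean()
    return cat_new.map(means)


cat = pd.Series(["a", "a", "a", "b", "b"])
y = pd.Series([1, 0, 1, 1, 0])

train_encoded = loo_fit_transform(cat, y)       # each row excludes its own target
score_encoded = plain_transform(cat, y, cat)    # every row uses the full category mean
print(train_encoded.tolist())
print(score_encoded.tolist())
```

In the first result a row never sees its own target value, which is why the two outputs differ even on identical input data.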
Furthermore, you may benefit from the following wrappers:
- PolynomialWrapper, which extends supervised encoders to support polynomial targets (targets with more than two classes)
- NestedCVWrapper, which helps to prevent overfitting
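The idea behind guarding against overfitting with nested cross-validation can be sketched as an out-of-fold target mean: each training row is encoded using statistics computed on the other folds only. The function below is a hypothetical illustration in pandas; it assumes a simple round-robin fold assignment and a global-mean fallback for unseen categories, and the wrapper's real splitting strategy may differ.

```python
import numpy as np
import pandas as pd


def out_of_fold_target_mean(cat, y, n_folds=2):
    """Encode each row with category means learned from the other folds only."""
    # Assumption: round-robin fold assignment (row i goes to fold i % n_folds).
    folds = np.arange(len(cat)) % n_folds
    encoded = pd.Series(np.nan, index=cat.index, dtype=float)
    for k in range(n_folds):
        held_out = folds == k
        # Category means computed without the held-out rows.
        means = y[~held_out].groupby(cat[~held_out]).mean()
        # Categories unseen in the other folds fall back to their global mean.
        fallback = y[~held_out].mean()
        encoded.loc[held_out] = cat[held_out].map(means).fillna(fallback)
    return encoded


cat = pd.Series(["a", "a", "b", "b", "a", "b"])
y = pd.Series([1, 0, 1, 1, 0, 0])
encoded = out_of_fold_target_mean(cat, y, n_folds=2)
print(encoded.tolist())
```

Because no row contributes to its own encoded value, the downstream model cannot simply memorise the target through the encoded feature.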
Additional examples and benchmarks can be found in the examples directory.
Contributing
Category encoders is under active development. If you'd like to be involved, we'd love to have you. Check out the CONTRIBUTING.md file or open an issue on the GitHub project to get started.
References
1. Kilian Weinberger; Anirban Dasgupta; John Langford; Alex Smola; Josh Attenberg (2009). Feature Hashing for Large Scale Multitask Learning. Proc. ICML.
2. Contrast Coding Systems for categorical variables. UCLA: Statistical Consulting Group. From https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/.
3. Gregory Carey (2003). Coding Categorical Variables. From http://psych.colorado.edu/~carey/Courses/PSYC5741/handouts/Coding%20Categorical%20Variables%202006-03-03.pdf
4. Owen Zhang - Leave One Out Encoding. From https://datascience.stackexchange.com/questions/10839/what-is-difference-between-one-hot-encoding-and-leave-one-out-encoding
5. Beyond One-Hot: an exploration of categorical variables. From http://www.willmcginnis.com/2015/11/29/beyond-one-hot-an-exploration-of-categorical-variables/
6. BaseN Encoding and Grid Search in categorical variables. From http://www.willmcginnis.com/2016/12/18/basen-encoding-grid-search-category_encoders/
7. Daniele Micci-Barreca (2001). A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems. SIGKDD Explor. Newsl. 3, 1. From http://dx.doi.org/10.1145/507533.507538
8. Weight of Evidence (WOE) and Information Value Explained. From https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html
9. Empirical Bayes for multiple sample sizes. From http://chris-said.io/2017/05/03/empirical-bayes-for-multiple-sample-sizes/
10. Simple Count or Frequency Encoding. From https://www.datacamp.com/community/tutorials/encoding-methodologies
11. Transforming categorical features to numerical features. From https://tech.yandex.com/catboost/doc/dg/concepts/algorithm-main-stages_cat-to-numberic-docpage/
12. Andrew Gelman and Jennifer Hill (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models. From https://faculty.psau.edu.sa/filedownload/doc-12-pdf-a1997d0d31f84d13c1cdc44ac39a8f2c-original.pdf
13. Carlos Mougan, David Masip, Jordi Nin and Oriol Pujol (2021). Quantile Encoder: Tackling High Cardinality Categorical Features in Regression Problems. Modeling Decisions for Artificial Intelligence, 2021. Springer International Publishing. From https://link.springer.com/chapter/10.1007%2F978-3-030-85529-1_14
14. Gray Encoding. From https://en.wikipedia.org/wiki/Gray_code
15. Jacob Buckman, Aurko Roy, Colin Raffel, Ian Goodfellow. Thermometer Encoding: One Hot Way To Resist Adversarial Examples. From https://openreview.net/forum?id=S18Su--CW
16. Carlos Mougan, Jose Alvarez, Salvatore Ruggieri, and Steffen Staab (2023). Fairness implications of encoding protected categorical attributes. In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, AIES '23. From https://arxiv.org/abs/2201.11358