Credito Emiliano - Feature Selection, Transformation and Elimination (CE - FeSTE)
This repo contains the 'FeSTE' Python package, which supports feature management from pre-filtering through pre-processing to feature elimination.
Installation
To install it:
- Optional: create and activate a new Python virtual environment (from a bash terminal, run "py -m venv your_env_name" and then "source your_env_name/Scripts/activate")
- Install the package:
  - User Mode:
    pip install cefeste
Structure
The .py package is stored in src and contains 3 sub-modules:
- selection: contains the feature preliminary selection functions
- transform: contains the feature pre-processing functions
- elimination: contains the feature elimination functions
Filters
Selection
The main class of this module is FeatureSelection. It applies several filters, which fall into three groups:
- Univariate filters:
  - Constant features
  - Number of distinct values too low
  - Number of missing values too high
  - Too concentrated in the most frequent value
  - Unstable between sets
- Multivariate filters:
  - Spearman correlation for numerical features
  - Cramér's V for categorical features
  - R2 for mixed features
  - VIF
- Explanatory filters:
  - Feature AUROC for classification
  - Feature correlation with the target for regression
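To make the filter groups above more concrete, here is a minimal sketch of a few of them (missing values, constant features, concentration, and Spearman correlation) written with plain pandas. It does not use the FeatureSelection API: the function names, thresholds, and sample data are all hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd


def univariate_drop_reasons(df, max_missing=0.5, max_concentration=0.95, min_distinct=2):
    """Return {column: reason} for columns failing a simple univariate filter."""
    reasons = {}
    for col in df.columns:
        s = df[col]
        non_null = s.dropna()
        if s.isna().mean() > max_missing:
            reasons[col] = "too many missing values"
        elif non_null.nunique() < min_distinct:
            reasons[col] = "constant or too few distinct values"
        elif non_null.value_counts(normalize=True).iloc[0] > max_concentration:
            reasons[col] = "too concentrated in the most frequent value"
    return reasons


def correlated_to_drop(df, threshold=0.9):
    """Among numerical columns, flag the later column of each highly correlated pair."""
    corr = df.select_dtypes("number").corr(method="spearman").abs()
    cols = list(corr.columns)
    drop = []
    for i, a in enumerate(cols):
        if a in drop:
            continue
        for b in cols[i + 1:]:
            if b not in drop and corr.loc[a, b] > threshold:
                drop.append(b)
    return drop


# Hypothetical toy data: one constant column, one mostly missing column,
# and one column that is a monotonic function of another.
df = pd.DataFrame({
    "const": [1, 1, 1, 1, 1, 1],
    "mostly_missing": [np.nan, np.nan, np.nan, np.nan, 2.0, 3.0],
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x_squared": [1.0, 4.0, 9.0, 16.0, 25.0, 36.0],
    "noise": [3.0, 1.0, 6.0, 2.0, 5.0, 4.0],
})

reasons = univariate_drop_reasons(df)
kept = df.drop(columns=list(reasons))
redundant = correlated_to_drop(kept)
print(reasons)
print(redundant)
```

In this sketch the univariate filters run first, so the correlation check only compares features that already passed them. Which feature of a correlated pair to keep is a design choice; here the earlier column wins simply for determinism.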
Transformation
This is a more technical module. It contains 3 classes that are useful for building the production pipeline:
- ColumnExtractor: to extract columns from a pd.DataFrame
- ColumnRenamer: to rename columns and to transform a np.ndarray to a pd.DataFrame
- Categorizer: to transform the dtype of pd.DataFrame columns from 'object' to 'category'
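The behaviour described for the three classes can be sketched as follows. These are not the package's classes: the `...Sketch` names, constructor arguments, and fit/transform layout are assumptions made for illustration, using only pandas and NumPy.

```python
import numpy as np
import pandas as pd


class ColumnExtractorSketch:
    """Keep only the requested columns of a DataFrame."""

    def __init__(self, cols):
        self.cols = list(cols)

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.cols]


class ColumnRenamerSketch:
    """Wrap an array in a DataFrame, or rename an existing one, with given names."""

    def __init__(self, col_names):
        self.col_names = list(col_names)

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if isinstance(X, np.ndarray):
            return pd.DataFrame(X, columns=self.col_names)
        out = X.copy()
        out.columns = self.col_names
        return out


class CategorizerSketch:
    """Cast 'object' columns to the 'category' dtype, leaving the rest untouched."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        out = X.copy()
        for col in out.select_dtypes(include="object").columns:
            out[col] = out[col].astype("category")
        return out


# Hypothetical usage: an array coming out of a previous step gets its names back.
raw = np.array([[1.0, 10.0], [2.0, 20.0]])
named = ColumnRenamerSketch(["a", "b"]).fit(raw).transform(raw)
only_b = ColumnExtractorSketch(["b"]).fit(named).transform(named)

people = pd.DataFrame({
    "city": pd.Series(["Rome", "Milan", "Rome"], dtype=object),
    "age": [30, 41, 25],
})
typed = CategorizerSketch().fit(people).transform(people)
print(typed.dtypes)
```

The fit/transform shape is used here because it is the usual convention for pipeline steps; whether the real classes follow it exactly is not stated in this document.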
Elimination
The main class of this module is FeatureElimination, which selects the most useful features to keep in the model while optimizing the hyperparameters at the same time.
It is a recursive method that, at each iteration, can:
- Optimize the hyperparameters using a user-defined model, grid, grid-search method, and evaluation measure
- Calculate the SHAP importance value of each feature
- Identify the least important feature(s) and delete them for the next iteration
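The loop described above can be sketched with NumPy alone. This is not the FeatureElimination API: the model (closed-form ridge regression), the grid (a list of `alpha` values), the evaluation measure (validation MSE), and all function and parameter names are stand-ins chosen for illustration. The importance used here is the mean of |w_j (x_j - mean_j)|, which matches the SHAP attribution of a linear model under the assumption of independent features; a real run would compute SHAP values for whatever model the user supplies.

```python
import numpy as np


def fit_ridge(X, y, alpha):
    """Closed-form ridge regression on centered data; returns (weights, intercept)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w


def recursive_elimination(X_tr, y_tr, X_val, y_val, names, alphas, min_features=1, step=1):
    """At each round: tune alpha, rank features, drop the `step` least important."""
    active = list(range(X_tr.shape[1]))
    history = []
    while True:
        # 1) Hyperparameter search: pick the alpha with the lowest validation MSE.
        best = None
        for alpha in alphas:
            w, b = fit_ridge(X_tr[:, active], y_tr, alpha)
            mse = float(np.mean((X_val[:, active] @ w + b - y_val) ** 2))
            if best is None or mse < best[0]:
                best = (mse, alpha, w)
        mse, alpha, w = best
        # 2) Importance: mean |w_j * (x_j - mean_j)| over the training rows.
        centered = X_tr[:, active] - X_tr[:, active].mean(axis=0)
        importance = np.abs(centered * w).mean(axis=0)
        history.append({"features": [names[i] for i in active], "alpha": alpha, "mse": mse})
        if len(active) <= min_features:
            return history
        # 3) Drop the least important feature(s) before the next round.
        n_drop = min(step, len(active) - min_features)
        worst = set(np.argsort(importance)[:n_drop].tolist())
        active = [f for pos, f in enumerate(active) if pos not in worst]


# Hypothetical synthetic data: only f0 and f1 drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)
names = ["f0", "f1", "f2", "f3"]

history = recursive_elimination(
    X[:150], y[:150], X[150:], y[150:], names, alphas=[0.01, 1.0, 100.0], min_features=2
)
for round_info in history:
    print(round_info["features"], round_info["alpha"], round(round_info["mse"], 4))
```

Re-tuning the hyperparameters at every round, rather than once up front, keeps the importance ranking tied to a model that is tuned for the current feature set. The per-round history then lets the user pick the smallest feature set whose score is still acceptable.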