Selective: Feature Selection Library
Selective is a white-box feature selection library that supports supervised and unsupervised selection methods for classification and regression tasks.
Selective also provides optimized item selection based on the diversity of text embeddings (via TextWiser) and
the coverage of binary labels by solving a multi-objective optimization problem (CPAIOR'21, DSO@IJCAI'22). This approach has been shown to significantly speed up online experimentation and to boost recommender systems (NVIDIA GTC'22).
The library provides:
- Simple to complex selection methods: Variance, Correlation, Statistical, Linear, Tree-based, or Customized.
- Text-based selection to maximize diversity in text embeddings and metadata coverage.
- Interoperable with data frames as the input.
- Automated task detection. No need to know what feature selection method works with what machine learning task.
- Benchmarking multiple selectors using cross-validation with built-in parallelization.
- Inspection of the results and feature importance.
Selective is developed by the Artificial Intelligence Center of Excellence at Fidelity Investments.
Quick Start
from sklearn.datasets import fetch_california_housing
from feature.utils import get_data_label
from feature.selector import Selective, SelectionMethod

# Data
data, label = get_data_label(fetch_california_housing())

# Feature selectors, from simple to more complex.
# Each assignment replaces the previous one, so keep only the selector you need.
selector = Selective(SelectionMethod.Variance(threshold=0.0))
selector = Selective(SelectionMethod.Correlation(threshold=0.5, method="pearson"))
selector = Selective(SelectionMethod.Statistical(num_features=3, method="anova"))
selector = Selective(SelectionMethod.Linear(num_features=3, regularization="none"))
selector = Selective(SelectionMethod.TreeBased(num_features=3))

# Feature reduction
subset = selector.fit_transform(data, label)
print("Reduction:", list(subset.columns))
print("Scores:", list(selector.get_absolute_scores()))
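To make the two simplest methods concrete, here is a minimal, library-independent sketch of what variance and correlation filtering do conceptually. It assumes the features are given as a dict mapping column names to lists of floats. The helper names `variance_filter`, `pearson`, and `correlation_filter` are hypothetical and are not part of Selective's API; the library's own implementation may differ, for example in which feature of a correlated pair it drops.

```python
from math import sqrt
from statistics import pvariance


def variance_filter(columns, threshold=0.0):
    """Keep columns whose population variance is strictly above the threshold."""
    return [name for name, values in columns.items() if pvariance(values) > threshold]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm = sqrt(sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y))
    return cov / norm if norm else 0.0


def correlation_filter(columns, threshold=0.5):
    """Greedily keep a column only if |r| with every already-kept column is <= threshold."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Toy data (illustrative only): one constant column and one duplicated signal.
toy = {
    "constant": [1.0, 1.0, 1.0, 1.0],
    "x": [1.0, 2.0, 3.0, 4.0],
    "x_doubled": [2.0, 4.0, 6.0, 8.0],
    "noise": [1.0, -1.0, -1.0, 1.0],
}

non_constant = variance_filter(toy)
print(non_constant)
print(correlation_filter({name: toy[name] for name in non_constant}))
```

In this toy input the constant column is removed by the variance filter, and `x_doubled` is removed by the correlation filter because it is perfectly correlated with `x`.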
Available Methods
Benchmarking
from sklearn.datasets import fetch_california_housing
from feature.utils import get_data_label
from xgboost import XGBClassifier, XGBRegressor
from feature.selector import SelectionMethod, benchmark, calculate_statistics
data, label = get_data_label(fetch_california_housing())
corr_threshold = 0.5
num_features = 3
tree_params = {"n_estimators": 50, "max_depth": 5, "random_state": 111, "n_jobs": 4}
selectors = {
    "corr_pearson": SelectionMethod.Correlation(corr_threshold, method="pearson"),
    "corr_kendall": SelectionMethod.Correlation(corr_threshold, method="kendall"),
    "corr_spearman": SelectionMethod.Correlation(corr_threshold, method="spearman"),
    "stat_anova": SelectionMethod.Statistical(num_features, method="anova"),
    "stat_chi_square": SelectionMethod.Statistical(num_features, method="chi_square"),
    "stat_mutual_info": SelectionMethod.Statistical(num_features, method="mutual_info"),
    "linear": SelectionMethod.Linear(num_features, regularization="none"),
    "lasso": SelectionMethod.Linear(num_features, regularization="lasso", alpha=1000),
    "ridge": SelectionMethod.Linear(num_features, regularization="ridge", alpha=1000),
    "random_forest": SelectionMethod.TreeBased(num_features),
    "xgboost_classif": SelectionMethod.TreeBased(num_features, estimator=XGBClassifier(**tree_params)),
    "xgboost_regress": SelectionMethod.TreeBased(num_features, estimator=XGBRegressor(**tree_params))
}

# Benchmark sequentially with 5-fold cross-validation
score_df, selected_df, runtime_df = benchmark(selectors, data, label, cv=5)
print(score_df, "\n\n", selected_df, "\n\n", runtime_df)

# Benchmark in parallel using 4 jobs
score_df, selected_df, runtime_df = benchmark(selectors, data, label, cv=5, n_jobs=4)
print(score_df, "\n\n", selected_df, "\n\n", runtime_df)

# Summarize the benchmark results
stats_df = calculate_statistics(score_df, selected_df)
print(stats_df)
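The idea behind cross-validated benchmarking can be illustrated without the library. The sketch below is an assumption-laden illustration, not Selective's implementation: it splits sample indices into contiguous folds and then reports how often each feature was selected across folds. The helper names `kfold_indices` and `selection_frequency` are hypothetical, the per-fold selections are made-up placeholder data, and the actual layout of `score_df`, `selected_df`, and `stats_df` may differ.

```python
from collections import Counter


def kfold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous (train, test) index folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        folds.append((train, test))
        start = stop
    return folds


def selection_frequency(selected_per_fold):
    """Fraction of folds in which each feature was selected."""
    counts = Counter(feature for fold in selected_per_fold for feature in set(fold))
    n_folds = len(selected_per_fold)
    return {feature: count / n_folds for feature, count in counts.items()}


# Each fold trains on the other folds and is evaluated on its own test indices.
for train, test in kfold_indices(n_samples=7, k=3):
    print("train:", train, "test:", test)

# Placeholder selections per fold (illustrative only, not real benchmark output).
selected_per_fold = [["f1", "f2"], ["f1", "f3"], ["f1", "f2"], ["f1", "f2"]]
print(selection_frequency(selected_per_fold))
```

A feature that is selected in every fold is a more stable choice than one that appears in only a single fold, which is the kind of signal a benchmark summary is meant to surface.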
Text-based Selection
This example shows how to use text-based selection. In this scenario, we would like to select a subset of articles that is most diverse in the text embedding space and covers a range of topics.
import pandas as pd
from feature.selector import Selective, SelectionMethod
from textwiser import TextWiser, Embedding, Transformation
data = pd.DataFrame({"article_1": ["article text here"],
                     "article_2": ["article text here"],
                     "article_3": ["article text here"],
                     "article_4": ["article text here"],
                     "article_5": ["article text here"]})
labels = pd.DataFrame({"article_1": [1, 1, 0, 1],
                       "article_2": [0, 1, 0, 0],
                       "article_3": [0, 0, 1, 0],
                       "article_4": [0, 0, 1, 1],
                       "article_5": [1, 1, 1, 0]},
                      index=["label_1", "label_2", "label_3", "label_4"])
textwiser = TextWiser(Embedding.TfIdf(), Transformation.NMF(n_components=20))
selector = Selective(SelectionMethod.TextBased(num_features=2, featurization_method=textwiser))
subset = selector.fit_transform(data, labels)
print("Reduction:", list(subset.columns))
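To build intuition for the coverage part of this problem, the sketch below runs a simple greedy heuristic on the same binary labels as the example above. This is a deliberately simplified stand-in: Selective solves a multi-objective optimization problem that also accounts for diversity in the embedding space, whereas this sketch ignores the text entirely and only maximizes label coverage. The function name `greedy_coverage` is hypothetical and not part of the library, and its result is not necessarily what the library would select.

```python
def greedy_coverage(item_labels, num_items):
    """Pick up to num_items items, each time adding the item that covers the most new labels.

    Ties are broken by insertion order of item_labels.
    """
    selected, covered = [], set()
    for _ in range(num_items):
        candidates = [item for item in item_labels if item not in selected]
        if not candidates:
            break
        best = max(candidates, key=lambda item: len(item_labels[item] - covered))
        selected.append(best)
        covered |= item_labels[best]
    return selected, covered


# Same label matrix as above, written as the set of labels each article carries.
item_labels = {
    "article_1": {"label_1", "label_2", "label_4"},
    "article_2": {"label_2"},
    "article_3": {"label_3"},
    "article_4": {"label_3", "label_4"},
    "article_5": {"label_1", "label_2", "label_3"},
}

selected, covered = greedy_coverage(item_labels, num_items=2)
print("Selected:", selected)
print("Covered labels:", sorted(covered))
```

With two items, the greedy pass already covers all four labels here. A pure coverage heuristic cannot distinguish between equally good candidates, which is where the diversity objective over text embeddings becomes useful.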
Visualization
import pandas as pd
from sklearn.datasets import fetch_california_housing
from feature.utils import get_data_label
from feature.selector import SelectionMethod, Selective, plot_importance
data, label = get_data_label(fetch_california_housing())
selector = Selective(SelectionMethod.Linear(num_features=8, regularization="none"))
subset = selector.fit_transform(data, label)
df = pd.DataFrame(selector.get_absolute_scores(), index=data.columns)
plot_importance(df)
Installation
Selective requires Python 3.7+ and can be installed from PyPI by running: pip install selective
Source
Alternatively, you can build a wheel package on your platform from scratch using the source code:
git clone https://github.com/fidelity/selective.git
cd selective
pip install setuptools wheel
python setup.py sdist bdist_wheel
pip install dist/selective-X.X.X-py3-none-any.whl
Test your setup
cd selective
python -m unittest discover tests
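Before running the full test suite, it can be handy to confirm that the package is visible to the current Python environment. The following is a small sketch using only the standard library; it assumes the importable package name is `feature`, as used in the imports of the examples above, and the helper name `is_installed` is hypothetical.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


print("feature package found:", is_installed("feature"))
```

If this prints False, the install step most likely targeted a different Python environment than the one running the check.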
Support
Please submit bug reports and feature requests as Issues.
License
Selective is licensed under the GNU GPL 3.0.