FeatureWiz, the ultimate feature selection library, is powered by the renowned Minimum Redundancy Maximum Relevance (MRMR) algorithm. Learn more about it below.
featurewiz 5.0 version is out! It contains brand new Deep Learning Auto Encoders to enrich your data for the toughest imbalanced and multi-class datasets. In addition, it has multiple brand-new Classifiers built for imbalanced and multi-class problems, such as the IterativeDoubleClassifier and the BlaggingClassifier. If you are looking for the latest and greatest updates about our library, check out our updates page.
If you use featurewiz in your research project or paper, please use the following format for citations:
"Seshadri, Ram (2020). GitHub - AutoViML/featurewiz: Use advanced feature engineering strategies and select the best features from your data set fast with a single line of code. source code: https://github.com/AutoViML/featurewiz"
Current citations for featurewiz
Google Scholar citations for featurewiz
featurewiz is the best feature selection library for boosting your machine learning performance with minimal effort and maximum relevance using the famous MRMR algorithm.
Automatically select the most relevant features without specifying a number
Fast and user-friendly, perfect for data scientists at all levels
Provides a built-in categorical-to-numeric encoder
Well-documented with plenty of examples
Actively maintained and regularly updated
First create additional features using the feature engineering module
Compare featurewiz against other feature selection methods for best performance
Avoid overfitting by cross-validating your results
Try adding auto-encoders for additional features that may help boost performance
Create new features effortlessly with a single line of code. featurewiz enables you to generate hundreds of interaction, group-by, or target-encoded features, eliminating the need for expert-level skills.
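As a rough illustration of what interaction and group-by features look like, here is a minimal pandas sketch. The helper names (`add_interactions`, `add_groupby_mean`) and the generated column names are hypothetical and are not part of featurewiz's API; featurewiz builds such features for you through its feature engineering module.

```python
import pandas as pd


def add_interactions(df, numeric_cols):
    """Add squares and pairwise products of the given numeric columns."""
    out = df.copy()
    for i, a in enumerate(numeric_cols):
        out[f"{a}_squared"] = df[a] ** 2
        for b in numeric_cols[i + 1:]:
            out[f"{a}_x_{b}"] = df[a] * df[b]
    return out


def add_groupby_mean(df, cat_col, num_col):
    """Add the mean of num_col within each category of cat_col."""
    out = df.copy()
    out[f"{num_col}_mean_by_{cat_col}"] = df.groupby(cat_col)[num_col].transform("mean")
    return out


df = pd.DataFrame({
    "city": ["a", "a", "b", "b"],
    "x1": [1.0, 2.0, 3.0, 4.0],
    "x2": [10.0, 20.0, 30.0, 40.0],
})
engineered = add_groupby_mean(add_interactions(df, ["x1", "x2"]), "city", "x1")
print(engineered.columns.tolist())
```

Even two numeric columns and one categorical column produce four new features here, which is why a selection step is needed afterwards.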
featurewiz provides one of the best automatic feature selection algorithms, MRMR, described by Wikipedia as follows: "The MRMR feature selection algorithm has been found to be more powerful than the maximum relevance feature selection algorithm" such as Boruta.
After creating new features, featurewiz uses the MRMR algorithm to answer crucial questions: Which features are important? Are they redundant or multi-correlated? Does your model suffer from or benefit from these new features? To answer these questions, two more steps are needed:
SULOV Algorithm: The "Searching for Uncorrelated List of Variables" method ensures you're left with the most relevant, non-redundant features.
Recursive XGBoost: featurewiz leverages XGBoost to repeatedly identify the best features among the selected variables after SULOV.
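The SULOV idea can be sketched in a few lines of NumPy. This is only an illustration of the principle described above (among highly correlated pairs, keep the feature that carries more information about the target) and is not featurewiz's own implementation. The function names, the histogram-based mutual information estimate, and the synthetic data are all assumptions made for the example.

```python
import numpy as np


def mutual_information(x, y, bins=8):
    """Histogram estimate of the mutual information between two 1-D arrays (in nats)."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y, shape (1, bins)
    nonzero = pxy > 0
    return float(np.sum(pxy[nonzero] * np.log(pxy[nonzero] / (px @ py)[nonzero])))


def sulov_sketch(X, y, names, corr_limit=0.70):
    """Drop the less informative member of every highly correlated feature pair."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    mis = np.array([mutual_information(X[:, j], y) for j in range(X.shape[1])])
    removed = set()
    n = X.shape[1]
    for i in range(n):
        for j in range(i + 1, n):
            if corr[i, j] > corr_limit:
                # Keep the feature with the higher mutual information score.
                removed.add(i if mis[i] < mis[j] else j)
    return [name for k, name in enumerate(names) if k not in removed]


rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + 0.5 * rng.normal(size=500)   # noisy near-duplicate of x1
x3 = rng.normal(size=500)              # unrelated to x1 and x2
y = x1 + 0.1 * rng.normal(size=500)    # target driven by x1
X = np.column_stack([x1, x2, x3])

selected = sulov_sketch(X, y, ["x1", "x2", "x3"], corr_limit=0.70)
print(selected)
```

In this toy data `x1` and `x2` are correlated above the limit, so only the one that tells more about `y` survives, while the uncorrelated `x3` is left for the next step to judge.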
featurewiz extends beyond traditional feature selection by including powerful feature engineering capabilities such as:
featurewiz has two major modules to transform your Data Science workflow:
1. Feature Engineering Module
2. Feature Selection Module
featurewiz uses what is known as a Minimal Optimal algorithm, while Boruta uses an All-Relevant algorithm. To understand how featurewiz's MRMR approach differs from Boruta's more comprehensive feature selection, see the chart below. It shows how the SULOV algorithm performs MRMR feature selection, which produces a smaller feature set than Boruta. Additionally, Boruta retains redundant (highly correlated) features, which can hamper model performance, while featurewiz does not.
Transform your feature engineering and selection process with featurewiz - the tool that brings expert-level capabilities to your fingertips!
featurewiz performs feature selection in 2 steps. Each step is explained below.
The working of the SULOV algorithm is as follows:
Recursive XGBoost works as follows: once SULOV has selected variables that have high mutual information scores with the least correlation among them, featurewiz uses XGBoost to repeatedly find the best features among those remaining variables.
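The text above does not spell out the loop, so the following is only one plausible reading of "repeatedly": score progressively smaller subsets of the columns, keep the top few from each round, and take the union. To keep the example runnable with NumPy alone, a simple absolute-correlation scorer stands in for XGBoost feature importances. The function names and the `rounds` and `top_k` parameters are hypothetical.

```python
import numpy as np


def abs_correlation_importance(X, y):
    """Stand-in scorer: |Pearson r| of each column with y (featurewiz uses XGBoost here)."""
    return np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])


def recursive_selection(X, y, names, importance_fn=abs_correlation_importance,
                        rounds=3, top_k=1):
    """Score shrinking column subsets and collect the top_k features of each round."""
    selected = []
    n = X.shape[1]
    step = max(1, n // rounds)
    for start in range(0, n, step):
        cols = list(range(start, n))              # drop the first `start` columns
        scores = importance_fn(X[:, cols], y)
        best = np.argsort(scores)[::-1][:top_k]   # indices of the highest scores
        for b in best:
            name = names[cols[b]]
            if name not in selected:
                selected.append(name)
    return selected


rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 6))
y = 3 * X[:, 0] + 2 * X[:, 3] + 1.0 * X[:, 5] + 0.5 * rng.normal(size=1000)
names = [f"f{j}" for j in range(6)]

chosen = recursive_selection(X, y, names, rounds=3, top_k=1)
print(chosen)
```

Repeating the scoring on smaller subsets gives weaker but still useful features a chance to surface once the dominant ones are no longer in the pool.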
Here are some additional tips for ML engineers and data scientists when using featurewiz:
Prerequisites:
In Kaggle notebooks, you need to install featurewiz like this (otherwise there will be errors):
!pip install featurewiz
!pip install Pillow==9.0.0
!pip install xlrd --ignore-installed --no-deps
!pip install "executing>0.10.0"
To install from source:
cd <featurewiz_Destination>
git clone git@github.com:AutoViML/featurewiz.git
# or download and unzip https://github.com/AutoViML/featurewiz/archive/master.zip
conda create -n <your_env_name> python=3.7 anaconda
conda activate <your_env_name> # with older conda versions: `source activate <your_env_name>` (on Windows: `activate <your_env_name>`)
cd featurewiz
pip install -r requirements.txt
As of June 2022, thanks to arturdaraujo, featurewiz is now available on conda-forge. You can try:
conda install -c conda-forge featurewiz
To install the latest version directly from GitHub (for example, in a notebook):
!pip install git+https://github.com/AutoViML/featurewiz.git
There are two ways to use featurewiz.
# New syntax: scikit-learn style fit and transform
from featurewiz import FeatureWiz
fwiz = FeatureWiz(feature_engg='', nrows=None, transform_target=True, scalers="std",
        category_encoders="auto", add_missing=False, verbose=0, imbalanced=False,
        ae_options={})
X_train_selected, y_train = fwiz.fit_transform(X_train, y_train)
X_test_selected = fwiz.transform(X_test)
### get list of selected features ###
fwiz.features
# Old syntax: a single function call that returns a tuple
import featurewiz as fwiz
outputs = fwiz.featurewiz(dataname=train, target=target, corr_limit=0.70, verbose=2, sep=',',
        header=0, test_data='', feature_engg='', category_encoders='',
        dask_xgboost_flag=False, nrows=None, skip_sulov=False, skip_xgboost=False)
outputs is a tuple: there will always be two objects in it. What they are can vary:
features and trainm: features is a list (of selected features) and trainm is the transformed dataframe (if you sent in train only).
trainm and testm: two transformed dataframes, containing only the selected features, when you send in both train and test.
In both cases, the features and dataframes are ready for you to do further modeling.
featurewiz works on any multi-class, multi-label data set, so you can have as many target labels as you want. You don't have to tell featurewiz whether it is a regression or classification problem; it will decide that automatically.
Input Arguments for NEW syntax
Parameters
----------
corr_limit : float, default=0.90
The correlation limit to consider for feature selection. Features with correlations
above this limit may be excluded.
verbose : int, default=0
Level of verbosity in output messages.
feature_engg : str or list, default=''
Specifies the feature engineering methods to apply, such as 'interactions', 'groupby',
and 'target'.
auto_encoders : str or list, default=''
Seven new options have been added recently to `auto_encoders` (starting in version 0.5.0): `DAE`, `VAE`, `DAE_ADD`, `VAE_ADD`, `CNN`, `CNN_ADD` and `GAN`. These are deep learning auto encoders (using tensorflow and keras) that can extract the most important patterns in your data and either replace your features or add them as extra features to your data. Try them for your toughest ML problems! See the notebooks folder for examples.
ae_options : dict, default={}
You can provide a dictionary for tuning auto encoders above. Supported auto encoders include 'dae',
'vae', and 'gan'. You must use the `help` function to see how to send a dict to each auto encoder. You can also check out this <a href="https://github.com/AutoViML/featurewiz/blob/main/examples/Featurewiz_with_AutoEncoder_Demo.ipynb">Auto Encoder demo notebook</a>
category_encoders : str or list, default=''
Encoders for handling categorical variables. Supported encoders include 'onehot',
'ordinal', 'hashing', 'count', 'catboost', 'target', 'glm', 'sum', 'woe', 'bdc',
'loo', 'base', 'james', 'helmert', 'label', 'auto', etc.
add_missing : bool, default=False
If True, adds indicators for missing values in the dataset.
dask_xgboost_flag : bool, default=False
If set to True, enables the use of Dask for parallel computing with XGBoost.
nrows : int or None, default=None
Limits the number of rows to process.
skip_sulov : bool, default=False
If True, skips the application of the SULOV (Searching for Uncorrelated List of Variables)
method in feature selection.
skip_xgboost : bool, default=False
If True, bypasses the recursive XGBoost feature selection.
transform_target : bool, default=False
When True, transforms the target variable(s) into numeric format if they are not
already.
scalers : str or None, default=None
Specifies the scaler to use for feature scaling. Available options include
'std', 'standard', 'minmax', 'max', 'robust', 'maxabs'.
imbalanced : bool, default=False
Specifies whether to use SMOTE technique for imbalanced datasets.
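To make the `add_missing` option more concrete, here is a minimal pandas sketch of missing-value indicator columns. It is an illustration of the idea only: the helper name, the `_missing` suffix, and the choice to add indicators only for columns that actually contain missing values are assumptions, not featurewiz's documented behavior.

```python
import numpy as np
import pandas as pd


def add_missing_indicators(df, suffix="_missing"):
    """Add a 0/1 column for every column that contains at least one missing value."""
    out = df.copy()
    for col in df.columns:
        if df[col].isna().any():
            out[col + suffix] = df[col].isna().astype(int)
    return out


df = pd.DataFrame({
    "age": [25, np.nan, 40],
    "city": ["a", None, "b"],
    "score": [1.0, 2.0, 3.0],
})
flagged = add_missing_indicators(df)
print(flagged.columns.tolist())
```

The indicator columns let a model treat "value was missing" as a signal in its own right, independent of how the gap is later imputed.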
Input Arguments for old syntax
dataname: could be a datapath+filename or a dataframe. featurewiz will detect whether your input is a filename or a dataframe and load it automatically.
target: name of the target variable in the data set.
corr_limit: if you want to set your own threshold for removing highly correlated variables, give it here. The default is 0.9, which means variables with a Pearson's correlation below -0.9 or above 0.9 will be candidates for removal.
verbose: This has 3 possible states:
0 - limited output. Great for running this silently and getting fast results.
1 - verbose. Great for understanding how the results were obtained and for changing the input flags.
2 - more charts such as SULOV and output. Great for finding out what happens under the hood in the SULOV method.
test_data: This is only applicable to the old syntax, if you want to transform both train and test data at the same time in the same way. test_data could be a datapath+filename or a dataframe. featurewiz will detect whether your input is a filename or a dataframe and load it automatically. Default is empty string.
dask_xgboost_flag: default False. If you want to use dask with your data, then set this to True.
feature_engg: You can let featurewiz add feature engineering to your data set by setting this flag. There are three choices. You can choose one, two, or all three.
interactions: This will add interaction features to your data such as x1*x2, x2*x3, x1**2, x2**2, etc.
groupby: This will add group-by features for your numeric variables by grouping on all categorical variables.
target: This will encode and transform all your categorical features using certain target encoders.
add_missing: default is False. This is a new flag: it adds a new column for missing values for all the variables in your dataset. This will help you catch missing values as an added signal.
category_encoders: default is "auto". Instead, you can choose your own category encoders from the list below.
We recommend you do not use more than two of these. Featurewiz will automatically select only two if you have more than two in your list. You can set "auto" for our own choice or the empty string "" (which means no encoding of your categorical features).
HashingEncoder: HashingEncoder is a multivariate hashing implementation with configurable dimensionality/precision. The advantage of this encoder is that it does not maintain a dictionary of observed categories. Consequently, the encoder does not grow in size and accepts new values during data scoring by design.
SumEncoder: SumEncoder is a Sum contrast coding for the encoding of categorical features.
PolynomialEncoder: PolynomialEncoder is a Polynomial contrast coding for the encoding of categorical features.
BackwardDifferenceEncoder: BackwardDifferenceEncoder is a Backward difference contrast coding for encoding categorical variables.
OneHotEncoder: OneHotEncoder is the traditional one-hot (or dummy) coding for categorical features. It produces one feature per category, each being a binary.
HelmertEncoder: HelmertEncoder uses the Helmert contrast coding for encoding categorical features.
OrdinalEncoder: OrdinalEncoder uses ordinal encoding to designate a single column of integers to represent the categories in your data. The integers, however, are assigned in the order in which the categories are found in your dataset. If you want to change the order, just sort the column and send it in for encoding.
FrequencyEncoder: FrequencyEncoder is a count encoding technique for categorical features. For a given categorical feature, it replaces the names of the categories with the group counts of each category.
BaseNEncoder: BaseNEncoder encodes the categories into arrays of their base-N representation. A base of 1 is equivalent to one-hot encoding (not really base-1, but useful), and a base of 2 is equivalent to binary encoding. N=number of actual categories is equivalent to vanilla ordinal encoding.
TargetEncoder: TargetEncoder performs Target encoding for categorical features. It supports the following kinds of targets: binary and continuous. For multi-class targets, it uses a PolynomialWrapper.
CatBoostEncoder: CatBoostEncoder performs CatBoost coding for categorical features. It supports the following kinds of targets: binary and continuous. For polynomial target support, it uses a PolynomialWrapper. This is very similar to leave-one-out encoding, but calculates the values "on-the-fly". Consequently, the values naturally vary during the training phase and it is not necessary to add random noise.
WOEEncoder: WOEEncoder uses the Weight of Evidence technique for categorical features. It supports only one kind of target: binary. For polynomial target support, it uses a PolynomialWrapper. It cannot be used for Regression.
JamesSteinEncoder: JamesSteinEncoder uses the James-Stein estimator. It supports 2 kinds of targets: binary and continuous. For polynomial target support, it uses PolynomialWrapper.
For feature value i, the James-Stein estimator returns a weighted average of:
The mean target value for the observed feature value i.
The mean target value (regardless of the feature value).
nrows: default None. You can set the number of rows to read from your datafile if it is too large to fit into either dask or pandas. But you won't have to if you use dask.
skip_sulov: default False. You can set the flag to skip the SULOV method if you want.
skip_xgboost: default False. You can set the flag to skip the Recursive XGBoost method if you want.
Output values for old syntax
This applies only to the old syntax.
outputs: Output is always a tuple. We can call the two outputs in that tuple out1 and out2 below.
out1 and out2: If you sent in just one dataframe or filename as input, you will get:
features: a list (of selected features), and
trainm: a dataframe (if you sent in a file or dataname as input).
out1 and out2: If you sent in two files or dataframes (train and test), you will get:
trainm: a modified train dataframe with engineered and selected features from dataname, and
testm: a modified test dataframe with engineered and selected features from test_data.
To learn more about how featurewiz works under the hood, watch this video.
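Two of the category encoders described in the arguments above reduce to simple counting and averaging, which the following pure-Python sketch illustrates. It is not the code featurewiz or its encoder library runs. In particular, the fixed `weight` parameter is a hypothetical simplification: a real James-Stein estimator derives the weight from the variances in the data rather than taking it as an input.

```python
from collections import Counter, defaultdict


def frequency_encode(values):
    """Replace each category with the number of times it occurs."""
    counts = Counter(values)
    return [counts[v] for v in values]


def shrunken_target_means(categories, targets, weight=0.5):
    """Weighted average of each category's target mean and the overall target mean.

    weight is the share given to the overall mean:
    0 keeps the plain category mean, 1 returns the overall mean for every category.
    """
    overall = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = Counter()
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    return {c: (1 - weight) * (sums[c] / counts[c]) + weight * overall for c in counts}


colors = ["red", "blue", "red", "green", "red"]
clicked = [1, 0, 1, 0, 0]

freq = frequency_encode(colors)
encoded = shrunken_target_means(colors, clicked, weight=0.5)
print(freq)
print(encoded)
```

Pulling each category mean toward the overall mean keeps rare categories, such as a color seen only once, from receiving an extreme encoded value.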
featurewiz was designed for selecting High Performance variables with the fewest steps.
In most cases, featurewiz builds models with 20%-99% fewer features than your original data set, with nearly the same or slightly lower performance (this is based on my trials; your experience may vary).
featurewiz is every Data Scientist's feature wizard that will:
*** Special thanks to fellow open source Contributors ***:
PRs accepted.
Apache License 2.0 © 2020 Ram Seshadri
This project is not an official Google project. It is not supported by Google and Google specifically disclaims all warranties as to its quality, merchantability, or fitness for a particular purpose.