An end-to-end solution for AutoML.
Pass in your data, add some information about it and get a full pipeline in return: data preprocessing, feature creation, modelling and evaluation with just a few lines of code.
From PyPI:
pip install e2eml
We highly recommend creating a new virtual environment first and installing e2eml into it. You can also download the pretrained spaCy model into that environment beforehand; otherwise e2eml will do this automatically at runtime.
e2eml can also be installed into a RAPIDS environment. For this we recommend creating a fresh environment following the RAPIDS instructions. After the environment has been installed and activated, a special installation is needed to avoid installation issues.
Just run:
pip install e2eml[rapids]
This additionally installs cupy and cython to prevent issues. You also need to follow the PyTorch installation instructions. When installing RAPIDS, PyTorch and spaCy for GPU, check which CUDA versions all three support. If PyTorch-related parts fail at runtime, we recommend creating a fresh environment and installing PyTorch with pip rather than conda.
# also spacy supports GPU acceleration
pip install -U spacy[cuda112] #cuda112 depends on your actual cuda version, see: https://spacy.io/usage
Otherwise Pytorch will fail trying to run on GPU.
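Since RAPIDS, PyTorch and spaCy all have to match the installed CUDA version, it can help to list what is actually present in the environment before running a blueprint. The following is a minimal sketch that uses only the Python standard library. The helper `report_installed_versions` is hypothetical and not part of e2eml, and the distribution names in `GPU_RELATED_PACKAGES` are assumptions (cupy, for example, is often published under CUDA-specific names), so adjust them to your setup.

```python
from importlib import metadata

# Assumed distribution names; adjust them to what your environment uses.
GPU_RELATED_PACKAGES = ("torch", "spacy", "cupy", "cudf", "xgboost", "lightgbm")


def report_installed_versions(packages=GPU_RELATED_PACKAGES):
    """Return {package name: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in report_installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

The output only shows which versions are installed. Whether they are compatible with your CUDA version still has to be checked against each project's installation instructions.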
To install e2eml together with Jupyter core and IPython, use the following instead:
pip install e2eml[full]
e2e has been designed to create state-of-the-art machine learning pipelines with a few lines of code. Basic example of usage:
import e2eml
from e2eml.classification import classification_blueprints
import pandas as pd
# import data
df = pd.read_csv("Your.csv")
# split into a test/train & holdout set (holdout for prediction illustration here, but not required at all)
train_df = df.head(1000).copy()
holdout_df = df.tail(200).copy() # make sure
# saving the holdout dataset's target for later and delete it from holdout dataset
target = "target_column"
holdout_target = holdout_df[target].copy()
del holdout_df[target]
# instantiate the needed blueprints class
from e2eml.classification import classification_blueprints  # regression blueprints are available via: from e2eml.regression import regression_blueprints
test_class = classification_blueprints.ClassificationBluePrint(
    datasource=train_df,
    target_variable=target,
    train_split_type='cross',
    rapids_acceleration=True,  # if installed into a conda environment with NVIDIA RAPIDS, this can be used to accelerate preprocessing with GPU
    preferred_training_mode='auto',  # 'auto' will automatically identify if LGBM & Xgboost can use GPU acceleration*
    tune_mode='accurate',  # hyperparameter sets will be validated with 10-fold CV. Set this to 'simple' for 1-fold CV
    # categorical_columns=cat_columns,  # you can define categorical columns, otherwise e2e does this automatically
    # date_columns=date_columns,  # you can also define date columns (expected is YYYY-MM-DD format)
)
"""
*
'Auto' is recommended for preferred_training_mode parameter, but with 'CPU' and 'GPU' it can also be controlled manually.
If you install Xgboost & LGBM into the same environment as GPU accelerated versions, you can set preferred_training_mode='gpu'.
This will massively improve training times and speed up SHAP feature importance for LGBM and Xgboost related tasks.
For Xgboost this should work out of the box, if installed into a RAPIDS environment.
"""
# run actual blueprint
test_class.ml_bp01_multiclass_full_processing_xgb_prob()
"""
When choosing blueprints several options are available:
Multiclass blueprints can handle binary and multiclass tasks:
- ml_bp00_train_test_binary_full_processing_log_reg_prob()
- ml_bp01_multiclass_full_processing_xgb_prob()
- ml_bp02_multiclass_full_processing_lgbm_prob()
- ml_bp03_multiclass_full_processing_sklearn_stacking_ensemble()
- ml_bp04_multiclass_full_processing_ngboost()
- ml_bp05_multiclass_full_processing_vowpal_wabbit()
- ml_bp06_multiclass_full_processing_bert_transformer() # for NLP specifically
- ml_bp07_multiclass_full_processing_tabnet()
- ml_bp08_multiclass_full_processing_ridge()
- ml_bp09_multiclass_full_processing_catboost()
- ml_bp10_multiclass_full_processing_sgd()
- ml_bp11_multiclass_full_processing_quadratic_discriminant_analysis()
- ml_bp12_multiclass_full_processing_svm()
- ml_bp13_multiclass_full_processing_multinomial_nb()
- ml_bp14_multiclass_full_processing_lgbm_focal()
- ml_bp16_multiclass_full_processing_neural_network() # offers fully connected ANN & 1D CNN
- ml_special_binary_full_processing_boosting_blender()
- ml_special_multiclass_auto_model_exploration()
- ml_special_multiclass_full_processing_multimodel_max_voting()
There are regression blueprints as well (in regression module):
- ml_bp10_train_test_regression_full_processing_linear_reg()
- ml_bp11_regression_full_processing_xgboost()
- ml_bp12_regressions_full_processing_lgbm()
- ml_bp13_regression_full_processing_sklearn_stacking_ensemble()
- ml_bp14_regressions_full_processing_ngboost()
- ml_bp15_regression_full_processing_vowpal_wabbit_reg()
- ml_bp16_regressions_full_processing_bert_transformer()
- ml_bp17_regression_full_processing_tabnet_reg()
- ml_bp18_regression_full_processing_ridge_reg()
- ml_bp19_regression_full_processing_elasticnet_reg()
- ml_bp20_regression_full_processing_catboost()
- ml_bp20_regression_full_processing_sgd()
- ml_bp21_regression_full_processing_ransac()
- ml_bp22_regression_full_processing_svm()
- ml_bp23_regressions_full_processing_neural_network() # offers fully connected ANN & 1D CNN
- ml_special_regression_full_processing_multimodel_avg_blender()
- ml_special_regression_auto_model_exploration()
In the time series module we recently embedded blueprints as well:
- ml_bp100_univariate_timeseries_full_processing_auto_arima()
- ml_bp101_multivariate_timeseries_full_processing_lstm()
- ml_bp102_multivariate_timeseries_full_processing_tabnet()
- ml_bp103_multivariate_timeseries_full_processing_rnn()
- ml_bp104_univariate_timeseries_full_processing_holt_winters()
Time series blueprints use less preprocessing by default and do not support all the options
that classification and regression blueprints offer. Algorithms that are not specific to time series, such as TabNet, differ
from their regression counterparts: cross validation is replaced by time series splits and
data scaling covers the target variable as well.
In ensembles, algorithms can be chosen via the class attribute:
test_class.special_blueprint_algorithms = {"ridge": True,
"elasticnet": False,
"xgboost": True,
"ngboost": True,
"lgbm": True,
"tabnet": False,
"vowpal_wabbit": True,
"sklearn_ensemble": True,
"catboost": False
}
Preprocessing steps can also be selected:
test_class.blueprint_step_selection_non_nlp = {
"automatic_type_detection_casting": True,
"remove_duplicate_column_names": True,
"reset_dataframe_index": True,
"fill_infinite_values": True,
"early_numeric_only_feature_selection": True,
"delete_high_null_cols": True,
"data_binning": True,
"regex_clean_text_data": False,
"handle_target_skewness": False,
"datetime_converter": True,
"pos_tagging_pca": False, # slow with many categories
"append_text_sentiment_score": False,
"tfidf_vectorizer_to_pca": False, # slow with many categories
"tfidf_vectorizer": False,
"rare_feature_processing": True,
"cardinality_remover": True,
"categorical_column_embeddings": False,
"holistic_null_filling": True, # slow
"numeric_binarizer_pca": True,
"onehot_pca": True,
"category_encoding": True,
"fill_nulls_static": True,
"autoencoder_outlier_detection": True,
"outlier_care": True,
"delete_outliers": False,
"remove_collinearity": True,
"skewness_removal": True,
"automated_feature_transformation": False,
"random_trees_embedding": False,
"clustering_as_a_feature_dbscan": True,
"clustering_as_a_feature_kmeans_loop": True,
"clustering_as_a_feature_gaussian_mixture_loop": True,
"pca_clustering_results": True,
"svm_outlier_detection_loop": False,
"autotuned_clustering": False,
"reduce_memory_footprint": False,
"scale_data": True,
"smote": False,
"automated_feature_selection": True,
"bruteforce_random_feature_selection": False, # slow
"autoencoder_based_oversampling": False,
"synthetic_data_augmentation": False,
"final_pca_dimensionality_reduction": False,
"final_kernel_pca_dimensionality_reduction": False,
"delete_low_variance_features": False,
"shap_based_feature_selection": False,
"delete_unpredictable_training_rows": False,
"trained_tokenizer_embedding": False,
"sort_columns_alphabetically": True,
"use_tabular_gan": False,
}
The bruteforce_random_feature_selection step is experimental, but it has shown promising results.
It is useful if the model overfits (which should happen rarely) because too many features with too little
feature importance have been considered. The number of trials can be controlled,
for example with test_class.hyperparameter_tuning_rounds["bruteforce_random"] = 400.
Generally the class instance is a control center and gives room for plenty of customization.
Never update the class attributes as a whole, as shown below.
test_class.tabnet_settings = {"batch_size": rec_batch_size,
    "virtual_batch_size": virtual_batch_size,
    # pred batch size?
    "num_workers": 0,
    "max_epochs": 1000}
test_class.hyperparameter_tuning_rounds = {
"xgboost": 100,
"lgbm": 500,
"lgbm_focal": 50,
"tabnet": 25,
"ngboost": 25,
"sklearn_ensemble": 10,
"ridge": 500,
"elasticnet": 100,
"catboost": 25,
"sgd": 2000,
"svm": 50,
"svm_regression": 50,
"ransac": 50,
"multinomial_nb": 100,
"bruteforce_random": 400,
"synthetic_data_augmentation": 100,
"autoencoder_based_oversampling": 200,
"final_kernel_pca_dimensionality_reduction": 50,
"final_pca_dimensionality_reduction": 50,
"auto_arima": 50,
"holt_winters": 50,
}
test_class.hyperparameter_tuning_max_runtime_secs = {
"xgboost": 2 * 60 * 60,
"lgbm": 2 * 60 * 60,
"lgbm_focal": 2 * 60 * 60,
"tabnet": 2 * 60 * 60,
"ngboost": 2 * 60 * 60,
"sklearn_ensemble": 2 * 60 * 60,
"ridge": 2 * 60 * 60,
"elasticnet": 2 * 60 * 60,
"catboost": 2 * 60 * 60,
"sgd": 2 * 60 * 60,
"svm": 2 * 60 * 60,
"svm_regression": 2 * 60 * 60,
"ransac": 2 * 60 * 60,
"multinomial_nb": 2 * 60 * 60,
"bruteforce_random": 2 * 60 * 60,
"synthetic_data_augmentation": 1 * 60 * 60,
"autoencoder_based_oversampling": 2 * 60 * 60,
"final_kernel_pca_dimensionality_reduction": 4 * 60 * 60,
"final_pca_dimensionality_reduction": 2 * 60 * 60,
"auto_arima": 2 * 60 * 60,
"holt_winters": 2 * 60 * 60,
}
When these parameters have to be updated, please overwrite the keys individually so that the blueprints do not break.
For example, test_class.hyperparameter_tuning_max_runtime_secs["xgboost"] = 12*60*60 would work fine.
Working with big data can push any hardware to its limits. e2eml has been tested with:
- Ryzen 5950x (16 cores CPU)
- Geforce RTX 3090 (24GB VRAM)
- 64GB RAM
With these specs, e2eml has been able to process approximately 100k rows with 200 columns in a stable manner for non-blended
blueprints. Blended blueprints consume more resources because e2eml currently keeps the trained models in memory.
For data bigger than 100k rows, it is possible to limit the amount of data used in various preprocessing steps:
- test_class.feature_selection_sample_size = 100000 # for feature selection
- test_class.hyperparameter_tuning_sample_size = 100000 # for model hyperparameter optimization
- test_class.brute_force_selection_sample_size = 15000 # for an experimental feature selection
For binary classification, a sample size of 100k datapoints is sufficient in most cases.
The hyperparameter tuning sample size can be much smaller, depending on class imbalance.
For multiclass tasks, we recommend starting with small samples, as the memory consumption of algorithms like Xgboost and LGBM
grows quickly with the number of classes. LGBM focal or a neural network are good starting points here.
Whenever classes are imbalanced (binary and multiclass), we recommend using the preprocessing step
"autoencoder_based_oversampling".
"""
# After running the blueprint the pipeline is done. It can be saved with:
save_to_production(test_class, file_name='automl_instance')
# The blueprint can be loaded with
loaded_test_class = load_for_production(file_name='automl_instance')
# predict on new data (in this case our holdout) with loaded blueprint
loaded_test_class.ml_bp01_multiclass_full_processing_xgb_prob(holdout_df)
# predictions can be accessed via a class attribute
print(loaded_test_class.predicted_classes['xgboost'])
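The example above stores `holdout_target` "for later" but never uses it. One natural follow-up is to compare the predictions with those saved labels. The sketch below is a plain-Python accuracy check with stand-in data. The helper `accuracy` is hypothetical and not part of e2eml. It also assumes that `predicted_classes['xgboost']` is a sequence of class labels in the same row order as the holdout set, which the source does not state explicitly.

```python
def accuracy(predicted, actual):
    """Share of positions where the predicted label equals the true label."""
    predicted = list(predicted)
    actual = list(actual)
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot score an empty holdout set")
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


# Stand-in values. In the example above these would be (assumption)
# loaded_test_class.predicted_classes['xgboost'] and holdout_target.
example_predictions = [0, 1, 1, 0, 1]
example_targets = [0, 1, 0, 0, 1]
print(accuracy(example_predictions, example_targets))  # 0.8
```

Accuracy is only one possible metric. For imbalanced classes, a metric that accounts for class frequencies is usually more informative.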
This project uses pre-commit to enforce style.
To install the pre-commit hooks, first install pre-commit into the project's virtual environment:
pip install pre-commit
Then install the project hooks:
pre-commit install
Now, whenever you make a commit, the linting and autoformatting will automatically run.
e2e is not designed to quickly iterate over several algorithms and suggest the best one. It is made to deliver state-of-the-art performance as ready-to-go blueprints. e2e-ml blueprints contain:
This comes at the cost of runtime. Depending on your data we recommend strong hardware.
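Because runtime is the main cost, the per-algorithm tuning budgets from the usage example (`hyperparameter_tuning_max_runtime_secs`) are one lever for keeping runs manageable. The sketch below uses a hypothetical stand-in class, not e2eml code, to illustrate why the usage notes ask for keys to be overwritten individually. It assumes that a blueprint looks up its own key in the dictionary, so replacing the whole dictionary would drop the defaults that other blueprints rely on.

```python
class StandInBlueprint:
    """Hypothetical stand-in for a blueprint class; not e2eml code."""

    def __init__(self):
        # A shortened copy of the default budgets shown in the usage example.
        self.hyperparameter_tuning_max_runtime_secs = {
            "xgboost": 2 * 60 * 60,
            "lgbm": 2 * 60 * 60,
            "ridge": 2 * 60 * 60,
        }

    def runtime_budget(self, algorithm):
        # Assumption: each blueprint reads the entry for its own algorithm.
        return self.hyperparameter_tuning_max_runtime_secs[algorithm]


safe = StandInBlueprint()
safe.hyperparameter_tuning_max_runtime_secs["xgboost"] = 30 * 60  # key-wise update
print(safe.runtime_budget("xgboost"))  # 1800
print(safe.runtime_budget("lgbm"))     # 7200, default is still present

broken = StandInBlueprint()
broken.hyperparameter_tuning_max_runtime_secs = {"xgboost": 30 * 60}  # replaces all defaults
try:
    broken.runtime_budget("lgbm")
except KeyError as missing:
    print(f"missing key: {missing}")  # missing key: 'lgbm'
```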
This project uses poetry.
To install the project for development, run:
poetry install
This will install all dependencies and development dependencies into a virtual environment.
To add or remove a dependency, use poetry add <package> or poetry remove <package> respectively. Use the --dev flag for development dependencies.
To build and publish the project, run
poetry publish --build
This project comes with documentation. To build the docs, run:
cd docs
make docs
You may then browse the HTML docs at docs/build/docs/index.html.
We welcome Pull Requests! Please make a PR against the develop branch.
Creator: Thomas Meißner – LinkedIn
Consultant: Gabriel Stephen Alexander – Github
Special thanks to: Alex McKenzie - LinkedIn