autoembedder

PyTorch autoencoder with additional embeddings layer for categorical data.

  • 0.2.5
  • PyPI

The Autoembedder

Introduction

The Autoembedder is an autoencoder with additional embedding layers for the categorical columns. Its usage is flexible, and hyperparameters like the number of layers can be easily adjusted and tuned. The data provided for training can be either a path to a Dask or Pandas DataFrame stored in the Parquet format or the DataFrame object directly.

Installation

If you are using Poetry, you can install the package with the following command:

poetry add autoembedder

If you are using pip, you can install the package with the following command:

pip install autoembedder

Installing dependencies

With Poetry:

poetry install

With pip:

pip install -r requirements.txt

Usage

0. Some imports

from autoembedder import Autoembedder, dataloader, fit

1. Create dataloaders

First, we create two dataloaders: one for the training data and one for the validation data. As a source, each accepts a path to a Parquet file, a path to a folder of Parquet files, or a Pandas/Dask DataFrame.

train_dl = dataloader(train_df)
valid_dl = dataloader(valid_df)

2. Set parameters

Now, we need to set the parameters. They are used for handling the data and for training the model. In this example, only training parameters are set; the Parameters section below lists all possible parameters. This should do it:

parameters = {
    "hidden_layers": [[25, 20], [20, 10]],
    "epochs": 10,
    "lr": 0.0001,
    "verbose": 1,
}

3. Initialize the autoembedder

Then, we need to initialize the autoembedder. In this example, we are not using any categorical features, so we can skip the embedding_sizes argument.

model = Autoembedder(parameters, num_cont_features=train_df.shape[1])

4. Train the model

Everything is set up. Now we can fit the model.

fit(parameters, model, train_dl, valid_dl)

Example

Check out this Jupyter notebook for an applied example using the Credit Card Fraud Detection dataset from Kaggle.

Parameters

This is a list of all parameters that can be passed to the Autoembedder for training. When using the training script, replace each _ in a parameter name with - and pass the parameters as command-line arguments. For boolean values, see the Comment column for how to pass them.
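The naming rule above can be sketched in a few lines of Python. This is only an illustration: to_cli_args is a hypothetical helper and not part of the autoembedder package. It assumes that booleans are passed as --name / --no-name pairs (as the Comment column shows) and that every other value is passed as a string after its flag.

```python
def to_cli_args(parameters: dict) -> list:
    """Hypothetical helper (not part of autoembedder): turn a parameter
    dict into a list of command-line arguments for the training script."""
    args = []
    for name, value in parameters.items():
        flag = name.replace("_", "-")  # the script expects - instead of _
        if isinstance(value, bool):
            # Booleans are passed as --flag / --no-flag without a value.
            args.append(f"--{flag}" if value else f"--no-{flag}")
        else:
            # Everything else is passed as "--flag value".
            args.extend([f"--{flag}", str(value)])
    return args


params = {
    "epochs": 20,
    "hidden_layers": [[12, 6], [6, 3]],
    "drop_last": False,
    "use_mps": True,
}
cli_args = to_cli_args(params)
print(cli_args)
```

The resulting list could be appended to ["python3", "training.py"] and handed to subprocess.run, which avoids manual shell quoting of the nested list.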

Run the training script

You can also simply use the training script:

python3 training.py \
--epochs 20 \
--train-input-path "path/to/your/train_data" \
--test-input-path "path/to/your/test_data" \
--hidden-layers "[[12, 6], [6, 3]]"

For help, just run:

python3 training.py --help

The available parameters are:

| Argument | Type | Required | Default value | Comment |
|---|---|---|---|---|
| batch_size | int | False | 32 | |
| drop_last | bool | False | True | --drop-last / --no-drop-last |
| pin_memory | bool | False | True | --pin-memory / --no-pin-memory |
| num_workers | int | False | 0 | 0 means that the data will be loaded in the main process |
| use_mps | bool | False | False | --use-mps / --no-use-mps |
| model_title | str | False | autoembedder_{datetime}.bin | |
| model_save_path | str | False | | |
| n_save_checkpoints | int | False | | |
| lr | float | False | 0.001 | |
| amsgrad | bool | False | False | --amsgrad / --no-amsgrad |
| epochs | int | True | | |
| dropout_rate | float | False | 0 | Dropout rate for the dropout layers in the encoder and decoder. |
| layer_bias | bool | False | True | --layer-bias / --no-layer-bias |
| weight_decay | float | False | False | |
| l1_lambda | float | False | 0 | |
| xavier_init | bool | False | False | --xavier-init / --no-xavier-init |
| activation | str | False | tanh | Activation function; either tanh, relu, leaky_relu or elu |
| tensorboard_log_path | str | False | | |
| trim_eval_errors | bool | False | False | --trim-eval-errors / --no-trim-eval-errors; Removes the max and min loss when calculating the mean loss diff and median loss diff. This can be useful if some rows create very high losses. |
| verbose | int | False | 0 | Set this to 1 if you want to see the model summary and the validation and evaluation results. Set this to 2 if you want to see the training progress bar. 0 means no output. |
| target | str | False | | The target column. If not set, no evaluation will be performed. |
| train_input_path | str | True | | |
| test_input_path | str | True | | |
| eval_input_path | str | False | | Path to the evaluation data. If no path is provided, no evaluation will be performed. |
| hidden_layers | str | True | | Contains a string representation of a list of lists of integers which represents the hidden layer structure. E.g.: "[[64, 32], [32, 16], [16, 8]]" |
| cat_columns | str | False | "[]" | Contains a string representation of a list of lists of categorical columns (strings). The columns which use the same encoder should be together in a list. E.g.: "[['a', 'b'], ['c']]". |
| drop_cat_columns | bool | False | | --drop-cat-columns / --no-drop-cat-columns |
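
The hidden_layers and cat_columns arguments are passed as string representations of nested lists. If you want to check such a string before handing it to the training script, a minimal sketch using only the standard library could look like this. Note that parse_nested_list is a hypothetical helper, and how the training script itself parses these strings is not shown here, so treat the use of ast.literal_eval as an assumption.

```python
import ast


def parse_nested_list(text: str, item_type: type) -> list:
    """Hypothetical helper: parse a string such as "[[64, 32], [32, 16]]"
    into a list of lists and check the type of every item."""
    value = ast.literal_eval(text)  # evaluates Python literals only
    if not isinstance(value, list) or not all(isinstance(g, list) for g in value):
        raise ValueError(f"expected a list of lists, got: {text!r}")
    for group in value:
        for item in group:
            if not isinstance(item, item_type):
                raise ValueError(f"unexpected item {item!r} in {text!r}")
    return value


hidden_layers = parse_nested_list("[[64, 32], [32, 16], [16, 8]]", int)
cat_columns = parse_nested_list("[['a', 'b'], ['c']]", str)
no_cat_columns = parse_nested_list("[]", str)  # the default for cat_columns
print(hidden_layers, cat_columns, no_cat_columns)
```

A flat list such as "[1, 2]" raises a ValueError in this sketch, which catches a common quoting mistake early.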

Why additional embedding layers?

The additional embedding layers automatically embed all columns with the Pandas category data type. Categorical columns with any other data type are not embedded and are handled like continuous columns. Simply encoding the categorical values as numbers (e.g., with a label encoder) decreases the quality of the outcome.
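
Because only columns with the category data type are embedded, it can help to convert categorical columns explicitly before creating the dataloaders. The following is a minimal sketch with made-up example data; the column names are assumptions for illustration, and the final dtype check only shows which columns carry the category data type, not how the Autoembedder selects them internally.

```python
import pandas as pd

# Made-up example data: one continuous and two categorical columns.
df = pd.DataFrame(
    {
        "amount": [12.5, 80.0, 3.2, 41.9],
        "country": ["DE", "US", "DE", "FR"],
        "channel": ["web", "app", "app", "web"],
    }
)

# Convert the categorical columns to the Pandas category data type.
cat_cols = ["country", "channel"]
df = df.astype({col: "category" for col in cat_cols})

# Columns with the category dtype versus all remaining columns.
embedded = df.select_dtypes(include="category").columns.tolist()
continuous = df.select_dtypes(exclude="category").columns.tolist()
print(embedded, continuous)
```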
