A wrapper toolbox that provides compatibility layers between TPOT and Auto-Sklearn and OpenML
Arbok (Automl wrapper toolbox for openml compatibility) provides wrappers for TPOT and Auto-Sklearn, as a compatibility layer between these tools and OpenML.

The wrapper extends scikit-learn’s ``BaseSearchCV`` and provides all the
internal attributes that OpenML needs, such as ``cv_results_``,
``best_index_``, ``best_params_``, ``best_score_`` and ``classes_``.
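As a rough illustration of how these attributes relate to each other in a scikit-learn search object, the sketch below uses only the standard library. The ``cv_results`` dictionary and its scores are made-up values, not output from arbok, and the variable names without trailing underscores are stand-ins for the real attributes.

```python
# Illustrative stand-in for a fitted search object's cv_results_ (values are invented).
cv_results = {
    "params": [{"C": 0.1}, {"C": 1.0}, {"C": 10.0}],
    "mean_test_score": [0.71, 0.78, 0.74],
}

scores = cv_results["mean_test_score"]

# best_index_ points at the candidate with the highest mean test score.
best_index = max(range(len(scores)), key=scores.__getitem__)

# best_params_ and best_score_ are the entries of cv_results_ at that index.
best_params = cv_results["params"][best_index]
best_score = scores[best_index]

print(best_index, best_params, best_score)
```

OpenML reads these values to record which candidate won the search, which is why a wrapper around TPOT or Auto-Sklearn has to expose them.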
::

    pip install arbok

.. code:: python

    import openml
    from arbok import AutoSklearnWrapper, TPOTWrapper

    task = openml.tasks.get_task(31)
    dataset = task.get_dataset()

    # Get the AutoSklearn wrapper and pass parameters like you would to AutoSklearn
    clf = AutoSklearnWrapper(
        time_left_for_this_task=3600, per_run_time_limit=360
    )

    # Or get the TPOT wrapper and pass parameters like you would to TPOT
    clf = TPOTWrapper(
        generations=100, population_size=100, verbosity=2
    )

    # Execute the task
    run = openml.runs.run_model_on_task(task, clf)
    run.publish()

    print('URL for run: %s/run/%d' % (openml.config.server, run.run_id))

To make the wrapper more robust, we preprocess the data: we fill in missing values and one-hot encode categorical features.

First, we get a mask that tells us whether each feature is categorical.

.. code:: python

    dataset = task.get_dataset()
    _, categorical = dataset.get_data(return_categorical_indicator=True)
    categorical = categorical[:-1]  # Remove last index (which is the class)

Next, we set up a pipeline for the preprocessing. We use a
``ConditionalImputer``, an imputer that can apply different strategies
to categorical (nominal) and numerical data.

.. code:: python

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder
    from arbok import ConditionalImputer

    preprocessor = make_pipeline(
        ConditionalImputer(
            categorical_features=categorical,
            strategy="mean",
            strategy_nominal="most_frequent"
        ),
        OneHotEncoder(
            categorical_features=categorical, handle_unknown="ignore", sparse=False
        )
    )

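To make the idea of conditional imputation concrete, here is a minimal standard-library sketch that mirrors the ``strategy="mean"`` and ``strategy_nominal="most_frequent"`` settings above. It is not arbok's implementation: the ``conditional_impute`` helper is hypothetical, and it assumes that missing values are represented as ``None``.

```python
from collections import Counter
from statistics import mean


def conditional_impute(rows, categorical):
    """Fill None values column by column (hypothetical helper, not arbok's API).

    Columns flagged True in `categorical` receive their most frequent value;
    all other columns receive the mean of their observed values.
    """
    columns = list(zip(*rows))
    fills = []
    for column, is_categorical in zip(columns, categorical):
        observed = [value for value in column if value is not None]
        if is_categorical:
            fills.append(Counter(observed).most_common(1)[0][0])
        else:
            fills.append(mean(observed))
    # Replace each missing cell with the fill value of its column.
    return [
        [fill if value is None else value for value, fill in zip(row, fills)]
        for row in rows
    ]


rows = [
    [1.0, "red"],
    [None, "blue"],
    [3.0, None],
    [5.0, "red"],
]
filled = conditional_impute(rows, categorical=[False, True])
print(filled)
```

The boolean list plays the same role as the ``categorical`` mask obtained from the dataset: it decides, per column, which strategy is applied.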
And finally, we put everything together in one of the wrappers.

.. code:: python

    clf = AutoSklearnWrapper(
        preprocessor=preprocessor, time_left_for_this_task=3600, per_run_time_limit=360
    )

Limitations
-----------

- Currently only the classifiers are implemented. Regression is
  therefore not possible.
- For TPOT, the ``config_dict`` variable cannot be set, because this
  causes problems with the API.

Benchmarking
------------

Installing the ``arbok`` package also installs the ``arbench`` CLI tool. We
can generate a JSON configuration file like this:

.. code:: python

    from arbok.bench import Benchmark

    bench = Benchmark()
    config_file = bench.create_config_file(

        # Wrapper parameters
        wrapper={"refit": True, "verbose": False, "retry_on_error": True},

        # TPOT parameters
        tpot={
            "max_time_mins": 6,              # Max total time in minutes
            "max_eval_time_mins": 1          # Max time per candidate in minutes
        },

        # Autosklearn parameters
        autosklearn={
            "time_left_for_this_task": 360,  # Max total time in seconds
            "per_run_time_limit": 60         # Max time per candidate in seconds
        }
    )

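Note that the two tools measure their budgets in different units: TPOT in minutes, Auto-Sklearn in seconds. When both should receive the same budget, a small helper can derive the two dictionaries from one pair of values. The ``matched_budgets`` function below is a hypothetical convenience, not part of arbok, and it assumes budgets that are whole minutes.

```python
def matched_budgets(total_seconds, per_candidate_seconds):
    """Build equal time budgets for TPOT (minutes) and Auto-Sklearn (seconds).

    Hypothetical helper, not part of arbok. Only whole minutes are accepted
    so that the TPOT values stay integers.
    """
    if total_seconds % 60 or per_candidate_seconds % 60:
        raise ValueError("budgets must be multiples of 60 seconds")
    tpot = {
        "max_time_mins": total_seconds // 60,
        "max_eval_time_mins": per_candidate_seconds // 60,
    }
    autosklearn = {
        "time_left_for_this_task": total_seconds,
        "per_run_time_limit": per_candidate_seconds,
    }
    return tpot, autosklearn


tpot_params, autosklearn_params = matched_budgets(360, 60)
print(tpot_params)
print(autosklearn_params)
```

The resulting dictionaries can be passed as the ``tpot`` and ``autosklearn`` arguments of ``create_config_file``.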

And then, we can call ``arbench`` like this:

.. code:: bash

    arbench --classifier autosklearn --task-id 31 --config config.json

Or call ``arbok`` as a Python module:

.. code:: bash

    python -m arbok --classifier autosklearn --task-id 31 --config config.json

Running a benchmark on batch systems
------------------------------------

To run a large-scale benchmark, we can create a configuration file as
above, then generate and submit jobs to a batch system as follows.

.. code:: python

    import openml
    from arbok.bench import Benchmark

    # We create a benchmark setup where we specify the headers, the interpreter we
    # want to use, the directory where we store the jobs (.sh-files), and we give
    # it the config-file we created earlier.
    bench = Benchmark(
        headers="#PBS -lnodes=1:cpu3\n#PBS -lwalltime=1:30:00",
        python_interpreter="python3",  # Path to interpreter
        root="/path/to/project/",
        jobs_dir="jobs",
        config_file="config.json",
        log_file="log.json"
    )

    # Create the config file like we did in the section above
    config_file = bench.create_config_file(

        # Wrapper parameters
        wrapper={"refit": True, "verbose": False, "retry_on_error": True},

        # TPOT parameters
        tpot={
            "max_time_mins": 6,              # Max total time in minutes
            "max_eval_time_mins": 1          # Max time per candidate in minutes
        },

        # Autosklearn parameters
        autosklearn={
            "time_left_for_this_task": 360,  # Max total time in seconds
            "per_run_time_limit": 60         # Max time per candidate in seconds
        }
    )

    # Next, we load the tasks we want to benchmark on from OpenML.
    # In this case, we load a list of task IDs from study 99.
    tasks = openml.study.get_study(99).tasks

    # Next, we create jobs for both tpot and autosklearn.
    bench.create_jobs(tasks, classifiers=["tpot", "autosklearn"])

    # And finally, we submit the jobs using qsub
    bench.submit_jobs()

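For intuition about what a generated job might contain, the sketch below combines the scheduler headers, the interpreter, and the ``python -m arbok`` command shown earlier into the text of one shell script. The exact files that ``create_jobs`` writes are not shown in this document, so ``render_job_script``, the shebang line, and the overall layout are assumptions for illustration only.

```python
def render_job_script(headers, python_interpreter, classifier, task_id, config_file):
    """Compose the text of one batch job (hypothetical layout, not arbok's own)."""
    command = (
        f"{python_interpreter} -m arbok "
        f"--classifier {classifier} --task-id {task_id} --config {config_file}"
    )
    # Assumed layout: shebang, scheduler directives, then the benchmark command.
    return "\n".join(["#!/bin/bash", headers, command]) + "\n"


script = render_job_script(
    headers="#PBS -lnodes=1:cpu3\n#PBS -lwalltime=1:30:00",
    python_interpreter="python3",
    classifier="tpot",
    task_id=31,
    config_file="config.json",
)
print(script)
```

One script per (task, classifier) pair keeps the jobs independent, so the batch system can schedule them in any order.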
Preprocessing parameters
------------------------

.. code:: python

    from arbok import ParamPreprocessor
    import numpy as np
    from sklearn.feature_selection import VarianceThreshold
    from sklearn.pipeline import make_pipeline

    X = np.array([
        [1, 2, True, "foo", "one"],
        [1, 3, False, "bar", "two"],
        [np.nan, "bar", None, None, "three"],
        [1, 7, 0, "zip", "four"],
        [1, 9, 1, "foo", "five"],
        [1, 10, 0.1, "zip", "six"]
    ], dtype=object)

    # Manually specify types, or use types="detect" to automatically detect types
    types = ["numeric", "mixed", "bool", "nominal", "nominal"]

    pipeline = make_pipeline(ParamPreprocessor(types="detect"), VarianceThreshold())

    pipeline.fit_transform(X)

Output:

::

    [[-0.4472136 -0.4472136 1.41421356 -0.70710678 -0.4472136 -0.4472136
      2.23606798 -0.4472136 -0.4472136 -0.4472136 0.4472136 -0.4472136
      -0.85226648 1. ]
     [-0.4472136 2.23606798 -0.70710678 -0.70710678 -0.4472136 -0.4472136
      -0.4472136 -0.4472136 -0.4472136 2.23606798 0.4472136 -0.4472136
      -0.5831297 -1. ]
     [ 2.23606798 -0.4472136 -0.70710678 -0.70710678 -0.4472136 -0.4472136
      -0.4472136 -0.4472136 2.23606798 -0.4472136 -2.23606798 2.23606798
      -1.39054004 -1. ]
     [-0.4472136 -0.4472136 -0.70710678 1.41421356 -0.4472136 2.23606798
      -0.4472136 -0.4472136 -0.4472136 -0.4472136 0.4472136 -0.4472136
      0.49341743 -1. ]
     [-0.4472136 -0.4472136 1.41421356 -0.70710678 2.23606798 -0.4472136
      -0.4472136 -0.4472136 -0.4472136 -0.4472136 0.4472136 -0.4472136
      1.031691 1. ]
     [-0.4472136 -0.4472136 -0.70710678 1.41421356 -0.4472136 -0.4472136
      -0.4472136 2.23606798 -0.4472136 -0.4472136 0.4472136 -0.4472136
      1.30082778 1. ]]

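The recurring values in this output are consistent with indicator (one-hot) columns that were subsequently standardized. This is an inference from the numbers, not something the text above states about ``ParamPreprocessor``. The standard-library sketch below computes z-scores with the population standard deviation for indicator columns of six rows; the ``standardize`` helper is illustrative only.

```python
from math import sqrt


def standardize(column):
    """Return z-scores using the population standard deviation."""
    mean = sum(column) / len(column)
    std = sqrt(sum((value - mean) ** 2 for value in column) / len(column))
    return [(value - mean) / std for value in column]


# An indicator column that is 1 in one of six rows.
one_of_six = standardize([1, 0, 0, 0, 0, 0])
# An indicator column that is 1 in two of six rows.
two_of_six = standardize([1, 0, 0, 1, 0, 0])
# An indicator column that is 1 in three of six rows.
three_of_six = standardize([1, 0, 0, 0, 1, 1])

for name, column in [("1/6", one_of_six), ("2/6", two_of_six), ("3/6", three_of_six)]:
    print(name, sorted({round(value, 8) for value in column}))
```

A column with a single 1 yields the pair √5 and −1/√5, a column with two 1s yields √2 and −1/√2, and a balanced column yields ±1, which matches the values that dominate the array above.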