BlueCast
A lightweight and fast auto-ml library that helps data scientists
tackle real-world problems, from EDA to model explainability
and even uncertainty quantification.
BlueCast focuses on a few model architectures (by default, Xgboost
only) and a few preprocessing options (only what is
needed for Xgboost). This allows for a much faster development
cycle and a much more stable codebase while keeping the library's
dependencies to a minimum. Although lightweight at its core, BlueCast
offers extensive customization options for advanced users. Find
the full documentation here.
Here you can see our test coverage in more detail:
Philosophy
There are plenty of excellent automl solutions available.
With BlueCast we don't follow the usual path ("Give me your data, we return the
best model ensemble out of X algorithms"), but have the real-world data
scientist in mind. Our philosophy can be summarized as follows:
- automl should not be a black box
- automl shall be a help rather than a replacement
- automl shall not be a closed system
- automl should be easy to install
- explainability over one more decimal place of precision
- real world value over pure performance
We support our users with an end-to-end toolkit that allows fast and rich EDA,
convenient modelling, explainability, evaluation and even
uncertainty quantification.
What BlueCast has to offer
Basic usage
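from bluecast.blueprints.cast import BlueCast

automl = BlueCast(
    class_problem="binary",
)

# df_train must contain the target column
automl.fit(df_train, target_col="target")

# predict returns both the probabilities and the predicted classes
y_probs, y_classes = automl.predict(df_val)

# predict_proba returns the probabilities directly
y_probs = automl.predict_proba(df_val)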
Convenience features
Despite being a lightweight library, BlueCast includes the following
convenience features:
- rich library of EDA functions to visualize and understand the data
- plenty of customization options via an open API
- inbuilt uncertainty quantification framework (conformal prediction)
- hyperparameter tuning (with lots of customization available)
- automatic feature type detection and casting
- automatic DataFrame schema detection: checks if unseen data has new or
missing columns
- categorical feature encoding (target encoding or directly in Xgboost)
- datetime feature encoding
- automated GPU availability check and usage for Xgboost
- a fit_eval method to fit a model and evaluate it on a validation set
to mimic production environment reality
- functions to save and load a trained pipeline
- Shapley values
- ROC AUC curve & lift chart
- warnings for potential misconfigurations
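The schema detection item in the list above can be pictured with a small sketch. This is not BlueCast's implementation: `compare_schema` is a hypothetical helper, written in plain Python, that only illustrates the idea of comparing the columns seen during training with the columns of unseen data.

```python
from typing import List, Sequence, Tuple


def compare_schema(
    train_columns: Sequence[str], new_columns: Sequence[str]
) -> Tuple[List[str], List[str]]:
    """Return (missing, unexpected) column names.

    Hypothetical helper for illustration only, not part of BlueCast.
    missing:    columns seen during training but absent in the new data
    unexpected: columns in the new data that were not seen during training
    """
    train_set = set(train_columns)
    new_set = set(new_columns)
    # Preserve the original column order in both result lists
    missing = [col for col in train_columns if col not in new_set]
    unexpected = [col for col in new_columns if col not in train_set]
    return missing, unexpected


# Toy column names, made up for this example
missing, unexpected = compare_schema(
    ["age", "income", "city"],
    ["age", "city", "zip_code"],
)
print("missing:", missing)
print("unexpected:", unexpected)
```

With pandas DataFrames, the same helper could be called as `compare_schema(list(df_train.columns), list(df_val.columns))`.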
The fit_eval method can be used like this:
from bluecast.blueprints.cast import BlueCast

automl = BlueCast(
    class_problem="binary",
)

automl.fit_eval(df_train, df_eval, y_eval, target_col="target")
y_probs, y_classes = automl.predict(df_val)
It is important to note that df_train contains the target column while
df_eval does not. The target column is passed separately as y_eval.
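One way to prepare the three inputs described above is sketched here with pandas. The helper `split_for_fit_eval`, the toy data and the 20% evaluation fraction are assumptions made for illustration, not part of BlueCast. The sketch also assumes the DataFrame has a unique index.

```python
import pandas as pd


def split_for_fit_eval(
    df: pd.DataFrame,
    target_col: str,
    eval_fraction: float = 0.2,
    random_state: int = 42,
):
    """Split df into (df_train, df_eval, y_eval).

    Hypothetical helper for illustration only, not part of BlueCast.
    df_train keeps the target column, df_eval has it removed and
    y_eval holds the evaluation targets separately.
    """
    if target_col not in df.columns:
        raise KeyError(f"Column '{target_col}' not found in DataFrame")
    # Randomly sample the evaluation rows (assumes a unique index)
    eval_rows = df.sample(frac=eval_fraction, random_state=random_state)
    df_train = df.drop(index=eval_rows.index)
    y_eval = eval_rows[target_col]
    df_eval = eval_rows.drop(columns=[target_col])
    return df_train, df_eval, y_eval


# Toy data, made up for this example
toy = pd.DataFrame(
    {
        "feature_a": range(10),
        "feature_b": [x * 0.5 for x in range(10)],
        "target": [0, 1] * 5,
    }
)
df_train, df_eval, y_eval = split_for_fit_eval(toy, target_col="target")

# The results can then be passed on as in the snippet above:
# automl.fit_eval(df_train, df_eval, y_eval, target_col="target")
```

A random split is only one option. For time-dependent data, an evaluation set taken from the most recent rows may mimic production more closely.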
Kaggle competition results and example notebooks
Even though BlueCast has been designed to be a lightweight
automl framework, it can still reach very good performance.
We tested BlueCast in Kaggle competitions to showcase
the library's capabilities, both feature- and performance-wise.
- ICR top 20% finish with over 6000 participants (notebook)
- An advanced example covering lots of functionalities (notebook)
- PS3E23: Predict software defects top 12% finish (notebook)
- PS3E25: Predict hardness of steel via regression (notebook)
- PS4E1: Bank churn top 13% finish (notebook)
- A comprehensive guide about BlueCast showing many capabilities (notebook)
- BlueCast using a custom Catboost model for quantile regression
and adding conformal prediction (notebook)
- 26th place in the Kaggle 24h "AutoMl" GrandPrix July 2024 blitz competition (notebook)
Please note that some notebooks ran older versions of BlueCast and
might not be compatible with the most recent version anymore.
About the code
Code quality
To ensure code quality, we use the following tools:
- various pre-commit libraries
- strong type hinting in the code base
- unit tests using Pytest
For contributors, it is expected that all pre-commit and unit tests pass.
For new features it is expected that unit tests are added.
Documentation
Documentation is provided via Read the Docs.
On GitHub we offer multiple ReadMes that cover all aspects of working
with BlueCast.
How to contribute
Contributions are welcome. Please follow these steps:
- Get in touch with me (e.g. via LinkedIn) if you are interested in a larger contribution
- Create a new branch from develop branch
- Add your feature or fix
- Add unit tests for new features
- Run pre-commit checks and unit tests (using Pytest)
- Adjust the docs/source/index.md file
- Copy and paste the content of the docs/source/index.md file into the README.md file
- Push your changes and create a pull request
If library or dev dependencies have to be changed, adjust the pyproject.toml.
For readthedocs it is also required to update the
docs/rtd_requirements.txt
file. Simply run:
poetry export --with dev -f requirements.txt --output docs/rtd_requirements.txt
Whether readthedocs will be able to build the documentation can be tested via:
poetry run sphinx-autobuild docs/source docs/build/html
This will show a localhost link containing the documentation.
Support us
Being a small open source project, we rely on the community. Please
consider giving us a GitHub star and spreading the word. Your feedback
will also help the project evolve.
Meta
Creator: Thomas Meißner – LinkedIn