Simplistic algorithms to train decision trees for regression and classification
- Create binary trees using the CART (Classification and Regression Tree) algorithm.
- Train ensembles of trees using Gradient Boosting, Adaptive Boosting (AdaBoost) or Random Forest.
- Process each data type with one of the provided data handlers, or implement your own to suit your needs. Just add your implementation to the factory.
- Metrics for different kinds of outcome variables are implemented analogously.
- Features with high cardinality are treated with a simulated annealing solver to find the best combination.
- No need for dummy encoding.
- Train models using supervised or unsupervised learning.
- Specify weights for unbalanced datasets.
NOTE: These pure Python (and a bit of NumPy) algorithms are many times slower than, e.g., sklearn or xgboost.
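To illustrate the simulated annealing idea for high-cardinality features, here is a minimal, standard-library-only sketch. It is not the solver shipped with this package: the function names (`gini`, `split_loss`, `anneal_subset`), the Gini loss, and the cooling schedule are all assumptions chosen for illustration. The search looks for the subset of categories that should go to the left branch of a split.

```python
import math
import random


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split_loss(subset, xs, ys):
    """Weighted Gini impurity when the categories in `subset` go left."""
    left = [y for x, y in zip(xs, ys) if x in subset]
    right = [y for x, y in zip(xs, ys) if x not in subset]
    n = len(ys)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def anneal_subset(xs, ys, steps=2000, t0=1.0, cooling=0.995, seed=0):
    """Search for a good left-branch subset of categories (hypothetical helper)."""
    rng = random.Random(seed)
    cats = sorted(set(xs))
    current = {c for c in cats if rng.random() < 0.5}
    current_loss = split_loss(current, xs, ys)
    best, best_loss = set(current), current_loss
    t = t0
    for _ in range(steps):
        # Neighbour state: flip the membership of one random category.
        candidate = set(current)
        candidate.symmetric_difference_update({rng.choice(cats)})
        loss = split_loss(candidate, xs, ys)
        # Always accept improvements; accept worse states with a
        # probability that shrinks as the temperature t decreases.
        if loss <= current_loss or rng.random() < math.exp((current_loss - loss) / t):
            current, current_loss = candidate, loss
            if loss < best_loss:
                best, best_loss = set(candidate), loss
        t *= cooling
    return best, best_loss
```

With k categories there are 2^k possible subsets, so an exhaustive search quickly becomes infeasible. A stochastic search like this one trades the guarantee of optimality for a bounded number of loss evaluations.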
pip install binarybeech[visualize]
The dependencies installed using the visualize option enable support for plotting and formatting trees.
Load the Classification And Regression Tree model class
import pandas as pd
from binarybeech.binarybeech import CART
from binarybeech.extra import k_fold_split
get the data from a csv file
df = pd.read_csv("data/titanic.csv")
[(df_train, df_test)] = k_fold_split(df,frac=0.75,random=True,replace=False)
grow a decision tree
c = CART(df=df_train,y_name="Survived", method="classification")
validation metrics
Please have a look at the jupyter notebooks in this repository for more examples. To try them out online, you can use
CART(df, y_name, X_names=None, min_leaf_samples=1, min_split_samples=1, max_depth=10, method="regression", handle_missings="simple", attribute_handlers=None)
Class for a Classification and Regression Tree (CART) model.
- Parameters
- df: pandas dataframe with training data
- y_name: name of the column with the output data/labels
- X_names: list of names with the inputs to use for the modelling. If None, all columns except y_name are chosen. Default is None.
- min_leaf_samples: If the number of training samples is lower than this, a terminal node (leaf) is created. Default is 1.
- min_split_samples: If a split of the training data is proposed with at least one branch containing fewer samples than this, the split is rejected. Default is 1.
- max_depth: Maximum number of sequential splits. This corresponds to the number of vertical layers of the tree. Default is 10, which corresponds to a maximum number of 1024 terminal nodes.
- method: Metrics to use for the evaluation of split loss, etc. Can be either "classification", "logistic", "regression", or None. Default is "regression". If None is chosen, the method is deduced from the training dataframe.
- handle_missings: Specifies how missing data is handled. Can be either None or "simple".
- attribute_handlers: dict with attribute handler instances for each variable. The data handler determines, e.g., how splits of the dataset are made.
- Methods
- predict(df):
- Parameters:
- df: dataframe with inputs for predictions.
- Returns:
- array with predicted values/labels.
- train(k=5, plot=True, slack=1.0):
- Parameters:
- k: number of different splits of the dataframe into training and test sets for k-fold cross-validation.
- plot: flag for plotting a diagram of the loss over cost complexity parameter alpha using matplotlib.
- slack: the amount of slack granted in choosing the best cost complexity parameter alpha. It is given as a multiplier for the standard deviation of the alpha at minimum loss and thus allows choosing an alpha that is probably larger, to account for the uncertainty in the k-fold cross-validation procedure.
- Returns:
- create_tree(leaf_loss_threshold=1e-12)
- prune(alpha_max=None, test_set=None, metrics_only=False)
- Parameters:
- alpha_max: Stop the pruning procedure at this value of the cost complexity parameter alpha. If None, the tree is pruned down to its root giving the complete relationship between alpha and the loss. Default is None.
- test_set: data set to use for the evaluation of the losses. If None, the training set is used. Default is None.
- metrics_only: If True, pruning is performed on a copy of the tree, leaving the actual tree intact. Default is False.
- validate(df=None)
- Parameters:
- df: dataframe to use for (cross-)validation. If None, the training set is used. Default is None.
- Returns:
- dict with metrics, e.g. accuracy or RSquared.
- Attributes
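The role of the slack parameter in train can be illustrated with a rule in the spirit of the classic one-standard-error rule. The sketch below is an assumption about the general idea, not a copy of what train does internally: the function name `choose_alpha` and its inputs are hypothetical, and it applies the slack to the spread of the cross-validated loss at the minimum.

```python
import statistics


def choose_alpha(alphas, fold_losses, slack=1.0):
    """Pick the largest alpha whose mean loss stays within `slack` standard
    deviations of the minimum mean loss (hypothetical helper).

    alphas      -- candidate cost complexity parameters
    fold_losses -- fold_losses[i] holds the k validation losses for alphas[i]
    """
    means = [statistics.mean(losses) for losses in fold_losses]
    stds = [statistics.stdev(losses) for losses in fold_losses]
    i_min = min(range(len(alphas)), key=means.__getitem__)
    threshold = means[i_min] + slack * stds[i_min]
    # A larger alpha means a smaller tree, so prefer the largest admissible one.
    admissible = [a for a, m in zip(alphas, means) if m <= threshold]
    return max(admissible)


alphas = [0.0, 0.01, 0.05, 0.2]
fold_losses = [
    [0.30, 0.34, 0.32],
    [0.28, 0.32, 0.30],
    [0.29, 0.33, 0.31],
    [0.45, 0.50, 0.40],
]
print(choose_alpha(alphas, fold_losses, slack=1.0))
print(choose_alpha(alphas, fold_losses, slack=0.0))
```

With slack=0 the rule simply returns the alpha at the minimum mean loss. A positive slack accepts a slightly worse loss in exchange for a smaller, more heavily pruned tree.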
GradientBoostedTree(df, y_name, X_names=None, sample_frac=1, n_attributes=None, learning_rate=0.1, cart_settings={}, init_method="logistic", gamma=None, handle_missings="simple", s=None)
Class for a Gradient Boosted Tree model.
- Parameters
- df: pandas dataframe with training data
- y_name: name of the column with the output data/labels
- X_names: list of names with the inputs to use for the modelling. If None, all columns except y_name are chosen. Default is None.
- sample_frac: fraction (0, 1] of the training data to use for the training of an individual tree of the ensemble. Default is 1.
- n_attributes: number of attributes (elements of the X_names list) to use for the training of an individual tree of the ensemble. Default is None which corresponds to all available attributes.
- learning_rate: the shrinkage parameter used to "downweight" individual trees of the ensemble. Default is 0.1.
- cart_settings: dict that is passed on to the constructor of the individual tree (binarybeech.binarybeech.CART). For details cf. above.
- init_method: Metrics to use for the evaluation of split loss, etc. of the initial tree (stump). Can be either "classification", "logistic", "regression", or None. Default is "logistic". If None is chosen, the method is deduced from the training dataframe.
- gamma: weight for individual trees of the ensemble. If None, the weight for each tree is chosen by line search minimizing the loss given by init_method.
- handle_missings: Specifies how missing data is handled. Can be either None or "simple".
- attribute_handlers: dict with data handler instances for each variable. The data handler determines, e.g., how splits of the dataset are made.
- Methods
- predict(df)
- Parameters:
- df: dataframe with inputs for predictions.
- Returns:
- array with predicted values/labels.
- train(M)
- Parameters:
- M: Number of individual trees to create for the ensemble.
- Returns:
- validate(df=None)
- Parameters:
- df: dataframe to use for (cross-)validation. If None, the training set is used. Default is None.
- Returns:
- dict with metrics, e.g. accuracy or RSquared.
- Attributes
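How learning_rate and the number of trees M interact can be seen in a small, self-contained sketch of gradient boosting for squared loss. This is not the GradientBoostedTree implementation: the helpers `fit_stump` and `gradient_boost` are hypothetical, the weak learner is a one-split stump on a single numeric input, and every tree gets the same weight instead of one found by line search.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump minimising the squared error.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, M=50, learning_rate=0.1):
    """Return a prediction function built from M shrunken stumps."""
    base = sum(ys) / len(ys)  # initial model: the mean of the targets
    trees = []
    predictions = [base] * len(ys)
    for _ in range(M):
        # For squared loss the negative gradient is the plain residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        predictions = [p + learning_rate * tree(x) for x, p in zip(xs, predictions)]
    return lambda x: base + learning_rate * sum(tree(x) for tree in trees)


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1, 1, 1, 1, 5, 5, 5, 5]
model = gradient_boost(xs, ys, M=50, learning_rate=0.1)
print(model(2), model(7))
```

Each tree only removes a fraction learning_rate of the remaining residual, so a smaller learning rate generally needs a larger M to reach the same training loss.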
AdaBoostTree(training_data=None, df=None, y_name=None, X_names=None, sample_frac=1, n_attributes=None, cart_settings={}, method="classification", handle_missings="simple", attribute_handlers=None, seed=None, algorithm_kwargs={})
Class for an AdaBoost model using CARTs as weak learners.
- Parameters:
- training_data: Preprocessed instance of class TrainingData.
- df: pandas dataframe with training data
- y_name: name of the column with the output data/labels
- X_names: list of names with the inputs to use for the modelling. If None, all columns except y_name are chosen. Default is None.
- method: Metrics to use for the evaluation of split loss, etc. Can be either "classification", "logistic", "regression", or None. Default is "classification". If None is chosen, the method is deduced from the training dataframe.
- handle_missings: Specifies how missing data is handled. Can be either None or "simple".
- attribute_handlers: dict with attribute handler instances for each variable. The data handler determines, e.g., how splits of the dataset are made.
- Methods
- predict(df)
- Parameters:
- df: dataframe with inputs for predictions.
- Returns:
- array with predicted values/labels.
- train(M)
- Parameters:
- M: Number of individual trees to create for the ensemble.
- Returns:
- validate(df=None)
- Parameters:
- df: dataframe to use for (cross-)validation. If None, the training set is used. Default is None.
- Returns:
- dict with metrics, e.g. accuracy or RSquared.
- variable_importance():
- Returns:
- dict with normalized importance values.
- Attributes
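The reweighting idea behind AdaBoost can be sketched in a few lines. The example below is a textbook variant of discrete AdaBoost with labels coded as -1/+1 and decision stumps on one numeric input. It is an illustration under these assumptions, not the AdaBoostTree code, which uses CARTs as weak learners and may differ in detail. The names `best_stump` and `adaboost` are hypothetical.

```python
import math


def best_stump(xs, ys, weights):
    """Return (error, threshold, polarity) of the stump with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x <= threshold else -polarity for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best


def adaboost(xs, ys, M=10):
    """Train M weighted stumps and return a prediction function."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(M):
        err, threshold, polarity = best_stump(xs, ys, weights)
        err = min(max(err, 1e-12), 1 - 1e-12)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, threshold, polarity))
        # Misclassified samples gain weight, correctly classified ones lose it.
        weights = [
            w * math.exp(-alpha * y * (polarity if x <= threshold else -polarity))
            for w, x, y in zip(weights, xs, ys)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]

    def predict(x):
        score = sum(a * (p if x <= t else -p) for a, t, p in ensemble)
        return 1 if score >= 0 else -1

    return predict


xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
predict = adaboost(xs, ys, M=3)
print([predict(x) for x in xs])
```

No single stump can separate this data set, but the weighted vote of three stumps classifies every training sample correctly.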
RandomForest(df, y_name, X_names=None, verbose=False, sample_frac=1, n_attributes=None, cart_settings={}, method="regression", handle_missings="simple", attribute_handlers=None)
Class for a Random Forest model.
- Parameters
- df: pandas dataframe with training data
- y_name: name of the column with the output data/labels
- X_names: list of names with the inputs to use for the modelling. If None, all columns except y_name are chosen. Default is None.
- verbose: if set to True, status messages are sent to stdout. Default is False.
- sample_frac: fraction (0, 1] of the training data to use for the training of an individual tree of the ensemble. Default is 1.
- n_attributes: number of attributes (elements of the X_names list) to use for the training of an individual tree of the ensemble. Default is None which corresponds to all available attributes.
- cart_settings: dict that is passed on to the constructor of the individual tree (binarybeech.binarybeech.CART). For details cf. above.
- method: Metrics to use for the evaluation of split loss, etc. Can be either "classification", "logistic", "regression", or None. Default is "regression". If None is chosen, the method is deduced from the training dataframe.
- handle_missings: Specifies how missing data is handled. Can be either None or "simple".
- attribute_handlers: dict with attribute handler instances for each variable. The data handler determines, e.g., how splits of the dataset are made.
- Methods
- predict(df)
- Parameters:
- df: dataframe with inputs for predictions.
- Returns:
- array with predicted values/labels.
- train(M)
- Parameters:
- M: Number of individual trees to create for the ensemble.
- Returns:
- validate(df=None)
- Parameters:
- df: dataframe to use for (cross-)validation. If None, the training set is used. Default is None.
- Returns:
- dict with metrics, e.g. accuracy or RSquared.
- validate_oob():
- Returns:
- dict with metrics, e.g. accuracy or RSquared.
- variable_importance():
- Returns:
- dict with normalized importance values.
- Attributes
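The idea behind out-of-bag (OOB) validation can be sketched independently of the RandomForest class. The example assumes that each tree is trained on a sample drawn with replacement, which is the usual random forest setup but is not confirmed here for this package. The helpers `bootstrap_indices` and `oob_accuracy` are hypothetical.

```python
import random


def bootstrap_indices(n, sample_frac=1.0, rng=None):
    """Draw round(sample_frac * n) indices with replacement; the rest are out-of-bag."""
    rng = rng or random.Random()
    k = max(1, round(sample_frac * n))
    in_bag = [rng.randrange(n) for _ in range(k)]
    out_of_bag = sorted(set(range(n)) - set(in_bag))
    return in_bag, out_of_bag


def oob_accuracy(labels, fit, M=25, seed=0):
    """Estimate the accuracy from majority votes of trees that did not see a sample.

    `fit(in_bag)` must return a callable mapping a sample index to a label.
    """
    rng = random.Random(seed)
    n = len(labels)
    votes = [{} for _ in range(n)]
    for _ in range(M):
        in_bag, out_of_bag = bootstrap_indices(n, rng=rng)
        model = fit(in_bag)
        for i in out_of_bag:
            label = model(i)
            votes[i][label] = votes[i].get(label, 0) + 1
    # Only samples that were out-of-bag at least once can be scored.
    scored = [(max(v, key=v.get), labels[i]) for i, v in enumerate(votes) if v]
    return sum(p == y for p, y in scored) / len(scored)
```

Because every tree leaves out part of the training data, the left-out samples act as a validation set for that tree, so no separate hold-out set is needed for a rough estimate of the generalization error.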
Decision trees are, by design, data type agnostic. Given only a few methods, such as a splitter for the input variables and a meaningful quantification of the loss, any data type can be used. In this code, this is implemented using a factory pattern for data handling and metrics, making decision tree learning simple and versatile.
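A generic sketch of such a factory could look as follows. All class and method names (`NumericHandler`, `NominalHandler`, `HandlerFactory`, `check`, `split`, `register`, `create`) are made up for this illustration and are not the names used in this package. Rows are assumed to be plain dicts.

```python
class NumericHandler:
    """Splits a numeric attribute at a threshold."""

    def __init__(self, name):
        self.name = name

    @staticmethod
    def check(values):
        return all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
        )

    def split(self, rows, threshold):
        left = [r for r in rows if r[self.name] <= threshold]
        right = [r for r in rows if r[self.name] > threshold]
        return left, right


class NominalHandler:
    """Splits a categorical attribute by membership in a subset of categories."""

    def __init__(self, name):
        self.name = name

    @staticmethod
    def check(values):
        return all(isinstance(v, str) for v in values)

    def split(self, rows, subset):
        left = [r for r in rows if r[self.name] in subset]
        right = [r for r in rows if r[self.name] not in subset]
        return left, right


class HandlerFactory:
    """Returns the first registered handler that accepts the attribute's values."""

    def __init__(self):
        self._handlers = []

    def register(self, handler_class):
        self._handlers.append(handler_class)

    def create(self, name, values):
        for handler_class in self._handlers:
            if handler_class.check(values):
                return handler_class(name)
        raise ValueError(f"no handler registered for attribute {name!r}")


factory = HandlerFactory()
factory.register(NumericHandler)
factory.register(NominalHandler)

rows = [{"age": 22, "sex": "male"}, {"age": 38, "sex": "female"}]
age_handler = factory.create("age", [r["age"] for r in rows])
left, right = age_handler.split(rows, threshold=30)
```

Supporting a new data type then only requires a class with the same two methods and one call to register. The tree-growing code itself does not need to change.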
For more information please feel free to take a look at the code.
Decision tree
Gradient Boosted Tree
Random Forest
Contributions in the form of pull requests are always welcome.