New Case Study:See how Anthropic automated 95% of dependency reviews with Socket.Learn More
Socket
Sign inDemoInstall
Socket

xgbse

Package Overview
Dependencies
Maintainers
1
Alerts
File Explorer

Advanced tools

Socket logo

Install Socket

Detect and block malicious and high-risk dependencies

Install

xgbse

Improving XGBoost survival analysis with embeddings and debiased estimators

  • 0.3.3
  • PyPI
  • Socket score

Maintainers
1

DOI

xgbse: XGBoost Survival Embeddings

"There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown." - Leo Breiman, Statistical Modeling: The Two Cultures

Survival Analysis is a powerful statistical technique with a wide range of applications such as predictive maintenance, customer churn, credit risk, asset liquidity risk, and others.

However, it has not yet seen widespread adoption in industry, with most implementations embracing one of two cultures:

  1. models with sound statistical properties, but lacking in expressivess and computational efficiency
  2. highly efficient and expressive models, but lacking in statistical rigor

xgbse aims to unite the two cultures in a single package, adding a layer of statistical rigor to the highly expressive and computationally effcient xgboost survival analysis implementation.

The package offers:

  • calibrated and unbiased survival curves with confidence intervals (instead of point predictions)
  • great predictive power, competitive to vanilla xgboost
  • efficient, easy to use implementation
  • explainability through prototypes

This is a research project by Loft Data Science Team, however we invite the community to contribute. Please help by trying it out, reporting bugs, and letting us know what you think!

Installation

pip install xgbse

Usage

Basic usage

The package follows scikit-learn API, with a minor adaptation to work with time and event data (y as a numpy structured array of times and events). .predict() returns a dataframe where each column is a time window and values represent the probability of survival before or exactly at the time window.

# importing dataset from pycox package
from pycox.datasets import metabric

# importing model and utils from xgbse
from xgbse import XGBSEKaplanNeighbors
from xgbse.converters import convert_to_structured

# getting data
df = metabric.read_df()

# splitting to X, y format
X = df.drop(['duration', 'event'], axis=1)
y = convert_to_structured(df['duration'], df['event'])

# fitting xgbse model
xgbse_model = XGBSEKaplanNeighbors(n_neighbors=50)
xgbse_model.fit(X, y)

# predicting
event_probs = xgbse_model.predict(X)
event_probs.head()
index153045607590105120135150165180195210225240255270285
00.980.870.810.740.710.660.530.470.420.40.30.250.210.160.120.0980.0850.0620.054
10.990.890.790.70.640.580.530.460.420.390.330.310.30.240.210.180.160.110.095
20.940.780.630.570.540.490.440.370.340.320.260.230.210.160.130.110.0980.0720.062
30.990.950.930.880.840.810.730.670.570.520.450.370.330.280.230.190.160.120.1
40.980.920.820.770.720.680.630.60.570.550.510.480.450.420.380.330.30.220.2

You can also get interval predictions (probability of failing exactly at each time window) using return_interval_probs:

# point predictions
interval_probs = xgbse_model.predict(X_valid, return_interval_probs=True)
interval_probs.head()
index153045607590105120135150165180195210225240255270285
00.0240.10.0580.070.0340.050.120.0610.0490.0260.0960.0490.0390.0560.040.020.0130.0230.0078
10.0140.0970.0980.0930.0520.0650.0540.0680.0340.0380.0530.0190.0180.0520.0380.0270.0180.050.015
20.060.160.150.0540.0330.0530.0460.0730.0320.0140.060.030.0170.0550.0310.0160.0140.0270.0097
30.0110.0340.0210.0530.0380.0380.080.0520.10.0490.0750.0790.0370.0520.0530.0410.0260.040.017
40.0160.0670.0990.0460.050.0420.0510.0280.030.0180.0480.0220.0290.0380.0350.0470.0310.080.027

Survival curves and confidence intervals

XBGSEKaplanTree and XBGSEKaplanNeighbors support estimation of survival curves and confidence intervals via the Exponential Greenwood formula out-of-the-box via the return_ci argument:

# fitting xgbse model
xgbse_model = XGBSEKaplanNeighbors(n_neighbors=50)
xgbse_model.fit(X_train, y_train, time_bins=TIME_BINS)

# predicting
mean, upper_ci, lower_ci = xgbse_model.predict(X_valid, return_ci=True)

# plotting CIs
plot_ci(mean, upper_ci, lower_ci)

XGBSEDebiasedBCE does not support estimation of confidence intervals out-of-the-box, but we provide the XGBSEBootstrapEstimator to get non-parametric confidence intervals. As the stacked logistic regressions are trained with more samples (in comparison to neighbor-sets in XGBSEKaplanNeighbors), confidence intervals are more concentrated:

# base model as BCE
base_model = XGBSEDebiasedBCE(PARAMS_XGB_AFT, PARAMS_LR)

# bootstrap meta estimator
bootstrap_estimator = XGBSEBootstrapEstimator(base_model, n_estimators=20)

# fitting the meta estimator
bootstrap_estimator.fit(
    X_train,
    y_train,
    validation_data=(X_valid, y_valid),
    early_stopping_rounds=10,
    time_bins=TIME_BINS,
)

# predicting
mean, upper_ci, lower_ci = bootstrap_estimator.predict(X_valid, return_ci=True)

# plotting CIs
plot_ci(mean, upper_ci, lower_ci)

The bootstrap abstraction can be used for XBGSEKaplanTree and XBGSEKaplanNeighbors as well, however, the confidence interval will be estimated via bootstrap only (not Exponential Greenwood formula):

# base model
base_model = XGBSEKaplanTree(PARAMS_TREE)

# bootstrap meta estimator
bootstrap_estimator = XGBSEBootstrapEstimator(base_model, n_estimators=100)

# fitting the meta estimator
bootstrap_estimator.fit(
    X_train,
    y_train,
    time_bins=TIME_BINS,
)

# predicting
mean, upper_ci, lower_ci = bootstrap_estimator.predict(X_valid, return_ci=True)

# plotting CIs
plot_ci(mean, upper_ci, lower_ci)

With a sufficiently large n_estimators, interval width shouldn't be much different, with the added benefit of model stability and improved accuracy. Addittionaly, XGBSEBootstrapEstimator allows building confidence intervals for interval probabilities (which is not supported for Exponential Greenwood):

# predicting
mean, upper_ci, lower_ci = bootstrap_estimator.predict(
    X_valid,
    return_ci=True,
    return_interval_probs=True
)

# plotting CIs
plot_ci(mean, upper_ci, lower_ci)

The parameter ci_width controls the width of the confidence interval. For XGBSEKaplanTree it should be passed at .fit(), as KM curves are pre-calculated for each leaf at fit time to avoid storing training data.

# fitting xgbse model
xgbse_model = XGBSEKaplanTree(PARAMS_TREE)
xgbse_model.fit(X_train, y_train, time_bins=TIME_BINS, ci_width=0.99)

# predicting
mean, upper_ci, lower_ci = xgbse_model.predict(X_valid, return_ci=True)

# plotting CIs
plot_ci(mean, upper_ci, lower_ci)

For other models (XGBSEKaplanNeighbors and XGBSEBootstrapEstimator) it should be passed at .predict().

# base model
model = XGBSEKaplanNeighbors(PARAMS_XGB_AFT, N_NEIGHBORS)

# fitting the meta estimator
model.fit(
    X_train, y_train,
    validation_data = (X_valid, y_valid),
    early_stopping_rounds=10,
    time_bins=TIME_BINS
)

# predicting
mean, upper_ci, lower_ci = model.predict(X_valid, return_ci=True, ci_width=0.99)

# plotting CIs
plot_ci(mean, upper_ci, lower_ci)

Extrapolation

We provide an extrapolation interface, to deal with cases where the survival curve does not end at 0 due to censoring. Currently the implementation only supports extrapolation assuming constant risk.

from xgbse.extrapolation import extrapolate_constant_risk

# predicting
survival = bootstrap_estimator.predict(X_valid)

# extrapolating
survival_ext = extrapolate_constant_risk(survival, 450, 15)

Early stopping

A simple interface to xgboost early stopping is provided.

# splitting between train, and validation
(X_train, X_valid,
 y_train, y_valid) = \
train_test_split(X, y, test_size=0.2, random_state=42)

# fitting with early stopping
xgb_model = XGBSEDebiasedBCE()
xgb_model.fit(
    X_train,
    y_train,
    validation_data=(X_valid, y_valid),
    early_stopping_rounds=10,
    verbose_eval=50
)
[0]	validation-aft-nloglik:16.86713
Will train until validation-aft-nloglik hasn't improved in 10 rounds.
[50]	validation-aft-nloglik:3.64540
[100]	validation-aft-nloglik:3.53679
[150]	validation-aft-nloglik:3.53207
Stopping. Best iteration:
[174]	validation-aft-nloglik:3.53004

Explainability through prototypes

xgbse also provides explainability through prototypes, searching the embedding for neighbors. The idea is to explain model predictions with real samples, providing solid ground to justify them (see [8]). The method .get_neighbors() searches for the n_neighbors nearest neighbors in index_data for each sample in query_data:

neighbors = xgb_model.get_neighbors(
    query_data=X_valid,
    index_data=X_train,
    n_neighbors=10
)
neighbors.head(5)
indexneighbor_1neighbor_2neighbor_3neighbor_4neighbor_5neighbor_6neighbor_7neighbor_8neighbor_9neighbor_10
1225115115131200146215452128411271895257
111189710901743122489216951624154614184
5549627125714601031157515574401236858
52672610421771640242152923418003991431
13132051738599954169417151651828541992

This way, we can inspect neighbors of a given sample to try to explain predictions. For instance, we can choose a reference and check that its neighbors actually are very similar as a sanity check:

i = 0

reference = X_valid.iloc[i]
reference.name = 'reference'
train_neighs = X_train.loc[neighbors.iloc[i]]

pd.concat([reference.to_frame().T, train_neighs])
indexx0x1x2x3x4x5x6x7x8
reference5.75.7115.6110186
11515.85.9115.5110182
15135.55.5115.6110179
12005.76115.6110176
1465.95.9115.5010175
2155.85.5115.4110178
4525.75.7125.5000176
12845.66.2115.6100179
11275.55.1115.5110186
18955.55.4105.5110185
2575.769.65.6110176

We also can compare the Kaplan-Meier curve estimated from the neighbors to the actual model prediction, checking that it is inside the confidence interval:

from xgbse.non_parametric import calculate_kaplan_vectorized

mean, high, low = calculate_kaplan_vectorized(
    np.array([y['c2'][neighbors.iloc[i]]]),
    np.array([y['c1'][neighbors.iloc[i]]]),
    TIME_BINS
)

model_surv = xgb_model.predict(X_valid)

plt.figure(figsize=(12,4), dpi=120)
plt.plot(model_surv.columns, model_surv.iloc[i])
plt.plot(mean.columns, mean.iloc[0])
plt.fill_between(mean.columns, low.iloc[0], high.iloc[0], alpha=0.1, color='red')

Specifically, for XBGSEKaplanNeighbors prototype predictions and model predictions should match exactly if n_neighbors is the same and query_data is equal to the training data.

Metrics

We made our own metrics submodule to make the lib self-contained. xgbse.metrics implements C-index, Brier Score and D-Calibration from [9], including adaptations to deal with censoring:

# training model
xgbse_model = XGBSEKaplanNeighbors(PARAMS_XGB_AFT, n_neighbors=30)

xgbse_model.fit(
    X_train, y_train,
    validation_data = (X_valid, y_valid),
    early_stopping_rounds=10,
    time_bins=TIME_BINS
)

# predicting
preds = xgbse_model.predict(X_valid)

# importing metrics
from xgbse.metrics import (
    concordance_index,
    approx_brier_score,
    dist_calibration_score
)

# running metrics
print(f'C-index: {concordance_index(y_valid, preds)}')
print(f'Avg. Brier Score: {approx_brier_score(y_valid, preds)}')
print(f"""D-Calibration: {dist_calibration_score(y_valid, preds) > 0.05}""")
C-index: 0.6495863029409356
Avg. Brier Score: 0.1704190044350422
D-Calibration: True

As metrics follow the score_func(y, y_pred, **kwargs) pattern, we can use the sklearn model selection module easily:

from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer

xgbse_model = XGBSEKaplanTree(PARAMS_TREE)
results = cross_val_score(xgbse_model, X, y, scoring=make_scorer(approx_brier_score))
results
array([0.17432953, 0.15907712, 0.13783666, 0.16770409, 0.16792016])

If you want to dive deeper, please check our docs and examples!

References

[1] Practical Lessons from Predicting Clicks on Ads at Facebook: paper that shows how stacking boosting models with logistic regression improves performance and calibration

[2] Feature transformations with ensembles of trees: scikit-learn post showing tree ensembles as feature transformers

[3] Calibration of probabilities for tree-based models: blog post showing a practical example of tree ensemble probability calibration with a logistic regression

[4] Supervised dimensionality reduction and clustering at scale with RFs with UMAP: blog post showing how forests of decision trees act as noise filters, reducing intrinsic dimension of the dataset.

[5] Learning Patient-Specific Cancer Survival Distributions as a Sequence of Dependent Regressors: inspiration for the BCE method (multi-task logistic regression)

[6] The Brier Score under Administrative Censoring: Problems and Solutions: reference to BCE (binary cross-entropy survival method).

[7] The Greenwood and Exponential Greenwood Confidence Intervals in Survival Analysis: reference we used for the Exponential Greenwood formula from KM confidence intervals

[8] Tree Space Prototypes: Another Look at Making Tree Ensembles Interpretable: paper showing a very similar method for extracting prototypes

[9] Effective Ways to Build and Evaluate Individual Survival Distributions: paper showing how to validate survival analysis models with different metrics

Citing xgbse

To cite this repository:

@software{xgbse2020github,
  author = {Davi Vieira and Gabriel Gimenez and Guilherme Marmerola and Vitor Estima},
  title = {XGBoost Survival Embeddings: improving statistical properties of XGBoost survival analysis implementation},
  url = {http://github.com/loft-br/xgboost-survival-embeddings},
  version = {0.3.3},
  year = {2021},
}

FAQs


Did you know?

Socket

Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.

Install

Related posts

SocketSocket SOC 2 Logo

Product

  • Package Alerts
  • Integrations
  • Docs
  • Pricing
  • FAQ
  • Roadmap
  • Changelog

Packages

npm

Stay in touch

Get open source security insights delivered straight into your inbox.


  • Terms
  • Privacy
  • Security

Made with ⚡️ by Socket Inc