Timeseriesflattener
Time series from sources such as electronic health records often contain many variables, are sampled at irregular intervals, and have many missing values. Before this type of data can be used for predictive modelling with machine learning methods such as logistic regression or XGBoost, it needs to be reshaped.
In essence, the time series need to be flattened so that each prediction time is represented by a set of predictor values and an outcome value. These predictor values can be constructed by aggregating the preceding values in the time series within a certain time window.
timeseriesflattener aims to simplify this process by providing an easy-to-use and fully specified pipeline for flattening complex time series.
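To make the idea of flattening concrete, here is a minimal sketch in plain Python (standard library only) that builds one predictor value per prediction time by averaging the values observed in a preceding window. It does not use timeseriesflattener and is not its implementation: the function name `flatten_lookbehind`, the toy data, and the window boundaries (start inclusive, prediction time exclusive) are assumptions made for illustration.

```python
import datetime as dt
import math
import statistics

# Hypothetical toy data: (entity id, prediction time).
prediction_times = [
    (1, dt.datetime(2020, 1, 1)),
    (1, dt.datetime(2020, 2, 1)),
    (2, dt.datetime(2020, 2, 1)),
]
# Hypothetical toy data: (entity id, timestamp, value), irregularly sampled.
predictor_values = [
    (1, dt.datetime(2020, 1, 15), 1.0),
    (1, dt.datetime(2019, 12, 10), 2.0),
    (1, dt.datetime(2019, 12, 15), 3.0),
    (2, dt.datetime(2020, 1, 2), 4.0),
]


def flatten_lookbehind(prediction_times, values, lookbehind, aggregate, fallback=math.nan):
    """Return one (id, prediction time, aggregated value) row per prediction time.

    Assumption of this sketch: a value is in the window when
    prediction_time - lookbehind <= timestamp < prediction_time.
    """
    rows = []
    for entity_id, pred_time in prediction_times:
        window_start = pred_time - lookbehind
        in_window = [
            value
            for value_id, timestamp, value in values
            if value_id == entity_id and window_start <= timestamp < pred_time
        ]
        # Use the fallback when the entity has no values inside the window.
        rows.append((entity_id, pred_time, aggregate(in_window) if in_window else fallback))
    return rows


# 30-day window, mean aggregation: every prediction time has at least one value.
flat = flatten_lookbehind(
    prediction_times, predictor_values, dt.timedelta(days=30), statistics.mean
)
# 1-day window: no values fall inside it, so every row receives the fallback.
short = flatten_lookbehind(
    prediction_times, predictor_values, dt.timedelta(days=1), statistics.mean
)

for row in flat:
    print(row)
```

Each prediction time ends up as a single row regardless of how many raw observations fall inside its window, which is the shape most tabular learners expect. An outcome column can be built the same way by looking ahead from the prediction time instead of behind it.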
🔧 Installation
To get started, install timeseriesflattener with pip by running the following line in your terminal:
pip install timeseriesflattener
⚡ Quick start
import datetime as dt

import numpy as np
import polars as pl

# Times at which a prediction should be made, one row per (id, date).
prediction_times_df = pl.DataFrame(
    {"id": [1, 1, 2], "date": ["2020-01-01", "2020-02-01", "2020-02-01"]}
)

# Raw, irregularly sampled predictor values.
predictor_df = pl.DataFrame(
    {
        "id": [1, 1, 1, 2],
        "date": ["2020-01-15", "2019-12-10", "2019-12-15", "2020-01-02"],
        "predictor_value": [1, 2, 3, 4],
    }
)

# Outcome events.
outcome_df = pl.DataFrame({"id": [1], "date": ["2020-03-01"], "outcome_value": [1]})

from timeseriesflattener import (
    MaxAggregator,
    MinAggregator,
    OutcomeSpec,
    PredictionTimeFrame,
    PredictorSpec,
    ValueFrame,
)

# Predictors: aggregate values in a window before each prediction time.
predictor_spec = PredictorSpec(
    value_frame=ValueFrame(
        init_df=predictor_df, entity_id_col_name="id", value_timestamp_col_name="date"
    ),
    lookbehind_distances=[dt.timedelta(days=1)],
    aggregators=[MaxAggregator(), MinAggregator()],
    fallback=np.nan,
    column_prefix="pred",
)

# Outcomes: aggregate values in a window after each prediction time.
outcome_spec = OutcomeSpec(
    value_frame=ValueFrame(
        init_df=outcome_df, entity_id_col_name="id", value_timestamp_col_name="date"
    ),
    lookahead_distances=[dt.timedelta(days=1)],
    aggregators=[MaxAggregator(), MinAggregator()],
    fallback=np.nan,
    column_prefix="outc",
)

from timeseriesflattener import Flattener

# Flatten: one row per prediction time, one column per spec, window, and aggregator.
result = Flattener(
    predictiontime_frame=PredictionTimeFrame(
        init_df=prediction_times_df, entity_id_col_name="id", timestamp_col_name="date"
    )
).aggregate_timeseries(specs=[predictor_spec, outcome_spec])

result.df
Output:
|   | id | date | prediction_time_uuid | pred_test_feature_within_30_days_mean_fallback_nan | outc_test_outcome_within_31_days_maximum_fallback_0_dichotomous |
|---|---|---|---|---|---|
| 0 | 1 | 2020-01-01 00:00:00 | 1-2020-01-01-00-00-00 | 2.5 | 0 |
| 1 | 1 | 2020-02-01 00:00:00 | 1-2020-02-01-00-00-00 | 1 | 1 |
| 2 | 2 | 2020-02-01 00:00:00 | 2-2020-02-01-00-00-00 | 4 | 0 |
📖 Tutorial
💬 Where to ask questions
🎓 Projects
PSYCOP projects use timeseriesflattener; see more at the monorepo.