Bayesian A/B testing
bayesian_testing
is a small package for a quick evaluation of A/B (or A/B/C/...) tests using
Bayesian approach.
Implemented tests:
- BinaryDataTest
- Input data - binary data (
[0, 1, 0, ...]
) - Designed for conversion-like data A/B testing.
- NormalDataTest
- Input data - normal data with unknown variance
- Designed for normal data A/B testing.
- DeltaLognormalDataTest
- Input data - lognormal data with zeros
- Designed for revenue-like data A/B testing.
- DeltaNormalDataTest
- Input data - normal data with zeros
- Designed for profit-like data A/B testing.
- DiscreteDataTest
- Input data - categorical data with numerical categories
- Designed for discrete data A/B testing (e.g. dice rolls, star ratings, 1-10 ratings, etc.).
- PoissonDataTest
- Input data - non-negative integers (
[1, 0, 3, ...]
) - Designed for poisson data A/B testing.
- ExponentialDataTest
- Input data - exponential data (non-negative real numbers)
- Designed for exponential data A/B testing (e.g. session/waiting time, time between events,
etc.).
Implemented evaluation metrics:
Posterior Mean
- Expected value from the posterior distribution for a given variant.
Credible Interval
- Quantile-based credible intervals based on simulations from posterior distributions (i.e.
empirical).
- Interval probability (
interval_alpha
) can be set during the evaluation (default value is 95%).
Probability of Being Best
- Probability that a given variant is best among all variants.
- By default,
the best
is equivalent to the greatest
(from a data/metric point of view),
however it is possible to change this by using min_is_best=True
in the evaluation method
(this can be useful if we try to find the variant with the smallest tested measure).
Expected Loss
- "Risk" of choosing particular variant over other variants in the test.
- Measured in same units as a tested measure (e.g. positive rate or average value).
Credible Interval
, Probability of Being Best
and Expected Loss
are calculated using
simulations from posterior distributions (considering given data).
Installation
bayesian_testing
can be installed using pip:
pip install bayesian_testing
Alternatively, you can clone the repository and use poetry
manually:
cd bayesian_testing
pip install poetry
poetry install
poetry shell
Basic Usage
The primary features are classes:
BinaryDataTest
NormalDataTest
DeltaLognormalDataTest
DeltaNormalDataTest
DiscreteDataTest
PoissonDataTest
ExponentialDataTest
All test classes support two methods to insert the data:
add_variant_data
- Adding raw data for a variant as a list of observations (or numpy 1-D array).add_variant_data_agg
- Adding aggregated variant data (this can be practical for a large data,
as the aggregation can be done already on a database level).
Both methods for adding data allow specification of prior distributions
(see details in respective docstrings). Default prior setup should be sufficient for most of the
cases (e.g. cases with unknown priors or large amounts of data).
To get the results of the test, simply call the method evaluate
.
Probability of being best, expected loss and credible intervals are approximated using simulations,
hence the evaluate
method can return slightly different values for different runs. To stabilize
it, you can set the sim_count
parameter of the evaluate
to a higher value (default value is
20K), or even use the seed
parameter to fix it completely.
BinaryDataTest
Class for a Bayesian A/B test for the binary-like data (e.g. conversions, successes, etc.).
Example:
import numpy as np
from bayesian_testing.experiments import BinaryDataTest
rng = np.random.default_rng(52)
data_a = rng.binomial(n=1, p=0.052, size=1500)
data_b = rng.binomial(n=1, p=0.067, size=1200)
test = BinaryDataTest()
test.add_variant_data("A", data_a)
test.add_variant_data("B", data_b)
test.add_variant_data_agg("C", totals=1000, positives=50)
results = test.evaluate()
results
+-------------------+-----------+-------------+-------------+
| | A | B | C |
+===================+===========+=============+=============+
| totals | 1500 | 1200 | 1000 |
+-------------------+-----------+-------------+-------------+
| positives | 80 | 80 | 50 |
+-------------------+-----------+-------------+-------------+
| positive_rate | 0.05333 | 0.06667 | 0.05 |
+-------------------+-----------+-------------+-------------+
| posterior_mean | 0.05363 | 0.06703 | 0.05045 |
+-------------------+-----------+-------------+-------------+
| credible_interval | [0.04284, | [0.0535309, | [0.0379814, |
| | 0.065501] | 0.0816476] | 0.0648625] |
+-------------------+-----------+-------------+-------------+
| prob_being_best | 0.06485 | 0.89295 | 0.0422 |
+-------------------+-----------+-------------+-------------+
| expected_loss | 0.0139248 | 0.0004693 | 0.0170767 |
+-------------------+-----------+-------------+-------------+
NormalDataTest
Class for a Bayesian A/B test for the normal data.
Example:
import numpy as np
from bayesian_testing.experiments import NormalDataTest
rng = np.random.default_rng(21)
data_a = rng.normal(7.2, 2, 1000)
data_b = rng.normal(7.1, 2, 800)
data_c = rng.normal(7.0, 4, 500)
test = NormalDataTest()
test.add_variant_data("A", data_a)
test.add_variant_data("B", data_b)
test.add_variant_data_agg("C", len(data_c), sum(data_c), sum(np.square(data_c)))
results = test.evaluate(sim_count=20000, seed=52, min_is_best=False, interval_alpha=0.99)
results
+-------------------+-------------+-------------+-------------+
| | A | B | C |
+===================+=============+=============+=============+
| totals | 1000 | 800 | 500 |
+-------------------+-------------+-------------+-------------+
| sum_values | 7294.67901 | 5685.86168 | 3736.91581 |
+-------------------+-------------+-------------+-------------+
| avg_values | 7.29468 | 7.10733 | 7.47383 |
+-------------------+-------------+-------------+-------------+
| posterior_mean | 7.29462 | 7.10725 | 7.4737 |
+-------------------+-------------+-------------+-------------+
| credible_interval | [7.1359436, | [6.9324733, | [7.0240102, |
| | 7.4528369] | 7.2779293] | 7.9379341] |
+-------------------+-------------+-------------+-------------+
| prob_being_best | 0.1707 | 0.00125 | 0.82805 |
+-------------------+-------------+-------------+-------------+
| expected_loss | 0.1968735 | 0.385112 | 0.0169998 |
+-------------------+-------------+-------------+-------------+
DeltaLognormalDataTest
Class for a Bayesian A/B test for the delta-lognormal data (log-normal with zeros).
Delta-lognormal data is typical case of revenue per session data where many sessions have 0 revenue
but non-zero values are positive values with possible log-normal distribution.
To handle this data, the calculation is combining binary Bayes model for zero vs non-zero
"conversions" and log-normal model for non-zero values.
Example:
import numpy as np
from bayesian_testing.experiments import DeltaLognormalDataTest
test = DeltaLognormalDataTest()
data_a = [7.1, 0.3, 5.9, 0, 1.3, 0.3, 0, 1.2, 0, 3.6, 0, 1.5,
2.2, 0, 4.9, 0, 0, 1.1, 0, 0, 7.1, 0, 6.9, 0]
data_b = [4.0, 0, 3.3, 19.3, 18.5, 0, 0, 0, 12.9, 0, 0, 0, 10.2,
0, 0, 23.1, 0, 3.7, 0, 0, 11.3, 10.0, 0, 18.3, 12.1]
test.add_variant_data("A", data_a)
test.add_variant_data_agg(
name="B",
totals=len(data_b),
positives=sum(x > 0 for x in data_b),
sum_values=sum(data_b),
sum_logs=sum([np.log(x) for x in data_b if x > 0]),
sum_logs_2=sum([np.square(np.log(x)) for x in data_b if x > 0])
)
results = test.evaluate(seed=21)
results
+---------------------+-------------+-------------+
| | A | B |
+=====================+=============+=============+
| totals | 24 | 25 |
+---------------------+-------------+-------------+
| positives | 13 | 12 |
+---------------------+-------------+-------------+
| sum_values | 43.4 | 146.7 |
+---------------------+-------------+-------------+
| avg_values | 1.80833 | 5.868 |
+---------------------+-------------+-------------+
| avg_positive_values | 3.33846 | 12.225 |
+---------------------+-------------+-------------+
| posterior_mean | 2.09766 | 6.19017 |
+---------------------+-------------+-------------+
| credible_interval | [0.9884509, | [3.3746212, |
| | 6.9054963] | 11.7349253] |
+---------------------+-------------+-------------+
| prob_being_best | 0.04815 | 0.95185 |
+---------------------+-------------+-------------+
| expected_loss | 4.0941101 | 0.1588627 |
+---------------------+-------------+-------------+
Note: Alternatively, DeltaNormalDataTest
can be used for a case when conversions are not
necessarily positive values.
DiscreteDataTest
Class for a Bayesian A/B test for the discrete data with finite number of numerical categories
(states), representing some value.
This test can be used for instance for dice rolls data (when looking for the "best" of multiple
dice) or rating data (e.g. 1-5 stars or 1-10 scale).
Example:
from bayesian_testing.experiments import DiscreteDataTest
data_a = [2, 5, 1, 4, 6, 2, 2, 6, 3, 2, 6, 3, 4, 6, 3, 1, 6, 3, 5, 6]
data_b = [1, 2, 2, 2, 2, 3, 2, 3, 4, 2]
data_c = [1, 3, 6, 5, 4]
test = DiscreteDataTest(states=[1, 2, 3, 4, 5, 6])
test.add_variant_data("A", data_a)
test.add_variant_data("B", data_b)
test.add_variant_data("C", data_c)
results = test.evaluate(sim_count=20000, seed=52, min_is_best=False, interval_alpha=0.95)
results
+-------------------+------------------+------------------+------------------+
| | A | B | C |
+===================+==================+==================+==================+
| concentration | {1: 2.0, 2: 4.0, | {1: 1.0, 2: 6.0, | {1: 1.0, 2: 0.0, |
| | 3: 4.0, 4: 2.0, | 3: 2.0, 4: 1.0, | 3: 1.0, 4: 1.0, |
| | 5: 2.0, 6: 6.0} | 5: 0.0, 6: 0.0} | 5: 1.0, 6: 1.0} |
+-------------------+------------------+------------------+------------------+
| average_value | 3.8 | 2.3 | 3.8 |
+-------------------+------------------+------------------+------------------+
| posterior_mean | 3.73077 | 2.75 | 3.63636 |
+-------------------+------------------+------------------+------------------+
| credible_interval | [3.0710797, | [2.1791584, | [2.6556465, |
| | 4.3888021] | 3.4589178] | 4.5784839] |
+-------------------+------------------+------------------+------------------+
| prob_being_best | 0.54685 | 0.008 | 0.44515 |
+-------------------+------------------+------------------+------------------+
| expected_loss | 0.199953 | 1.1826766 | 0.2870247 |
+-------------------+------------------+------------------+------------------+
PoissonDataTest
Class for a Bayesian A/B test for the poisson data.
Example:
from bayesian_testing.experiments import PoissonDataTest
psg_goals_against = [0, 2, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 3, 1, 0]
city_goals_against = [0, 0, 3, 2, 0, 1, 0, 3, 0, 1, 1, 0, 1, 2]
bayern_goals_against = [1, 0, 0, 1, 1, 2, 1, 0, 2, 0, 0, 2, 2, 1, 0]
test = PoissonDataTest()
test.add_variant_data('psg', psg_goals_against)
test.add_variant_data('city', city_goals_against, a_prior=3, b_prior=1)
test.add_variant_data_agg("bayern", len(bayern_goals_against), sum(bayern_goals_against))
results = test.evaluate(sim_count=20000, seed=52, min_is_best=True)
results
+-------------------+-------------+-------------+------------+
| | psg | city | bayern |
+===================+=============+=============+============+
| totals | 15 | 14 | 15 |
+-------------------+-------------+-------------+------------+
| sum_values | 9 | 14 | 13 |
+-------------------+-------------+-------------+------------+
| observed_average | 0.6 | 1.0 | 0.86667 |
+-------------------+-------------+-------------+------------+
| posterior_mean | 0.60265 | 1.13333 | 0.86755 |
+-------------------+-------------+-------------+------------+
| credible_interval | [0.2800848, | [0.6562029, | [0.465913, |
| | 1.0570327] | 1.7265045] | 1.3964389] |
+-------------------+-------------+-------------+------------+
| prob_being_best | 0.78175 | 0.0344 | 0.18385 |
+-------------------+-------------+-------------+------------+
| expected_loss | 0.0369998 | 0.5620553 | 0.3003345 |
+-------------------+-------------+-------------+------------+
note: Since we set min_is_best=True
(because received goals are "bad"), probability and loss are
in a favor of variants with lower posterior means.
ExponentialDataTest
Class for a Bayesian A/B test for the exponential data.
Example:
import numpy as np
from bayesian_testing.experiments import ExponentialDataTest
waiting_times_a = np.random.exponential(scale=10, size=200)
waiting_times_b = np.random.exponential(scale=11, size=210)
waiting_times_c = np.random.exponential(scale=11, size=220)
test = ExponentialDataTest()
test.add_variant_data('A', waiting_times_a)
test.add_variant_data('B', waiting_times_b)
test.add_variant_data('C', waiting_times_c)
results = test.evaluate(sim_count=20000, min_is_best=True)
results
+-------------------+-------------+-------------+-------------+
| | A | B | C |
+===================+=============+=============+=============+
| totals | 200 | 210 | 220 |
+-------------------+-------------+-------------+-------------+
| sum_values | 1827.81709 | 2217.46016 | 2160.73134 |
+-------------------+-------------+-------------+-------------+
| observed_average | 9.13909 | 10.55933 | 9.82151 |
+-------------------+-------------+-------------+-------------+
| posterior_mean | 9.13502 | 10.55478 | 9.8175 |
+-------------------+-------------+-------------+-------------+
| credible_interval | [7.994178, | [9.2543372, | [8.6184821, |
| | 10.5410967] | 12.1527256] | 11.2566538] |
+-------------------+-------------+-------------+-------------+
| prob_being_best | 0.7456 | 0.0405 | 0.2139 |
+-------------------+-------------+-------------+-------------+
| expected_loss | 0.1428729 | 1.5674747 | 0.8230728 |
+-------------------+-------------+-------------+-------------+
Development
To set up a development environment, use Poetry and pre-commit:
pip install poetry
poetry install
poetry run pre-commit install
To be implemented
Additional metrics:
Potential Value Remaining
References