Why use a Bayesian approach to A/B testing?
The Bayesian and null hypothesis testing (sometimes called frequentist) schools represent very different philosophies in statistics. For some background reading, these are good places to start:
- What is the difference between Bayesian and frequentist statistics?
- What is the difference between Bayesian and frequentist statisticians?
- Bayesian and frequentist reasoning in plain English
Until June 2013, Swrve used a null hypothesis approach, which is the more classical approach to running experiments like A/B tests. Null hypothesis testing (NHT) sets out with a hard assumption about the population being tested and the distribution of the quantities being measured. After running the test, the NHT approach is to determine how probable it would be to observe results as extreme as those seen during the test if that assumption were true. This probability is captured by the p-value. In an A/B test, we seek to reject the null hypothesis by asserting, with 95% confidence (a p-value of 0.05 or less), that we would not expect to see a difference between the A and B variants as extreme as that observed if the variants were in fact the same. If we can’t say this, we fail to reject the null hypothesis that the variants are actually the same.
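To make that decision rule concrete, here is a minimal sketch (purely illustrative, not Swrve’s implementation) of a two-sided, two-proportion z-test; the conversion counts are hypothetical:

```python
import math
from scipy.stats import norm

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)              # pooled rate under the null
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * norm.sf(abs(z))                            # P(difference at least this extreme | null)

# Hypothetical data: 200/10,000 conversions on A, 240/10,000 on B.
p = two_proportion_p_value(200, 10_000, 240, 10_000)
print(f"p-value = {p:.3f} -> declare a winner only if p <= 0.05")
```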
There are a number of challenges with this approach, as outlined below:
- An assumed distribution – we start out with a very specific view that the populations will be distributed in a particular way. During the test we maintain this view, regardless of what happens, and only assess at the end of the test whether that view is consistent with the data we’ve observed. In other words, we can’t update our assumptions about the population as we see data come in.
- Fixed duration – in reality, an NHT should be run for a predetermined length of time (or number of conversions), and only at the end of this period should the results be observed. In practice, this means you shouldn’t look at the results along the way. If you do, you increase the likelihood of a false positive, and the more often you peek, the higher that chance becomes (the simulation after this list illustrates the effect). You can account for this by lowering the required p-value threshold before declaring a winner but, when the results are observed repeatedly, this becomes difficult.
- Family-wise error – a larger number of variants also gives rise to an increased rate of false positives. In general, if we’re prepared to accept a 5% chance of observing an extreme difference between a variant and the control when there is in fact no difference (remember, we declare a winner when we’re 95% confident), then as we add more variants we increase the chance of making a false declaration (see the second sketch after this list). Again, we can correct for this by decreasing the required p-value using a correction factor (such as the Bonferroni or Dunnett corrections), but in general these tend to be conservative, resulting in longer test durations.
- No measure of the magnitude of difference – with NHT we are only ever concerned with accepting or rejecting the null hypothesis. At no point can we assess how different two variants are; we can only determine how likely it is that the variants are the same and yet exhibit differences as extreme as those observed. With this approach we’re examining only how likely or unlikely it is to observe the data that we’ve observed.
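The peeking problem can be demonstrated with a small simulation. The sketch below (illustrative only; the traffic level, conversion rate, and number of peeks are assumptions) repeatedly runs A/A tests in which both variants are identical, checks the p-value at 20 interim points, and stops as soon as any peek crosses the 5% threshold. The resulting false positive rate comes out well above the nominal 5%.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def peeking_false_positive_rate(n_tests=2_000, n_users=10_000, checks=20, alpha=0.05):
    """Fraction of A/A tests (no real difference) that falsely declare a winner
    when the p-value is checked at `checks` interim points."""
    checkpoints = np.linspace(n_users // checks, n_users, checks, dtype=int)
    false_positives = 0
    for _ in range(n_tests):
        a = rng.binomial(1, 0.05, n_users).cumsum()   # both variants convert at 5%
        b = rng.binomial(1, 0.05, n_users).cumsum()
        for n in checkpoints:
            p_a, p_b = a[n - 1] / n, b[n - 1] / n
            pool = (a[n - 1] + b[n - 1]) / (2 * n)
            se = np.sqrt(pool * (1 - pool) * 2 / n)
            if se > 0 and 2 * norm.sf(abs(p_b - p_a) / se) <= alpha:
                false_positives += 1
                break
    return false_positives / n_tests

print(f"False positive rate with 20 peeks: {peeking_false_positive_rate():.1%}")
```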
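Similarly, the family-wise error rate takes only a few lines of arithmetic to see. Assuming independent comparisons at a 5% significance level, the chance of at least one false positive grows quickly with the number of variants, and the Bonferroni correction compensates by tightening the per-comparison threshold:

```python
# Probability of at least one false positive when comparing k variants against
# the control, each at significance level alpha (independence assumed).
def family_wise_error(k, alpha=0.05):
    return 1 - (1 - alpha) ** k

for k in (1, 3, 5, 10):
    fwe = family_wise_error(k)
    bonferroni = 0.05 / k  # Bonferroni-corrected per-comparison threshold
    print(f"{k:>2} variants: family-wise error ~ {fwe:.1%}, "
          f"Bonferroni threshold = {bonferroni:.4f}")
```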
After June 2013, Swrve deployed a Bayesian testing model. With Bayesian inference, we can deal with all of these issues in one consistent mathematical treatment. When starting a test, we declare our assumption about the population being tested (we define a prior distribution), which can incorporate information we already know about how populations behave (for example, normally less than 5% of people spend money in a mobile app). We can just as easily state that we know nothing about the population and choose an uninformative prior (a flat, constant distribution).
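As a sketch of what declaring a prior can look like for a conversion-rate test (the Beta distribution and the specific parameters here are illustrative assumptions, not necessarily the model Swrve uses):

```python
from scipy.stats import beta

# A "know nothing" prior over the conversion rate: the flat Beta(1, 1).
flat_prior = beta(1, 1)

# A weakly informative prior encoding "conversion rates are usually small":
# Beta(1, 19) has a mean of 5% but is quickly overwhelmed by real data.
informative_prior = beta(1, 19)

print(f"Flat prior mean:        {flat_prior.mean():.1%}")         # 50.0%
print(f"Informative prior mean: {informative_prior.mean():.1%}")  # 5.0%
```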
As we observe results during the test, we update our model to determine a new one (the posterior distribution), which captures our belief about the population based on the data we’ve observed so far. At any point in time we can use this model to determine whether our observations support declaring a winner, or whether there is still not enough evidence to make a call. At each point in the experiment we can compute the probability of beating the control, or of beating all other variants, and these quantities behave as probabilities should (each lies between 0% and 100%, the probabilities of being best sum to 100%, and so forth).
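A minimal sketch of how such probabilities can be computed, assuming Beta posteriors for conversion rates and hypothetical observed counts (the Monte Carlo approach shown here is one common way to do it, not necessarily Swrve’s exact method):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical observations so far: (conversions, users exposed) per variant.
observed = {
    "control":   (210, 10_000),
    "variant_A": (250, 10_000),
    "variant_B": (230, 10_000),
}

# Under a Beta(1, 1) prior, the posterior for each conversion rate is
# Beta(1 + conversions, 1 + non-conversions); draw samples from each posterior.
samples = {
    name: rng.beta(1 + conv, 1 + (n - conv), size=100_000)
    for name, (conv, n) in observed.items()
}

# Probability that each variant beats the control...
for name in ("variant_A", "variant_B"):
    p_beat = (samples[name] > samples["control"]).mean()
    print(f"P({name} beats control) = {p_beat:.1%}")

# ...and the probability that each arm is the best overall (these sum to 100%).
stacked = np.stack(list(samples.values()))
best = np.bincount(stacked.argmax(axis=0), minlength=len(samples))
for name, count in zip(samples, best):
    print(f"P({name} is best) = {count / stacked.shape[1]:.1%}")
```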
Multiple variants in the test are handled automatically (though they do increase the complexity of the underlying calculations), and it’s possible to observe the test at any point and reason about the behavior of the test and each of the variants over time. Multiple variants only increase the time it takes to run a test if those variants are close. With null hypothesis testing we have no easy way of measuring closeness, so adding variants to a test always increases the time needed to run it.
We can also easily extend the experimentation framework to implement different sorts of tests (such as tests that are based on counts of events rather than conversions, or tests based on revenue generated by a variant).
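For example, a count-based test might model events per user as Poisson with a conjugate Gamma prior; the sketch below is one plausible way to do this (the model choice, prior parameters, and data are assumptions made for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

# Events per user modelled as Poisson(lambda) with a Gamma(alpha, beta) prior
# on lambda; the posterior after seeing `events` across `n` users is
# Gamma(alpha + events, beta + n).
prior_alpha, prior_beta = 1.0, 1.0
observed = {"control": (4_800, 10_000), "variant_A": (5_200, 10_000)}

samples = {
    name: rng.gamma(prior_alpha + events, 1.0 / (prior_beta + n), size=100_000)
    for name, (events, n) in observed.items()
}
p_beat = (samples["variant_A"] > samples["control"]).mean()
print(f"P(variant_A drives more events per user than control) = {p_beat:.1%}")
```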
Finally, as the tests are continuously observable, we are in a position to automate them and to add features like automatically stopping badly performing variants, or directing new users to better performing variants over time (sometimes called explore-exploit schemes or multi-armed bandit systems).
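One common way to implement such an explore-exploit scheme is Thompson sampling, sketched below (the variant names and conversion rates are hypothetical, and this illustrates the general idea rather than Swrve’s system): each new user is assigned to the arm whose sampled conversion rate is highest, so better performing variants gradually receive more traffic.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical true conversion rates (unknown to the algorithm).
true_rates = {"control": 0.040, "variant_A": 0.050, "variant_B": 0.045}
names = list(true_rates)
successes = np.zeros(len(names))
failures = np.zeros(len(names))

# Thompson sampling: for each new user, draw a conversion rate from every arm's
# Beta posterior and direct the user to the arm with the highest draw.
for _ in range(50_000):
    draws = rng.beta(1 + successes, 1 + failures)
    arm = int(draws.argmax())
    converted = rng.random() < true_rates[names[arm]]
    successes[arm] += converted
    failures[arm] += not converted

for name, s, f in zip(names, successes, failures):
    print(f"{name}: {int(s + f)} users, observed rate {s / max(s + f, 1):.2%}")
```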