November 25, 2019

Can you run an A/B test with unequal sample sizes?

I got an interesting question from a PM this week, asking if we could run an experiment with a traffic allocation of 10% to control and 90% to the variation, rather than a traditional 50–50 split. Most sample size calculators—including our own internal one—assume an equal split between two or more variations, so I had to take a step back to answer this question.

TL;DR

You can run an experiment with an unequal allocation (e.g. 10–90) as long as you don’t modify the allocation while the experiment is running. However, it will be less efficient than a 50–50 allocation—either your test will have less power, or you will need to run it longer to achieve a comparable result.

Do unequal sample sizes bias results?

We want our A/B test results to be an unbiased estimate of the true effect. To achieve this, we rely on randomized assignment to “spread out” the influence of confounding factors equally across variations, so that they do not distort our comparison of the relative difference, or uplift, between variations. Even if the proportion of users assigned to each variation is unequal, randomized assignment still works—as long as we don’t change the traffic split. You should never modify the traffic allocation mid-experiment, because this can introduce temporal bias into your results. 1
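A quick simulation makes it easy to convince yourself that randomization protects against confounding even with an unequal split. The sketch below is Python rather than R, and the numbers—a “weekend visitor” confounder and the conversion rates—are made up purely for illustration: users are assigned 10–90 at random, and the estimated uplift still lands close to the true +1pp.

```python
import random

random.seed(42)

N = 200_000
control, treatment = [], []

for _ in range(N):
    # confounder: weekend visitors convert better regardless of variation
    weekend = random.random() < 0.3
    base_rate = 0.20 if weekend else 0.13
    # randomized but unequal assignment: 10% control / 90% treatment
    in_treatment = random.random() < 0.9
    rate = base_rate + (0.01 if in_treatment else 0.0)  # true uplift = +1pp
    converted = random.random() < rate
    (treatment if in_treatment else control).append(converted)

uplift = sum(treatment) / len(treatment) - sum(control) / len(control)
print(f"estimated uplift: {uplift:+.4f}")  # close to the true +0.0100
```

Because assignment is random, the weekend confounder is spread proportionally across both groups, so it cancels out of the difference in conversion rates.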

Are unequal sample sizes efficient?

So it is possible to run an experiment with a non-50–50 split, but is it advisable? If our goal is to achieve some predetermined risk profile as quickly as possible, then probably not.

Suppose we have a 15% conversion rate, and are designing an experiment to detect a 1% absolute increase with 90% power and 90% confidence. Let’s use the pwr R library below, because it supports unequal sample sizes. 2

library(pwr)

n1 = 25000  # users in control
n2 = 25000  # users in variation
p1 = 0.15   # baseline conversion rate
p2 = 0.16   # target conversion rate (+1pp absolute)

# Cohen's h: effect size via the arcsine transformation
h = abs(2*asin(sqrt(p1))-2*asin(sqrt(p2)))

pwr.2p2n.test(h, n1=n1, n2=n2, sig.level=0.10)
             n1 = 25000
             n2 = 25000
      sig.level = 0.1
          power = 0.9257466
    alternative = two.sided

So with a 50–50 split, you need to run the experiment on 50k total users—25k per variation—to get the desired result. What happens if we use a 10–90 split instead?

n1 = 5000
n2 = 45000

pwr.2p2n.test(h, n1=n1, n2=n2, sig.level=0.10)
             n1 = 5000
             n2 = 45000
      sig.level = 0.1
          power = 0.5829899
    alternative = two.sided

Uh-oh! The power of your experiment—its ability to detect a true effect—falls to under 60%. Let’s scale up our total sample size to find the point at which we achieve similar power to our initial plan.

n1 = 5000 * 2.8
n2 = 45000 * 2.8

pwr.2p2n.test(h, n1=n1, n2=n2, sig.level=0.10)
             n1 = 14000
             n2 = 126000
      sig.level = 0.1
          power = 0.9274638
    alternative = two.sided

So a 10–90 allocation would require 2.8x as many total users to reach a similar outcome as a 50–50 split. We can understand why by looking at the formula for the standard error of the difference between two binomial proportions, which determines the width of our confidence intervals:

$$ SE_{\Delta} = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}} $$

A lower standard error means greater certainty. The overall term decreases whenever we collect samples in either variation, increasing either $ n_1 $ or $ n_2 $. But there are diminishing returns as $ n_i $ increases. Suppose we’ve already collected 1000 samples in variation A, but only 100 samples in variation B. Collecting an additional 100 samples in A will only shrink its term under the square root by about 10%, whereas an additional 100 samples in B would cut its term in half.
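The diminishing-returns arithmetic is easy to check numerically. A small Python sketch, using the 15% conversion rate and the illustrative 1000/100 sample counts from above:

```python
from math import sqrt

def se_delta(p_a, n_a, p_b, n_b):
    """Standard error of the difference between two binomial proportions."""
    return sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)

p = 0.15  # assume both variations convert at roughly the baseline rate

base = se_delta(p, 1000, p, 100)    # 1000 samples in A, 100 in B
more_a = se_delta(p, 1100, p, 100)  # +100 to the large group: term shrinks ~9%
more_b = se_delta(p, 1000, p, 200)  # +100 to the small group: term cut in half

print(base, more_a, more_b)  # more_b is far smaller than more_a
```

Spending the same 100 extra samples on the smaller group buys a much larger reduction in the standard error, which is why lopsided allocations waste traffic.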

When do unequal sample sizes make sense?

If you look closely at the R outputs above, you’ll notice that while the total number of users required is 2.8x higher, the number of users assigned to the control group (n1) is actually lower—14k vs. 25k. So if we have a very strong prior belief in our change—but still want to perform some perfunctory experimentation—an unequal sample size could make sense. But it’s a double-edged sword: if your change is worse than baseline, you will ultimately have exposed more users to the change than necessary to reach a conclusive result. Probably best to keep it 50–50, since a typical A/B test design involves enough factors to consider already.
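As a final sanity check, the pwr results above can be reproduced without R. pwr’s two-proportion tests are built on the normal approximation with Cohen’s h, using n1·n2/(n1+n2) as the effective sample size for unequal groups—treat the sketch below as a Python reconstruction of that approach rather than a drop-in replacement:

```python
from math import asin, sqrt, erf

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def power_2p2n(p1, p2, n1, n2, z_crit=1.6448536):
    """Approximate power of a two-sided two-proportion test.

    z_crit defaults to qnorm(0.95), i.e. sig.level = 0.10, two-sided.
    """
    # Cohen's h: difference of arcsine-transformed proportions
    h = abs(2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2)))
    # effective sample size for unequal groups: n1*n2 / (n1 + n2)
    z = h * sqrt(n1 * n2 / (n1 + n2))
    return norm_cdf(z - z_crit) + norm_cdf(-z - z_crit)

print(power_2p2n(0.15, 0.16, 25000, 25000))   # ≈ 0.9257 (50–50)
print(power_2p2n(0.15, 0.16, 5000, 45000))    # ≈ 0.5830 (10–90)
print(power_2p2n(0.15, 0.16, 14000, 126000))  # ≈ 0.9275 (10–90, 2.8x users)

# The 2.8x multiplier is just the ratio of effective sample sizes:
# (25000*25000/50000) / (5000*45000/50000) = 12500 / 4500 ≈ 2.78
```

The effective-sample-size ratio of 12500/4500 ≈ 2.78 is where the 2.8x figure comes from: a 10–90 split throws away most of each extra user’s information by adding it to the already-large group.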

© Geoff Ruddock 2019