How to properly analyze an A/B test

For most online businesses, it's really important to analyze user behavior in order to improve the product over time: e-commerce sites, video games, blogs, video channels... everyone does it (or at least should). One of the best-known techniques is the A/B test. As the name implies, two versions (A and B) are compared, which are identical except for one variation that might affect a user's behavior. Version A might be the currently used version (control), while version B is modified in some respect (treatment). In this article we're going to talk about how hypothesis testing can tell you whether your A/B tests actually affect user behavior, or whether the variations you see are due to random chance.

This article won't explain how to set up an A/B test, nor the basic statistics behind it. Instead, we'll go deeper into how to analyze one properly, since I've seen a lot of companies run nice A/B tests to improve their product and then analyze them poorly. We'll focus on a simple and common A/B test in e-commerce and online marketing: comparing the conversion rates of different landing page designs to see which one performs better.

Landing Page Conversion Rate

Let's suppose you have a landing page with a signup form on it. We want to test different layouts or designs in order to maximize the percentage of people who sign up. This percentage is known in the online marketing world as the conversion rate: the rate at which you convert visitors into customers (or visitors into active users in a video game, for instance).

Let's suppose we split the users who come to our website (randomly or with some kind of criteria) into four groups. For the purpose of this experiment let's just call them Control, A, B and C, but we can call them whatever we want (for instance, naming them after the background color or the functionality that differs in each design).

Data collected

Our website starts attracting visitors continuously. We've analyzed our business model and realized that the most important thing to improve right now is the conversion rate of the landing page. We set ourselves a minimum acceptable conversion rate. Let's suppose we want a conversion rate of 22% or higher, so we'll only have to focus on the groups that reach at least this value.

We can create an A/B test with the four groups: Control, A, B and C. Let's suppose we have data like the following:

My Website Landing Page Test

Group         Visitors   Registers   Conversion Rate
Control       362        72          19.89%
Treatment A   369        90          24.39%
Treatment B   368        55          14.95%
Treatment C   371        120         32.35%
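As a quick check, the conversion rates in the table can be recomputed from the raw counts. This is only a minimal sketch: the `groups` dictionary and the `conversion_rate` helper are names made up for illustration, and the numbers are copied from the table.

```python
# Visitor and register counts copied from the table: (visitors, registers).
groups = {
    "Control": (362, 72),
    "Treatment A": (369, 90),
    "Treatment B": (368, 55),
    "Treatment C": (371, 120),
}


def conversion_rate(visitors, registers):
    """Fraction of visitors who ended up registering."""
    return registers / visitors


rates = {name: conversion_rate(v, r) for name, (v, r) in groups.items()}

for name, rate in rates.items():
    print(f"{name:<12} {rate:.2%}")
```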

From the previous data, we see that groups A and C both show a conversion rate higher than 22%, which was our initial goal. As group C has the highest conversion rate (CVR) of all the groups, we might think that it's good enough, choose it as the main design for our website and move on (in fact, I know several companies that do this...). But hey, we know a little statistics! How do we know the variation isn't due to random chance? What if, instead of around 360 visitors per group (which is already a decent number of visitors!), we only had 10 or 20? Would we still be so confident about which one is the best choice?

In statistics, it's impossible to say anything with 100% confidence (in fact, if we wanted to, it would take us infinite time, more than our lifetimes or even more than the Internet has been running). So it's common to settle for a 95% confidence level, which is usually good enough for our purposes.

Hypothesis testing is always about validating our confidence, so let's get to it.

Statistics come to our help!

When we do hypothesis testing, we need to start with a null hypothesis to check. In this case, the null hypothesis will be that the conversion rate of the control group is no less than the conversion rate of our experimental groups. Mathematically this is modelled by:

$$H_0: p_x - p_c \le 0$$

where $p_c$ is the conversion rate of our control group and $p_x$ is the conversion rate of one of our experiments (where x takes the values A, B or C).

The alternative hypothesis is therefore that the experimental page has a higher conversion rate than the control group.

This is what we want to see and quantify right now. In fact, we're aiming to reject our null hypothesis.

In this kind of experiment, the sampled conversion rates are all approximately normally distributed random variables, as long as each group has enough visitors. It's just like the classic coin-flipping example in statistics, except that instead of heads and tails as the possible outcomes, we have "converts" and "doesn't convert". The main task now is to see whether the experimental group deviates far enough from the control to give a valid result.
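To get a feeling for why the number of visitors matters, here is a minimal sketch of how wide that bell curve is. It assumes the usual normal approximation for a proportion, where the standard deviation of a sampled rate is sqrt(p(1 - p) / n); the `standard_error` helper is a hypothetical name, and the 20-visitor case is invented purely for comparison.

```python
from math import sqrt


def standard_error(rate, visitors):
    """Standard deviation of a sampled conversion rate (normal approximation)."""
    return sqrt(rate * (1 - rate) / visitors)


control_rate = 72 / 362  # control group from the table

se_full = standard_error(control_rate, 362)   # the sample we actually have
se_small = standard_error(control_rate, 20)   # hypothetical tiny sample

print(f"362 visitors: {control_rate:.2%} +/- {se_full:.2%}")
print(f" 20 visitors: {control_rate:.2%} +/- {se_small:.2%}")
```

With only 20 visitors the curve is roughly four times wider, so the same measured rate tells us much less.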

Here's an example representation of the distribution of the control conversion rate and the treatment conversion rate.

Comparison of normal distributions between control and treatment groups.

The peak of each curve is the conversion rate we measured, and the width of the curve tells us how spread out the data is. What we want is to measure the difference between the two conversion rates and see whether that difference is large enough. If it is, we'll be able to conclude that the treatment applied to our users really has affected (positively or negatively) their behavior.

In order to do this, let's define a new random variable and call it $X$:

$$X = p_x - p_c$$

that is, the treatment's conversion rate minus the control's, for each x among our treatments (A, B and C). Our null hypothesis, which we want to either reject or keep, then becomes $X \le 0$.

We can now use the same techniques as in our previous coin flip exercise, but with the random variable $X$, to draw conclusions from the data in our previous table. But since the events don't have a 50/50 chance of happening as in the coin flip experiment, we need to know the probability distribution of $X$.

There is a theorem that says that the sum (or difference) of two independent normally distributed random variables is itself normally distributed. You can read more about this in this Wikipedia article, where you'll also find the proof of this classical theorem of mathematics.
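As a sketch of what the theorem gives us here (the notation is an assumption, and the two groups are assumed to be independent): write $p_c$ and $p_x$ for the sampled conversion rates of the control and of a treatment, $\mu_c$ and $\mu_x$ for their true underlying rates, and $N_c$ and $N_x$ for the number of visitors in each group. Under the normal approximation, the difference would be distributed as:

```latex
p_x - p_c \;\sim\; \mathcal{N}\!\left(\mu_x - \mu_c,\; \sigma_x^2 + \sigma_c^2\right),
\qquad
\sigma_x^2 \approx \frac{p_x (1 - p_x)}{N_x},
\qquad
\sigma_c^2 \approx \frac{p_c (1 - p_c)}{N_c}
```

Note that the variances add up even though we are taking a difference of the two rates.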

So finally, this will give us a way to calculate a 95% confidence interval.

Z-scores and One-tailed Tests

We can mathematically define the z-score for $X$ as

$$z = \frac{p_x - p_c}{\sqrt{\dfrac{p_x (1 - p_x)}{N_x} + \dfrac{p_c (1 - p_c)}{N_c}}}$$

where $p_x$ and $p_c$ are the conversion rates of the treatment and the control, $N_c$ is the sample size of the control group and $N_x$ is the sample size of each of the treatment groups. This is because the mean of $X$ (the numerator of the ratio) is $p_x - p_c$, and the variance of its distribution (whose square root is the denominator) is the sum of the variances of $p_x$ and $p_c$.
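The z-score can be sketched in a few lines of Python. This is an illustration rather than a reference implementation: it assumes the z-score is the difference in conversion rates divided by the square root of the summed per-group variances p(1 - p) / N, and the `z_score` function and its tuple arguments are names chosen here. With the counts from the data table it gives values that round to the z-scores reported later in the article.

```python
from math import sqrt


def z_score(control, treatment):
    """z-score of the difference in conversion rates (treatment minus control).

    Each argument is a (visitors, registers) tuple.
    """
    n_c, registers_c = control
    n_x, registers_x = treatment
    p_c = registers_c / n_c
    p_x = registers_x / n_x
    # Variance of the difference: sum of the two per-group variances.
    variance = p_x * (1 - p_x) / n_x + p_c * (1 - p_c) / n_c
    return (p_x - p_c) / sqrt(variance)


control = (362, 72)
treatments = {"A": (369, 90), "B": (368, 55), "C": (371, 120)}

z_scores = {name: z_score(control, data) for name, data in treatments.items()}

for name, z in z_scores.items():
    print(f"Treatment {name}: z = {z:.2f}")
```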

When we dealt with the coin flip experiment, the 95% confidence interval corresponded to a z-score of 1.96. In this case, it'll be different. In the coin experiment we rejected the null hypothesis if the percentage of heads was either too high or too low. The null hypothesis in that case was probability = 0.5, but in our case the null hypothesis is that X is less than or equal to zero.

This means that the only tail of the normal distribution we need to care about is the positive one. This is the only area that lets us reject the null hypothesis (in favor of the control) and accept the analyzed treatment. Let's look at a few graphics that show exactly the difference between a two-sided confidence interval (like in the coin flip example) and a one-sided confidence interval (the one we're dealing with in our A/B test).

In the first case, the coin example, we have a two-sided confidence interval, so we have to reject the null hypothesis if the percentage of heads goes either too high or too low, causing its z-score to fall into one of the tails of the normal curve.

95% confidence interval over Normal curve with both tails colored

Each colored area covers 2.5% of the total area under the curve (together they make up 5% of the total area).

In the second case, our A/B testing example, we have a one-sided confidence interval, so we only reject the null hypothesis if the experimental conversion rate is significantly higher than the control conversion rate.

95% confidence interval over Normal curve with one tail colored

The colored area on the right is 5% of the total area under the curve.

In other words, we can reject the null hypothesis with 95% confidence if the z-score is higher than 1.65. Here is a table with the z-scores calculated using the formula above, so we have a better idea of the performance of each of the treatments:
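The cutoffs behind both pictures can be looked up with Python's standard library instead of a printed z-table. A minimal sketch, assuming Python 3.8 or later for `statistics.NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

# Two-sided: 2.5% in each tail, so the cutoff leaves 97.5% of the area below it.
two_sided_cutoff = standard_normal.inv_cdf(0.975)

# One-sided: the whole 5% sits in the right tail.
one_sided_cutoff = standard_normal.inv_cdf(0.95)

print(f"two-sided 95% cutoff: {two_sided_cutoff:.3f}")
print(f"one-sided 95% cutoff: {one_sided_cutoff:.3f}")
```

The one-sided value is about 1.645, which is usually rounded to 1.65.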

My Website Landing Page Test

Group         Visitors   Registers   Conversion Rate   z-score
Control       362        72          19.89%            -
Treatment A   369        90          24.39%            1.47
Treatment B   368        55          14.95%            -1.76
Treatment C   371        120         32.35%            3.88


From the previous table, we can finally draw some conclusions:

  • Treatment C has clearly outperformed our control, with the highest z-score of all (3.88), well above 1.65.
  • Treatment A's z-score (1.47) is below 1.65, so its improvement over the control is not statistically significant at the 95% level. It's also irrelevant at this point, given the performance of Treatment C.
  • Treatment B even has a negative z-score, so we can discard it without much trouble.

So we would finally pick the change made in Treatment C and move on to other A/B tests or new updates to keep improving our website.
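The decision rule can be sketched as follows. The z-scores are copied from the table; the `beats_control` and `one_tailed_p_value` helpers are hypothetical names, and the p-value (the chance of seeing a z-score at least this large if the true difference were zero) is an extra the article itself doesn't compute.

```python
from statistics import NormalDist

standard_normal = NormalDist()
CUTOFF = standard_normal.inv_cdf(0.95)  # about 1.645 for a one-tailed 95% test

# z-scores copied from the table.
z_scores = {"Treatment A": 1.47, "Treatment B": -1.76, "Treatment C": 3.88}


def beats_control(z, cutoff=CUTOFF):
    """True when the z-score falls in the right-hand 5% tail."""
    return z > cutoff


def one_tailed_p_value(z):
    """Area under the standard normal curve to the right of z."""
    return 1 - standard_normal.cdf(z)


winners = [name for name, z in z_scores.items() if beats_control(z)]

for name, z in z_scores.items():
    verdict = "significant" if beats_control(z) else "not significant"
    print(f"{name}: z = {z:+.2f}, p = {one_tailed_p_value(z):.4f} ({verdict})")
```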


To summarize what we've covered in this article:

  • The conversion rate for each treatment is an (approximately) normally distributed random variable.
  • We want to measure the difference in performance between a given treatment and the control group.
  • The difference itself is a normally distributed random variable.
  • Since we only care whether the difference is greater than zero, we only need a z-score of 1.65, which cuts off the top 5% in the positive tail of the normal curve.

Statistical significance is very important for A/B testing because it lets us know whether we've run the test for long enough. Most companies, and even entrepreneurs, want results on the same day the changes are applied, without losing a moment. But as long as you don't have enough data, it's useless and dangerous to draw a final conclusion from an A/B test. In fact, we can ask the inverse question: "How long do I need to run an experiment before I can be certain whether one of my treatments is more than 20% better than control? One hour? One day? One week?"

This becomes even more important in situations where money is on the line, because it lets you quantify risk and minimize the impact of potentially risky treatments.
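The "how long do I need to run it?" question can be turned into a rough visitor count. The sketch below uses a common approximation for comparing two proportions with a one-tailed test. Be aware of the assumptions: the `visitors_per_group` function is a made-up name, and the 80% `power` setting (the chance of detecting a lift that really exists) is a conventional default that this article does not discuss.

```python
from math import ceil
from statistics import NormalDist


def visitors_per_group(baseline_rate, relative_lift, confidence=0.95, power=0.80):
    """Approximate visitors needed in each group for a one-tailed test.

    `power` is an assumption of this sketch, not something fixed by the article.
    """
    target_rate = baseline_rate * (1 + relative_lift)
    z_confidence = NormalDist().inv_cdf(confidence)
    z_power = NormalDist().inv_cdf(power)
    # Sum of the per-visitor variances of the two conversion rates.
    variance_sum = (baseline_rate * (1 - baseline_rate)
                    + target_rate * (1 - target_rate))
    difference = target_rate - baseline_rate
    return ceil((z_confidence + z_power) ** 2 * variance_sum / difference ** 2)


# Control rate from the table, looking for a treatment that is 20% better.
needed = visitors_per_group(baseline_rate=72 / 362, relative_lift=0.20)
print(f"Roughly {needed} visitors per group")
```

Dividing that figure by the number of visitors each group receives per day gives a rough idea of how many days the test has to run.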

I'd like you to comment on this article and freely ask any questions you might have regarding A/B testing and other kinds of experiments in online marketing. It's not always easy to reach a conclusion as clear-cut as the one in this article, so don't get frustrated if none of the treatment groups has outperformed the control: maybe the test hasn't been running for long enough, maybe the changes are so subtle that they don't really change user behavior, or maybe there's been an issue in the implementation of the A/B test and people don't see any change at all (human errors exist!).

So don't worry, make mistakes, learn from them and trust statistics, because that's the only way we can get some clues about what happens in the virtual world of the Internet.

See you soon on my next article!

PS: I'm leaving you a spreadsheet with the example covered in this article, so you can play with the number of visits and registers on your website and focus on the results rather than the calculations 😉

AB testing sample