How to Evaluate A/B Tests (aka Controlled Experiments) for Causation and Correlation

In the previous blog post, we looked at the terms causation and correlation and developed a deeper understanding of what exactly each of those terms means and what the difference between causation and correlation is.

In this post, we’re going to continue from where we left off last time and dive into how we can approach different types of datasets and go about analyzing them correctly.

More specifically, in the next two blog posts, we’re going to learn how to approach the two different experiment types that we can run into:

  1. Controlled experiments
  2. Observational experiments

This blog post will concentrate on A/B tests (aka controlled experiments).

As you will see, evaluating these experiments is usually not as simple as calculating a correlation and a p-value and calling it a day.

At the end of this post, you’ll have a good understanding of things you need to watch out for in controlled experiments.

Before we dive into the blog post, we’re going to cover some relevant key terms that you’ll need to know.

6 Key Testing Terms to Understand

To avoid confusion when talking about an experiment, we need to make sure we use the proper terms. This will help us make sure we’re considering everything correctly, otherwise, the results of our analysis could be meaningless, or worse, wrong.

When we’re doing experiments to understand causation, we first need to decide what we want to test. For example, an experiment could be to evaluate the effect of a change that we want to make to one of our landing pages.

1. Treatment

So what we could do is create two landing pages: one with the change we want to make, and another with no extra changes made to it.

Each of these landing pages would be called a Treatment.

2. Control

The landing page with the change could be Treatment A, and the landing page without any extra changes made to it would be called Treatment B.

In this case, Treatment B is also our control group. It’s important to have a control group because that way, you can compare your changed page to how it would have performed without the change.

We will talk more about control groups and their importance in the next section.

3. Unit

In our landing page example, each Treatment will be seen by a number of page visitors. The first visitor might see Treatment A and the next might see Treatment B. Each page visitor that is assigned to a treatment is called a Unit.

4. Response

So our units are assigned to our different Treatments, and the goal of our experiment is to see which landing page converts better. For us, a conversion may simply be whether the page visitor clicks on the “Join” button on the page.

Our conversion of each unit, therefore, is the response that we’re testing for.

A response is any variable that can be affected by our treatments; some examples include conversion, time spent on the page, scroll depth, social shares, etc. Variables that are not a unit’s responses are things like traffic source, browser, or device used to access the page.

5. Concomitants

We also have the variables that are not affected by our treatments, called the Concomitants (weird word, I know. For the longest time, I read it as contaminants).

In our example, a concomitant would be age, gender, browser, interactions with us before the treatment started, etc.

So, in short, in our experiment, units, each with their own concomitants, go through treatments, and we want to evaluate how those treatments change the response.

6. Samples

Finally, we also have Samples. A sample is just a group of units. These units could all be undergoing the same treatment, in which case, that is our sample for that treatment. Our sample could also mean the group of units going through our experiment.

So whenever we talk about a sample, we should always be clear about which group of units is specifically being referred to.
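To make these terms concrete, here’s a minimal sketch of how the data for our landing page experiment might be laid out, one row per unit. The use of pandas and all of the values below are my own assumptions, purely for illustration:

```python
import pandas as pd

# One row per unit (page visitor). The treatment column records which
# landing page the visitor was assigned to, the converted column is the
# response, and the remaining columns are concomitants.
experiment = pd.DataFrame({
    "unit_id":   [1, 2, 3, 4],
    "treatment": ["A", "B", "A", "B"],   # hypothetical assignments; B is the control
    "converted": [1, 0, 0, 1],           # response: clicked "Join" or not
    "browser":   ["Chrome", "Safari", "Firefox", "Chrome"],  # concomitant
    "age":       [34, 28, 45, 52],       # concomitant
})

# A sample is just a group of units, e.g. everyone who saw Treatment A:
sample_a = experiment[experiment["treatment"] == "A"]
```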

What are Controlled Experiments?

Alright, so now that we’ve got the terminology out of the way, we can start digging into how to set up and evaluate experiments.

Controlled experiments are where we have control over how an experiment is set up and how it will be conducted.

A controlled experiment lets us reduce the effect of the environment and thus examine the variable(s) in question more accurately.

A very popular example of a controlled experiment is an A/B test. In an A/B test, we randomly assign units (most likely, users) to either see Landing Page A or B, receive promotion A or B, or be enrolled in email sequence A or B.

We then want to find out, for example, if Landing Page A or B converts better, if promotion A or B results in more profit, or if email sequence A or B results in higher open or click-through rates.

But there are some important questions we need to be aware of when doing these tests:

  1. How can we be sure that these higher conversions aren’t due to some other properties of our users? Like users who receive email sequence A were more engaged to begin with?
  2. How do we know that this higher conversion didn’t come about by chance?

Ultimately, we want to know: how sure are we that there was an effect and how big was the effect?

How to Evaluate Controlled Experiments for Causation

What is Random Treatment Assignment?

Let’s tackle the first question first. When we’re running our controlled experiment, the best way to ensure that there aren’t hidden factors affecting our result is to assign treatments to units randomly.

And when I say random, I mean flip of a coin, roll of a die random, not you manually going through and saying A, B, B, B, A, B, A, A, etc.

This random assignment will result in the known and unknown concomitants balancing out. This means that they will all affect the response equally, or their effects will average out. Pretty neat, right?

Fortunately, if you use any A/B testing software, it will do the random assignment for you (and if it doesn’t, you may want to use a different software).
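If you ever do need to handle the assignment yourself, coin-flip randomization is only a line or two of code. Here’s a rough sketch using NumPy; the library choice and the visitor count are my own assumptions, not something any particular testing tool prescribes:

```python
import numpy as np

rng = np.random.default_rng()

n_visitors = 10_000
# Each visitor is assigned to treatment A or B with equal probability,
# which is the "flip of a coin" assignment described above.
assignments = rng.choice(["A", "B"], size=n_visitors)
```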

So, if at the end of our experiment, we have a conversion rate of 15% for treatment (landing page) A and a conversion rate of 12% for treatment (landing page) B, then the average causal effect of treatment A is 3 percentage points (15% – 12%).

That means that, on average, treatment A leads to an increase in conversion of 3 percentage points.
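In code, that calculation is just a difference in sample means. Here’s a tiny sketch with made-up per-visitor conversion data (the numbers are invented for illustration):

```python
import numpy as np

# Hypothetical per-visitor conversions (1 = converted, 0 = did not convert)
conversions_a = np.array([1, 0, 1, 0, 1, 1, 0, 1, 0, 0])  # landing page A
conversions_b = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])  # landing page B (control)

rate_a = conversions_a.mean()             # 0.5, i.e. 50% conversion
rate_b = conversions_b.mean()             # 0.3, i.e. 30% conversion
average_causal_effect = rate_a - rate_b   # 0.2, i.e. 20 percentage points
print(f"Average causal effect: {average_causal_effect:.0%}")
```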

What is Statistical Hypothesis Testing?

Okay, so we’ve now taken measures to minimize the effects of other known and unknown factors, but how do we know that the change in outcome isn’t due to chance?

To answer this, we can use statistical hypothesis tests, which tell us how likely it is that the change we see is due to chance. That way, we can decide whether we’re willing to make the change based on that likelihood.

When talking about statistical hypothesis testing, we usually have two outcomes that are possible:

  1. There is no difference between the results
  2. There is a difference between the results

In statistics, the problem is usually approached using something called the null hypothesis, which says that we assume there is no difference between the treatments we’re testing.

We also have something called the alternative hypothesis, which says there is a difference between the treatments.

Notice how we don’t specify anywhere how large this difference needs to be, since the main aim of hypothesis testing is to see if there’s a statistically significant difference between the two treatments.

That is not to say that a bigger difference does not affect the statistical hypothesis test outcome, because it does, but rather, that a higher statistical significance does not necessarily mean a larger difference in the outcome.

  • If we find that there is a statistically significant difference between the outcomes of the treatments, then we reject the null hypothesis and accept the alternative hypothesis.
  • If there’s not enough evidence for a statistically significant difference between the outcomes of the treatments, then we fail to reject the null hypothesis (i.e. we keep assuming there’s no difference).

So now that we know what we’re aiming for, let’s see how we can go about doing this.

Using the Two-Sample Student’s T-test

In cases where we only have two treatments that we’re testing, like landing page A vs landing page B, a very common test to run is called the student’s t-test. In this case, specifically, we would use the two-sample t-test (since we’re comparing two samples).

The student’s t-test is one of the simplest tests you can perform, and it can be very powerful.

Before we jump into it though, let’s quickly outline its assumptions so that we know when we can and can’t use it:

  1. The means of our treatment samples should follow a normal distribution,
  2. The variances of our treatment samples should be equal, and
  3. Each unit should be sampled independently from each other.

So what does this mean for us?

Well, fortunately, some powerful mathematical ideas (like the central limit theorem and Slutsky’s theorem) mean that if we use random assignment of treatment to units, we’re basically all set. So again, random assignment saves us a lot of work.

If you want to be mathematically careful, you could use Welch’s t-test, which doesn’t assume equal variances, but in practice, it usually doesn’t make a difference.

The one thing to watch out for is assumption 3, i.e. when our data points aren’t sampled independently of each other. Again, we don’t have to worry about this with random treatment assignment, but we will come back to it later when random assignment is not an option.

Okay, so back to the student’s t-test.

The student’s t-test gives us three values:

  1. T-statistic
  2. Degrees of freedom
  3. P-value

We won’t get into the formulas for calculating these values here, since you’ll most likely have software calculate them for you.

Instead, we’ll focus on understanding the result.
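That said, if you’d like to see roughly what this looks like in practice, here’s a minimal sketch using SciPy on simulated conversion data. SciPy and the made-up conversion rates are my own assumptions; your testing tool will typically report the same quantities:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical per-visitor conversions (1 = converted, 0 = did not convert)
conversions_a = rng.binomial(n=1, p=0.15, size=5000)  # landing page A
conversions_b = rng.binomial(n=1, p=0.12, size=5000)  # landing page B (control)

# Two-sample Student's t-test (equal variances assumed).
# Pass equal_var=False if you want Welch's t-test instead.
t_statistic, p_value = stats.ttest_ind(conversions_a, conversions_b)

# For the classic two-sample t-test, the degrees of freedom are n_a + n_b - 2.
degrees_of_freedom = len(conversions_a) + len(conversions_b) - 2

print(f"t-statistic: {t_statistic:.3f}")
print(f"p-value:     {p_value:.4f}")
print(f"degrees of freedom: {degrees_of_freedom}")
```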

What is the P-value?

The t-statistic and the degrees of freedom can be used together to give us the p-value, so what exactly is the p-value?

Remember, in a statistical hypothesis test, we start off with a null hypothesis, which just means that we initially assume that there’s no difference between our two treatments.

The p-value then tells us how likely it is that any difference in average causal effect that we see is due to chance.

The p-value ranges between 0 and 1, where 0 would mean there’s a 0% likelihood that the difference is due to chance and 1 means it’s 100% likely that it’s due to chance.

If you have a p-value of 0.2, this means that it’s 20% likely that the difference in average causal effect that we see is due to chance. In other words, on average, 1 in every 5 (⅕ = 20%) such tests you run will show a difference that comes down purely to chance.

That doesn’t sound too good, right? That’s why you usually look for lower p-values.

It’s become a standard (although a highly discussed standard) to look for p-values of at most 0.05. That means that there’s only a 5% chance (i.e. 1 in 20) that your results were due to chance.

This threshold value that you use is called the significance level.

Of course, choosing the right significance level is also up to you. If you only run a couple of tests and change your landing page maybe 3–4 times a year, then 5% seems fine.

However, if you’re constantly iterating your product and making dozens, or even hundreds, of changes a year, then a good chunk of those changes will have been made (or not made) based on results that came down to chance at that 5% level. In those cases, you’ll probably want to go lower, to 1% or 0.1%, depending on what exactly you’re doing.

What happens if you have more than two samples?

In some cases, though, we may have more than two variations that we’re testing. Maybe you want to test whether adding a series of changes affects conversion.

So instead of having 2 landing pages, you maybe now have 4:

  1. The control
  2. Call-to-action copy changed
  3. Call-to-action copy changed + Submit button color changed
  4. Call-to-action copy changed + Submit button color changed + Product benefits location changed

To test if each of these pages is performing differently from each other, you can then use the ANOVA test.

The ANOVA test can be thought of as a generalization of the t-test for when you have more than 2 sample groups. It will tell you whether at least one of these 4 pages has a conversion rate that differs from the others, or if all the pages are performing the same; to pin down exactly which pages differ, you’d follow up with pairwise (post-hoc) comparisons.
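Here’s a rough sketch of how you might run that with SciPy’s one-way ANOVA on made-up data; the library choice, group sizes, and conversion rates are all assumptions on my part:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical per-visitor conversions for the control and the three variants
control   = rng.binomial(1, 0.10, size=2000)  # 1. the control
variant_2 = rng.binomial(1, 0.11, size=2000)  # 2. CTA copy changed
variant_3 = rng.binomial(1, 0.12, size=2000)  # 3. + submit button color changed
variant_4 = rng.binomial(1, 0.14, size=2000)  # 4. + product benefits location changed

# One-way ANOVA: is at least one group's mean conversion rate different?
f_statistic, p_value = stats.f_oneway(control, variant_2, variant_3, variant_4)
print(f"F = {f_statistic:.2f}, p = {p_value:.4f}")
```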

What are the Type I + Type II errors and Statistical Power?

Before we conclude this section on statistical hypothesis testing, let’s address three more important terms that often come up when talking about hypothesis testing.

  1. Type I error
  2. Type II error
  3. Statistical Power

As mentioned above, when we’re testing for statistical significance, we get a number from our p-value which tells us the likelihood that the difference we see is due to chance.

In other words, it also tells us the chance that we wrongly reject the null hypothesis. This is also known as a Type I error and tells us about the false-positive rate, meaning we wrongly conclude there is a difference (positive) even though in reality there isn’t (false).

However, there’s also another important error to know of, the Type II error.

A Type II error is when we wrongly accept (fail to reject) the null hypothesis even though there really is a difference; this is also known as a false negative.

What is Statistical Power?

So with the p-value, we have our number for the Type I error, but how do we get our number for the Type II error?

This is where statistical power becomes important.

Statistical power tells us how likely it is that we didn’t make a Type II error; so to get the chance of making a Type II error, all we have to do is find the statistical power and then take 1 – statistical power.

So how do we find the statistical power?

The formula for statistical power is just Pr(reject null hypothesis | alternative hypothesis is true), or, in words, it’s the probability (Pr) that we reject the null hypothesis given that the alternative hypothesis is true.

However, since we don’t actually know whether the alternative hypothesis is true, we have to calculate the power under assumptions about the effect we’re trying to detect.

Our statistical power, therefore, depends on:

  • Desired significance level
  • Sample size
  • Effect size (see Effect size section below)

This means our statistical power gets rephrased as: “for our current sample size and significance level, what’s the likelihood that we correctly reject the null hypothesis given a true causal effect of X?”, where X is the average causal effect we observed (or the smallest effect we care about detecting).

The statistical power is another common term you’ll see pop up in your software when you’re running tests. The default acceptable value for statistical power is 80% (0.8), paired with the statistical significance of 5% (0.05). But as before, you can also adjust these values based on your needs.

Another use of statistical power comes into play when wanting to calculate sample sizes. Since we can calculate the power from the desired significance, given sample size and effect size, we can also calculate the sample size required to observe a specific effect size with the desired significance and statistical power. 
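As a sketch of both uses, here’s how this might look with statsmodels’ power calculator for the two-sample t-test; the library choice and the example effect size of 0.2 are my own assumptions:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# 1. Power of an existing design: given effect size (Cohen's d), per-group
#    sample size, and significance level, how likely are we to detect the effect?
power = analysis.solve_power(effect_size=0.2, nobs1=500, alpha=0.05)
print(f"Power with 500 units per group: {power:.2f}")

# 2. Required per-group sample size to detect a small effect (d = 0.2)
#    at 5% significance with the usual 80% power.
n_per_group = analysis.solve_power(effect_size=0.2, alpha=0.05, power=0.8)
print(f"Units needed per group: {n_per_group:.0f}")
```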

So now that you know where it comes from, I hope you feel more comfortable evaluating how your experiments are going based on the statistical power and statistical significance you get.

What is the Effect Size?

Now that we know how to determine if the results change between different treatments, another important question is how large the changes are between the samples. This is what the effect size is for.

The effect size assumes that the difference in your results is significant and, based on that, looks to answer how large the effect is.

Effect sizes can be communicated using measures like:

  1. the average difference in treatment results,
  2. a standardized average difference in treatment results, or
  3. a correlation value.

The Difference Methods

The first method, looking at the average difference in the result, is also known as difference-in-differences.

To get the effect size using this method, we simply find the average change of the response variable between experiment start and end for our treatment, or each of our treatments, and then subtract from it the average change of the response variable between experiment start and end for the control group.

This method is handy since it’s very straightforward to use and understand.
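As a quick illustration, here’s what that calculation looks like with made-up before/after conversion rates (the numbers are purely hypothetical):

```python
# Hypothetical conversion rates measured before and after the change was rolled out
treatment_before, treatment_after = 0.12, 0.16
control_before, control_after = 0.12, 0.13

# Difference-in-differences: the treatment group's change minus the control
# group's change over the same period.
effect = (treatment_after - treatment_before) - (control_after - control_before)
print(f"Estimated effect: {effect:.1%}")  # 3.0%
```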

Another method is to scale the above average difference by the pooled standard deviation, giving a standardized average difference, also known as Cohen’s d.

This then gives us a measure of the difference while controlling for the amount of variance seen in the result of each treatment.

Although Cohen’s d isn’t bounded to a fixed range the way correlation coefficients are, the effect size can still be interpreted in a similar way.

As a general rule of thumb, a Cohen’s d of 0.2 is a small effect size, 0.5 is medium, 0.8 is large, 1.2 is very large, and 2.0 is extremely large.

Cohen’s d also has the advantage that it’s easier to compare effects across different experiments since the effect is standardized to the amount of variance seen in the result of the treatments.
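Here’s a small sketch of computing Cohen’s d by hand; the time-on-page numbers are simulated, and the pooled standard deviation below uses the standard two-sample formula:

```python
import numpy as np

def cohens_d(sample_a, sample_b):
    """Standardized mean difference between two samples, using the pooled std."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a = np.var(sample_a, ddof=1)
    var_b = np.var(sample_b, ddof=1)
    pooled_std = np.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (np.mean(sample_a) - np.mean(sample_b)) / pooled_std

rng = np.random.default_rng(1)
time_on_page_a = rng.normal(loc=65, scale=20, size=1000)  # seconds, treatment A
time_on_page_b = rng.normal(loc=60, scale=20, size=1000)  # seconds, control

print(f"Cohen's d: {cohens_d(time_on_page_a, time_on_page_b):.2f}")  # roughly 0.25, a small effect
```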

Correlation Methods

The correlation strength gives a value for how strongly two variables are related.

The two most common ways of calculating these are using the:

  1. Pearson correlation coefficient
  2. Spearman rank correlation coefficient

Both of these values range between -1 and 1, where -1 means a perfect negative relationship and +1 means a perfect positive relationship.

I won’t go into more detail about what these numbers look like in a graph, since I covered that in depth in part 1 of this blog post series, and we also looked at it in the post on scatter plots.

So what’s the difference between the Pearson and the Spearman rank test?

The Pearson r correlation coefficient measures linear correlation, whereas the Spearman rank correlation measures monotonic correlation, which can also be non-linear.

Now you may wonder: “if the Spearman rank also handles non-linear relationships, why should I ever use the linear version?”

We can see that in both of these images, the Spearman rank is equal to 1, saying that as x increases, so does y. Yet the Pearson correlation coefficients of the two plots are already different and not as close to 1.

In this case, the Spearman rank tells us that both of these graphs are increasing in y as x increases, and, in the eyes of the Spearman rank, these graphs aren’t really distinguishable. 

Now, to be fair, we can’t really know what each graph looks like just using the Pearson correlation coefficient either, but it does already tell us that these graphs are different and that the left graph is further from being linear than the right.

Both of these examples aren’t very realistic though, and once we start to add noise, the Pearson r and the Spearman rank are much more similar, as you can see in the graph below.
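If you want to get a feel for this yourself, here’s a small sketch comparing the two coefficients on a made-up monotonic but non-linear relationship; the exponential curve and the noise level are arbitrary choices of mine, not anything from the plots above:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 200)

# y always increases as x increases, but not by the same amount everywhere.
y_clean = np.exp(x)
print(pearsonr(x, y_clean)[0])   # well below 1: the relationship isn't linear
print(spearmanr(x, y_clean)[0])  # exactly 1: the rank order matches perfectly

# Once noise is added, the two coefficients typically end up much closer together.
y_noisy = y_clean + rng.normal(scale=3000, size=x.size)
print(pearsonr(x, y_noisy)[0])
print(spearmanr(x, y_noisy)[0])
```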

So, at the end of the day, it again comes down to what exactly is best for your situation. You can see how the two coefficients behave differently.

To make your life easier, you can ask yourself: “Do I 1) care about a consistent, linear increase, or 2) am I more interested in checking for a continual increase in the same direction, even if the amount of increase varies depending on where I am on the graph?”

If you care more about the first, then use the Pearson correlation coefficient, and if you care more about the second, then use the Spearman rank.

If you’re still undecided, you can, of course, also use both.

Another important thing to know is that if you do use the Pearson r, then another option open to you is looking at the r^2 value (i.e. the square of the Pearson r). The r^2 value, also known as the coefficient of determination, tells you the proportion of the variance in your dependent variable that can be explained by the independent variable.
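For completeness, here’s how you might get both values in one go; the paired measurements below are hypothetical, and SciPy is again my own choice of tool:

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical paired measurements of an independent and a dependent variable
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r, _ = pearsonr(x, y)
r_squared = r ** 2  # coefficient of determination
print(f"r = {r:.3f}, r^2 = {r_squared:.3f}")
```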

I hope that this article has given you a little more insight into the things to look for when you’re conducting a controlled experiment.

However, you won’t always be able to conduct controlled experiments, or you might get data that has already been collected; in those cases, you’re dealing with observational experiments and need to approach the problem differently.