How to Determine Causal Relationships in Observational Studies


In the previous blog post, we looked at controlled experiments and saw what techniques we can use to properly analyze them.

Unfortunately though, we don’t always have the option to use a controlled experiment.

There are times when data has already been collected and we still want to properly analyze it…

…or other times when a controlled experiment is infeasible or unethical (e.g. withholding treatment from patients).

Of course, there’s still a lot of value contained within these observational study datasets, and it would be a waste not to make use of them, but we need to make sure we analyze these types of datasets properly or else our conclusions may be wrong or invalid.

How to Approach Observational Studies

Observational studies can be very powerful and telling because they allow us to investigate effects we may not be able (or allowed) to run experiments for.

To make sure you approach observational studies correctly though, let’s first go through the different types of observational studies you can encounter.

After that, we’ll look at the different components that you need to be aware of that are different from controlled experiments.

3 Types of Observational Studies

There are a few different types of observational studies that you can do; so to understand some terminology, let’s go into more detail for each of these and how you can approach them.

1. Case Control Study

A case control study is a type of observational study that looks at data at a specific point in time. The aim is to understand why the units in your group have different outcomes. For example, you’re looking at your website data and find that some people have purchased your product and others have not.

With a case control study, you’re trying to identify what concomitants may lead to the different outcomes observed.

One of the main indicators for analyzing concomitant strength in case control studies is the odds ratio, which compares the odds of the outcome happening with this concomitant to the odds of the outcome happening without it (the odds being the probability that the outcome happens divided by the probability that it doesn’t).

For example, you could compare the odds of a purchase when the user is already on your email list to the odds of a purchase when they’re not on your email list. This way, you can get a better understanding of the effect of the emails that you send and how being on your email list may impact purchase intent.

The further your odds ratio moves away from 1, the more strongly associated the events are.

  • If the odds ratio is exactly 1, then the events are independent.
  • If your odds ratio goes above 1, that means your concomitant is positively correlated with your outcome.
  • If it goes below 1, that means it’s negatively correlated with your outcome.

When analyzing the odds ratio, note that the lower bound is 0, whereas there is no upper bound.

Therefore, keep in mind that an odds ratio of 1/20 (i.e. 0.05) and one of 20/1 (i.e. 20) describe an equally strong association; it’s just the direction that’s different.
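To make the calculation concrete, here is a minimal sketch of an odds ratio for the email list example. The counts and the function name are made up purely for illustration; they are not real data. Taking the logarithm is one common way to see that 1/20 and 20 are equally far from "no association", just in opposite directions.

```python
import math

def odds_ratio(with_yes, with_no, without_yes, without_no):
    """Odds ratio from the four cells of a 2x2 table of counts."""
    odds_with = with_yes / with_no            # odds of the outcome with the concomitant
    odds_without = without_yes / without_no   # odds of the outcome without it
    return odds_with / odds_without

# Hypothetical counts: purchases among users on / not on the email list.
on_list_bought, on_list_not = 40, 160
off_list_bought, off_list_not = 50, 750

ratio = odds_ratio(on_list_bought, on_list_not, off_list_bought, off_list_not)
print(round(ratio, 2))  # 3.75

# On the log scale, 20 and 1/20 are mirror images of each other around 0.
print(round(math.log(20), 3), round(math.log(1 / 20), 3))  # 2.996 -2.996
```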

Also, keep in mind that the odds ratio cannot tell you about the direction of the relationship, i.e. what comes first and what follows (e.g. buying users may be more likely to join your email list since they’ve purchased and would like to stay updated).

This is why it’s important to always consider these values alongside the subject knowledge you have and just ask yourself: “Does this make sense?” and “Is there a plausible explanation for this?”

It’s good practice to always have a relative and an absolute measure of effect size since sometimes, one can be deceiving. Therefore, alongside the odds ratio, you usually also want to look at an average effect size using differences in the mean outcome.
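As a small sketch of why a relative measure alone can be deceiving, the following compares an odds ratio with an absolute difference in purchase rates. The counts are hypothetical and chosen only to illustrate a rare outcome.

```python
# Hypothetical counts for a rare outcome: (purchases, total users).
on_list = (2, 1000)
off_list = (1, 1000)

rate_on = on_list[0] / on_list[1]
rate_off = off_list[0] / off_list[1]

# Relative measure: odds ratio.
relative = (rate_on / (1 - rate_on)) / (rate_off / (1 - rate_off))

# Absolute measure: difference in the mean outcome (purchase rate).
absolute = rate_on - rate_off

print(f"odds ratio: {relative:.2f}")            # about 2, which sounds large
print(f"difference in rates: {absolute:.3f}")   # 0.001, i.e. 0.1 percentage points
```

Here the odds roughly double, yet only one additional purchase per thousand users is involved, which is why it helps to report both numbers.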

2. Cross-Sectional Study

In a cross-sectional study, you’re also going to be looking at data at just a certain point in time. In many cases, cross-sectional and case-control studies are interchangeable.

The main difference is that, unlike a cross-sectional study, a case-control study tries to investigate only a very small part of the population because it’s looking at very specific outcomes.

For example, let’s say you’re looking at users that have converted and purchased your awesome product. Seeing as conversion rates are typically low single digits, you have a very small set of the population with this outcome.

When you prepare your data for analysis (discussed further below), you may end up with an even smaller sample out of all the available data on people who haven’t converted to purchase your product.

In case-control studies, your units are usually just individual people, but in cross-sectional studies, your units can be individuals or a larger grouping such as a landing page.

In a cross-sectional study, you may be comparing the length of a landing page with its conversion percentage across many different landing pages that you have data for, whereas in a case-control study, you’d be investigating conversions on a landing page on an individual level.

Cross-sectional studies focus on analyzing the whole population. Therefore, with cross-sectional studies, you usually also look at the prevalence of certain factors, such as differences in conversion rates between different marketing campaigns.

Of course, you can still treat data on an individual level in cross-sectional studies, which means there’s a lot more you can do with cross-sectional studies, such as also diving down to try to identify causal relations between specific variables or matched subsamples.

At this point, though, the distinction isn’t very clear anymore and you could argue that your cross-sectional study has become a case-control study.

3. Longitudinal Study

What both of the above mentioned observational study types have in common is that the data is seen as a snapshot, and thus, you’re not tracking how individuals evolve with time.

In a longitudinal study, you collect several data points for each individual so that you can analyze their progression over time.

For example, in a case control study, you may be trying to understand causal relations from users that purchased a product after reading an email versus similar users that did not.

In a longitudinal study, you would look for causal relations by analyzing users over time. For example, you would look at the difference in user behavior over time between those users that purchased a product after a certain email and those users that didn’t.

Longitudinal studies can be more powerful than cross-sectional studies in determining causal relationships since we’ve got several data points for each unit and can look at the progression over time.

How to Prepare Concomitants in Observational Studies

To make sure we aren’t comparing apples to oranges in an observational study, we need to make sure that the samples in the treatment and control groups are balanced.

Why We Need to Balance Our Samples in Observational Studies

In controlled experiments, ensuring samples are comparable was much easier because all we had to do was randomly assign treatments (and we could add on some blocking if we liked), and we could be pretty certain that our samples were about equally distributed.

However, that, unfortunately, is not guaranteed in observational studies.

Let’s do an easy example first and say you were looking to compare the effects of drinking coffee daily on overall work productivity.

Would it be fair to just simply compare the output at work for all the people who drink coffee every day and all the people who never drink coffee?

Probably not, right?

There are way too many other factors that could potentially be at play: the type of work they do, what time of day they drink coffee, whether they love their work, whether they even have a job, etc.

A fairer comparison would require you to find similar ‘types’ of people in both groups (the coffee-drinkers and the non-coffee-drinkers) in order to properly compare their productivity and output.

Essentially – if we have a Bob in one group, we ideally want to have a Bob in the other group – because that will give us much more accurate results for isolating the effect of the precise variable we’re looking to test for.

Of course, in real life, having a copy of one person in your other group is unlikely, so we can settle for having a Bob-like person in the other group.

Okay – that example was pretty obvious though. Let’s do a less obvious example.

Let’s say you want to compare the effects of different levels of education completion on job prospects.

We may falsely assume that we can just compare the people that completed high school to the people that completed college to the people that completed a Master’s degree.

If we just took pre-collected data on education level and current salary, our results would probably be inaccurate.

The reason for this is that your samples are probably going to be very imbalanced. For example, it’s possible that a lot of your subjects who finished college only had the opportunity because they came from a more privileged background.

They didn’t have to worry about making ends meet at home, and their families provided for them enough that they could just focus on school. On the other hand, those in the sample that ended their education early may not have had much of a choice. It could be that some kids had to start helping out in the family business since the family depends on it.

A sample that graduated with a Master’s degree cannot be compared to a sample that graduated only with a high school diploma, as the subjects may be too different in their demographics, family history, disposable income, and other factors that affect the treatment assignment of your study.

The samples that you’re comparing should ideally be more or less alike in all variables that could affect the results, as they would be under random assignment. In observational studies, this is not naturally the case, which is why you need matching to ensure that the samples are actually similar enough to be compared across groups.

Therefore, you need to balance your samples first, to make sure you’re actually comparing things that are comparable.

Those that finished college may differ in many factors from those that were only able to finish high school. These two samples would not be comparable.

More comparable samples would be those that finished college who had a similar profile to those that finished only high school.

To do this, you need to think about what you want to test and then find your relevant sample groups for each treatment.

What is Matching?

Once you’ve got your samples set up, you want to think about balancing the two samples. You can do this with a concept called matching.

Matching techniques are a bit more complex, so we won’t get into their formulas, and you can usually have your software or your code just do it for you. The idea of matching though is that you balance your treatment groups so that the distribution of concomitants is about equal in the different samples. 

A standard way to go about this is called pair matching, where you match each data point in one treatment to a similar data point in the other treatment.

Unfortunately, you can only match on the concomitants that you know though, and not on the ones that you don’t know. That means that there can still be imbalances between unknown, or unmeasured, concomitants. However, this is as good as we can do in these situations.

With matching, you’re at least able to guarantee that your known concomitants are close to equally distributed across the two samples.

If you’re satisfied with how the concomitants are distributed across the samples, you can now start your analysis. 

In case you’re not satisfied with the distribution, the best solution would be to get some more data to match with, which increases the number of data points that are similar in each sample.

Because matching tries to make these distributions more even, your sample size is going to be at most as large as the smallest treatment sample you have, since you’re basically picking out only a selection of the larger treatment sample so that that selection is similar to your other, smaller sample.

For example, using our school education example from above, let’s say we’ve got a sample of 200 people that have stopped education early and 5,000 people that have finished basic education.

If we use pair matching, that means for each of the data points in one treatment, we want to find a similar data point in the other treatment. That means our final samples would (in the best case scenario) be 200 for the people that have stopped education early and 200 for the people that have finished basic education.
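To give a feel for what the software does for you, here is a rough sketch of one simple variant, greedy nearest-neighbor pair matching without replacement. It is only an illustration under stated assumptions: the data is simulated, the two concomitants (age and family income), the distance scales and all function names are hypothetical, and real matching packages use more careful algorithms.

```python
import random
from statistics import mean

def distance(a, b):
    """Scaled distance between two units given as (age, family_income) tuples."""
    # The scales (10 years, 20 thousand) are assumptions so neither concomitant dominates.
    return abs(a[0] - b[0]) / 10 + abs(a[1] - b[1]) / 20

def pair_match(small_group, large_group):
    """Greedy 1:1 matching: each unit in the small group takes the closest
    still-unused unit from the large group."""
    available = list(large_group)
    pairs = []
    for unit in small_group:
        best = min(available, key=lambda candidate: distance(unit, candidate))
        available.remove(best)  # without replacement: each unit is used at most once
        pairs.append((unit, best))
    return pairs

def mean_income(group):
    return mean(unit[1] for unit in group)

random.seed(0)
# Simulated, purely illustrative data: (age, family income in thousands).
stopped_early = [(random.gauss(30, 8), random.gauss(35, 10)) for _ in range(200)]
finished = [(random.gauss(35, 10), random.gauss(55, 20)) for _ in range(5000)]

pairs = pair_match(stopped_early, finished)
matched_finished = [match for _, match in pairs]

print(len(pairs))  # 200 pairs, limited by the smaller sample
print("mean income, stopped early:    ", round(mean_income(stopped_early), 1))
print("mean income, finished (all):   ", round(mean_income(finished), 1))
print("mean income, finished (matched):", round(mean_income(matched_finished), 1))
```

The matched subset of the larger sample should end up with a mean income much closer to the smaller sample than the full larger sample has, which is the balancing effect matching is after.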

This way, we’re able to create more similar sample sets, and the benefit of the larger dataset is mostly in providing us with more data points to choose from when matching against the smaller dataset.

In cases when your data sizes are much more similar to each other, say about 500 in each group, then matching will systematically pair and drop parts of the data samples until it finds an appropriate subset that’s maximally balanced while still trying to keep as much data as possible.

Another great advantage of matching is that it’s a non-parametric approach.

What does that mean? It means you don’t have to worry about your samples fulfilling parametric assumptions like being normally distributed, which means it’s more widely applicable and less prone to errors from assumptions (specifically parametric assumptions).

Problems with Matching

Unfortunately, there are some problems associated with matching.

First, we’re reducing our samples to make them equal, but then we may not be able to analyze what we’re really interested in. For example, say our 200 people that have stopped education early are all from a less privileged family background. That means the 200 matched people in our sample of those that have finished basic education will also be from a less privileged background.

Therefore, we don’t have the ability to analyze how stopping education early affects more privileged families, since our sample is not representative of that population.

Another problem to be aware of is that matching doesn’t guarantee an equal distribution of unknown or unmeasured concomitants.

For example, let’s now imagine that 90% of the 200 people in our sample that have stopped their education early did so because they wanted to help with or take over the family business. In that case, we would still consider all of those units fully employed.

Now imagine that in our matched sample of 200 people that didn’t stop their education early, only 70% of those got a job after finishing their education.

If we didn’t know or track whether or not each unit had a family business, but only tracked if they were fully employed or not, we may find the result that stopping your education early can lead to better employment rates.

We can’t generalize these findings because the running of a family business is a unique opportunity that isn’t given to every person; so we can’t say “everyone should quit high school because studies show that it leads to better employment opportunities” as not everyone may have a family business that they can help operate.

This is the type of sensationalized, inaccurate and generalized finding that is often broadcast in the news and that leads to mass panic and false long-term beliefs.

As you can probably imagine, if there are important variables that also have an effect on the result (also called covariates) and we don’t measure or account for these, our results will be off.

In the above example, if we had information on whether they had a family business or not, then we could have also matched on that variable, obtained a better-matched sample, and avoided the above problem.

A pretty big problem in observational studies is that, since we’re not able to guarantee an approximately equal distribution of known and unknown concomitants, our results may only be applicable to the sample that we studied, and we may not be able to generalize these results to the rest of the population of interest.

This is also referred to as a sample bias, where our sample is not representative of the full group that we want to extrapolate our results to.

Therefore, it’s customary to use observational studies as indicators but to confirm results through controlled experiments.

In cases where controlled experiments aren’t an option, it usually takes several observational studies, or large sample sizes, to make more definitive conclusions, although even then, you can’t guarantee equal unknown concomitant distributions.

T-Statistic and Matching

In some cases, even in some scientific papers, it’s become standard to use the t-test to evaluate the result of matching. 

The idea is that if we can use the t-test to find if there’s a significant difference between two means, then we could also use it to check if there’s an insignificant difference.

If the difference is insignificant across the two samples then that property is probably equally matched across our treatment groups, right?

Unfortunately, it doesn’t actually work that way, and I’d highly recommend not using t-tests to evaluate the performance of the matching. The t-statistic is not only influenced by balance, but also by sample sizes and variances, so if the value is changing, you can’t just assume it’s because the balance is becoming better.
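A quick sketch of why this goes wrong: the snippet below computes Welch’s t-statistic from summary statistics for a hypothetical concomitant (say, age) whose imbalance is exactly the same before and after matching. Only the sample size changes. All numbers are invented for illustration.

```python
import math

def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch's t-statistic computed from summary statistics."""
    standard_error = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    return (mean_a - mean_b) / standard_error

# Same means and standard deviations in both cases, so the balance is identical.
t_before = welch_t(32.0, 10.0, 2000, 30.0, 10.0, 2000)  # before dropping units
t_after = welch_t(32.0, 10.0, 50, 30.0, 10.0, 50)       # after dropping units

print(round(t_before, 2))  # 6.32, looks "significantly" imbalanced
print(round(t_after, 2))   # 1.0, looks "balanced", yet nothing about the balance changed
```

The t-statistic dropped simply because matching threw data away, not because the groups became more alike.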

Instead, for each of the concomitants that are being matched, plot and compare the distribution across the resulting treatment groups and see how they compare to each other.

If you want to use a metric to measure your balance, you can compare the difference in means between the treatment groups relative to the standard deviation of the concomitant (the standardized difference in means). As a rule of thumb, the absolute value of this metric should not exceed 0.25.

You could alternatively also look at the propensity scores, non-parametric density estimates, higher order moments, or a quantile-quantile plot.

If you don’t know what one or any of these things are, don’t worry about it; you should already do pretty well with the above-mentioned visual method and the standardized difference in means. I just wanted to make sure to give you some extra options if you want to dive deeper.
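As a minimal sketch of such a balance metric, the snippet below assumes the common "standardized difference in means" definition: the difference in group means divided by a pooled standard deviation of the concomitant. The ages are hypothetical values made up for the example, and the function name is my own.

```python
from statistics import mean, stdev

def standardized_mean_difference(treated, control):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = ((stdev(treated) ** 2 + stdev(control) ** 2) / 2) ** 0.5
    return (mean(treated) - mean(control)) / pooled_sd

# Hypothetical ages in the two matched treatment groups.
treated_ages = [23, 25, 31, 35, 38, 41, 44, 52]
control_ages = [22, 27, 30, 33, 39, 43, 45, 58]

smd = standardized_mean_difference(treated_ages, control_ages)
print(round(smd, 2))
print("balanced enough" if abs(smd) <= 0.25 else "still imbalanced")
```

You would compute this once per matched concomitant and look at it alongside the distribution plots, not instead of them.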

Also, be aware that it’s very unlikely that you’ll get perfectly balanced samples, so in the end, there will always be residual concomitant imbalances. That’s fine and usually inevitable, and in the Residual Concomitant Imbalances section below, we’ll talk about how to address this.

How to Evaluate Observational Studies

Once proper matching has been done so that we are actually comparing similar samples, you can then start looking for causal relationships or other interesting indicators, such as prevalence values.

From the results of an observational study, it may be tempting to extrapolate to the rest of your data, but you need to be careful here.

It’s very common that in observational studies, the sample we’re studying differs from the population that we’re interested in. We need to keep in mind that our results are only valid for the sample in the observational study, and the population represented by that sample. 

What does that actually mean though? Let’s think about this through an example.

Imagine you’re trying to understand how having a pool affects the happiness of guests at hotels. Since only upper-tier hotels usually have pools, it’s likely that your matched sample ultimately ends up comparing guest satisfaction in 4-star and 5-star hotels which either do or don’t have pools.

Your sample is therefore 4-star and 5-star hotels, but your population may be all hotels.

Let’s say you now find that having a pool increases guest satisfaction dramatically, regardless of whether it’s a 4-star or 5-star hotel, and you’re approached by a hotel that wants to improve its guest satisfaction. You may be tempted to recommend that they get a pool because you found a dramatic effect, but what if that new hotel is a 3-star hotel?

Unfortunately, since your above observational study didn’t include 3-star hotels, you can’t extrapolate your results since your new subject of interest is not described by the sample you studied.

You may find that adding a pool in a 3-star hotel increases guest satisfaction, or that it does nothing, or that it lowers it. How could that be?

It could be that the reason that 4-star and 5-star hotels see dramatic improvements in guest satisfaction is that they’ve got all the bases covered and now they’re adding luxury items.

If your 3-star hotel can still improve by making the rooms more comfortable, improving the choices and quality of its food items, or hiring more staff, then the conditions of the 4-star and 5-star hotels you studied before are not matched by the 3-star hotel you’re looking at now.

It could be that adding a pool takes resources away from other areas and you may find that you don’t have the necessary budget to keep the pool warm and clean. Now, guests feel cheated because they missed out on a luxury they were promised because the pool was freezing or dirty, and guest satisfaction goes down.

So, the point is: always keep in mind the sample you studied in your observational study, and don’t try to generalize the results to cases that aren’t represented by the sample.

How to Deal with Residual Concomitant Imbalances

As we’ve seen from everything above, matching is a very prevalent topic when talking about observational studies. 

Although matching can do a good job of increasing the balance of concomitants across our different samples, it’s likely that the samples aren’t perfectly matched and there are still differences between the concomitant distributions across each of our samples.

These are usually referred to as residual concomitant imbalances, and they’re important to address since these imbalances can still affect the outcome.

Although we’ve focused on observational studies here, and controlled experiments have the advantage of randomization, even in controlled experiments it’s likely you’ll have some sort of residual concomitant imbalance.

Therefore it’s important we try to assess their effects on the relationship we’re trying to investigate, regardless of whether they are observational studies or controlled experiments.

There are several ways you can approach this problem, but one of the most common ways is using a multivariate linear regression.

With a multivariate linear regression, you simply fit a linear regression model by including not only the independent variable you’re mainly interested in, but also the other concomitants that you have information on.

That way, your linear model will have a coefficient for each of your concomitants, and you can compare relative causal effects by comparing the size of your concomitant coefficients to the size of the coefficient of the independent variable (as long as the variables are on comparable scales, for example after standardizing them).

Multivariate linear regression is a great way to control for the residual concomitant imbalances and you’ll usually just have the option to fit a linear model to your data and can just read off the coefficients without doing much extra work… but it’s important not to stop there.

Don’t just take the values of your coefficients and leave it at that, but also make sure to calculate an uncertainty for each coefficient. That way, you know how sure you can be of the effect strength and size.
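To illustrate the idea of reading off coefficients together with their uncertainties, here is a small self-contained sketch of an ordinary least squares fit with standard errors. The data is simulated, the variable names (a binary treatment and an age concomitant) are hypothetical, and in practice you would normally let a statistics package do this for you rather than hand-rolling it.

```python
import math
import random

def invert(matrix):
    """Invert a small square matrix with Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def fit_ols(rows, y):
    """Return (coefficients, standard_errors) for y ~ intercept + columns of rows."""
    X = [[1.0] + list(r) for r in rows]
    n, p = len(X), len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(p)] for i in range(p)]
    xty = [sum(x[i] * t for x, t in zip(X, y)) for i in range(p)]
    inv = invert(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(p)) for i in range(p)]
    residuals = [t - sum(b * v for b, v in zip(beta, x)) for x, t in zip(X, y)]
    sigma2 = sum(e * e for e in residuals) / (n - p)  # residual variance
    se = [math.sqrt(sigma2 * inv[i][i]) for i in range(p)]
    return beta, se

random.seed(1)
# Simulated data: outcome depends on a binary treatment and on age (a concomitant).
rows, y = [], []
for _ in range(500):
    treatment = random.choice([0, 1])
    age = random.gauss(35, 10)
    rows.append((treatment, age))
    y.append(2.0 + 1.5 * treatment + 0.1 * age + random.gauss(0, 1))

beta, se = fit_ols(rows, y)
for name, b, s in zip(["intercept", "treatment", "age"], beta, se):
    # Rough 95% interval: coefficient plus or minus 1.96 standard errors.
    print(f"{name}: {b:.2f} +/- {1.96 * s:.2f}")
```

Because the data was generated with a treatment effect of 1.5 and an age effect of 0.1, the fitted coefficients should land close to those values, and the intervals show how much wiggle room there is around each estimate.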

Finally, at the end of each analysis that you do, you always want to think about: “Do these results make sense?”

Remember, you’re using statistics to help you better understand a subject, but it doesn’t replace subject knowledge.

Ask yourself:

  • Is there a plausible explanation for this outcome?
  • Do my results make sense?
  • If I perform the same analysis on an updated data set, will my results hold?


Observational studies are a very useful way for us to analyze data that wasn’t gathered in a controlled way. They give us access to a lot more data and allow us to make good use of it, as long as we follow the guidelines above to minimize other sources of error.

However, at the end of all of it, observational studies do not have the same power or reliability that controlled experiments have. We cannot be sure about equal distributions of unknown concomitants, and we may not always be able to generalize our results to the population and instead, have to be satisfied with what our results say about our sample.

Whenever possible, it’s best to confirm the findings of an observational study by creating a controlled experiment around it. 

However, when this is not possible, it’s recommended that you perform the same study on several different data sets to make sure you’re getting similar results each time. This way, you can also better understand how your results may vary when the population represented by your sample changes.
