Skip to what you’re interested in reading:
- What are Box Plots?
- When to Use Box Plots
- Understanding the 6 Main Components of a Box Plot
- How to Read & Use Box Plots
- How to Make Box Plots in Python
- How Different Data Distributions Look Like as Box Plots
- Limitations of Box Plots
Continuing on in our data visualization series… last week, we covered scatter plots and this week, we’re moving on to the elusive box plot.
In this post, we go over what box plots are, the 6 key components of each box plot, when to use box plots, how to make them in Python, and how to understand them, as well as their limitations.
What are Box Plots?
Box plots, also called box and whisker plots, are one of the best visualization techniques for getting an understanding of how your data is distributed.
Data distribution is basically a fancy way of saying how your data is spread out.
A box plot allows you to easily compare several data distributions by plotting several box plots next to each other. (We’ll see examples of this below.)
A standard box plot looks like this:
Note that it doesn’t matter if your box plot is oriented horizontally or vertically; that’s left up to your personal preference. (I prefer vertical personally.)
In the above plot, you can also see all the key components necessary to create a box plot.
The 6 key components of a box plot are:
- Quartiles and the Interquartile Range
- Minimum and Maximum
- Outliers
- Median
- Whiskers
- (Optional) Notches
We will go into each of these individual components in more detail below.
When to Use Box Plots
Before we dive into the details of what each of those labels in the graphic above means, let’s first discuss when you actually should use a box plot.
You should use a box plot when:
- you want a quick statistical overview of how your data is distributed within one data set
- you want to compare the distributions of several different data sets
Now you may be thinking, “What about histograms, Max? Those are fantastic for seeing how your data is distributed.”
Histograms are great, but they don’t work as well if you’re comparing 10 different data sets and need to know all the key statistical terms (which we’ll go into in more detail in the next section) for each data set. With a histogram, you have to make educated guesses about what the median is, where the inner 50% of your data is, etc., based on looking at the graph.
Let’s do a quick example.
Let’s say you are looking to compare the number of cookies sold by 9 different Boy Scout troops.
Let’s take a look at what that would look like as histograms and as box plots.
That’s a whole lotta histograms and not a whole lotta insights at first glance.
Let’s look at box plots.
Already – the comparisons are easier to make between the box plots.
One quick observation is that group D… seems to have sold the fewest cookies by far.
So you can see the power of box plots when it comes to comparing and when it comes to getting a good overview of the data distribution for multiple sets of data.
With a box plot, it’s all just laid out for you clearly, and you can very quickly see differences between the data sets.
My recommendation? Use a box plot if you’re looking to compare different data sets and see how they are generally distributed.
Understanding the 6 Main Components of a Box Plot
Let’s go into each of these components in more detail.
1. Quartiles and Interquartile Range
The quartiles split our data into 4 equal buckets to allow us to quickly see how concentrated our data is.
The interquartile range (IQR) tells us about the spread of the inner 50% of our data and how densely packed the data around the median is.
The quartiles are a general statistical definition. The goal, as mentioned above, is to split your data into four buckets that each contain the same number of data points.
- The first quartile (Q1) is the region that contains the first 25% of all data (0 – 25%),
- the second quartile (Q2) is the region that contains the second 25% of all data (25 – 50%),
- the third quartile (Q3) is the region that contains the third 25% of all data (50 – 75%), and
- the fourth quartile (Q4) is the region that contains the last 25% of all data (75 – 100%).
Box plots explicitly use Q1 and Q3 to define where the box starts and ends.
The first quartile and third quartile are indicated by lines that show the end of the first and third quartile. This central region (from the end of Q1 to the end of Q3) is visualized as a box (hence the name “box plot”).
Another way of looking at it is that the first quartile is the point where we’ve reached 25% of the data below the median value, and the third quartile shows when we’ve reached 25% above the median (more on medians below.)
The distance from quartile 1 to quartile 3 is called the interquartile range (IQR), and it tells you how dense the central part of your data is.
A low IQR means your data is very densely packed in the center, and a high IQR means it’s more sparse.
Important to keep in mind: since the IQR depends only on the end points of Q1 and Q3, it cannot tell us anything about our distribution other than the range between Q1 and Q3. Don’t use it to infer how your data is distributed inside or outside your box.
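As a small sketch of the definitions above, here is one way to compute the box edges and the IQR with numpy. The data values are made up for illustration. Keep in mind that numpy’s default percentile method interpolates linearly between data points, so other tools may give slightly different quartiles on small data sets.

```python
import numpy as np

# Hypothetical example data, already sorted for readability
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)

# End of Q1 (25th percentile), median (50th) and end of Q3 (75th percentile)
q1, median, q3 = np.percentile(data, [25, 50, 75])

# The interquartile range is the length of the box
iqr = q3 - q1

print(f"Q1 = {q1}, median = {median}, Q3 = {q3}, IQR = {iqr}")
```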
2. Minimum and Maximum
The Minimum/Maximum value is the cut-off value you still consider to be part of the “normal” range of values. Any point beyond the minimum/maximum is considered an outlier.
Let’s discuss the minimum and maximum together since the only difference between their explanation is if they’re at the top end of the data or at the bottom end.
Note: To make the explanation easier to read, we’ll just use the word minimum, but you can just replace that word with maximum when reading to get the explanation for the maximum.
The minimum, strangely enough, is actually not always the actual minimum of the data.
In box plots, the minimum is basically defined as the minimum value in our data that still makes sense to include as part of the distribution yet is not an outlier.
Here, we are saying the minimum depends on what we consider an outlier to be, when the same can be said about outliers – they depend on what we consider the minimum to be.
We’ve got ourselves a classic chicken and egg problem here.
And the truth is – there is no universal answer that works for all data sets.
But to help you figure this out, let’s look at 2 different scenarios of data distribution you could have.
In scenario one, you have a data distribution where there are clear outliers or data points that are completely dissimilar to the other data.
These situations are easy – just separate out the outlier.
However… what happens if there is no clear outlier?
Things get a little iffy in scenario two; when the concentration of data drops off gradually, it’s harder to make the decision of when to ‘cut off’ your data and consider certain values as your outliers.
In these scenarios, it comes down to your subject knowledge, trial and error, and your subjective opinion in making sure you are producing the most informative box plot for your data.
But to give you some useful tools, here are 2 different methods you can use to determine the minimum and maximum for your box plot.
Method #1: Interquartile Range Multiple
- Minimum = Q1 – 1.5 * IQR
- Maximum = Q3 + 1.5 * IQR
This just takes the length of the interquartile range and multiplies it by 1.5 (or any other value if you think a different one is better to use in your situation). This can be useful because it’s a very easy definition, but it does cause problems if the central part of your data is very concentrated or, alternatively, very sparse.
In both extremes, you’ll either get a minimum that lies way too close or way too far from the center of the distribution, giving you either way too many, or way too few outliers.
In case your data is skewed, i.e. it extends much longer in one direction than in the other, the definition can be extended to be a smaller multiple of the IQR for the denser part of the distribution, and a larger multiple of the IQR for the longer tail part of the distribution.
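Here is a minimal sketch of Method #1, including the skewed variant with different multiples on each side. The function name `iqr_fences`, its parameters, and the sample data are my own assumptions for illustration, not a standard API.

```python
import numpy as np

def iqr_fences(data, k_low=1.5, k_high=1.5):
    """Return (minimum, maximum) cut-offs as Q1 - k_low*IQR and Q3 + k_high*IQR."""
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    return q1 - k_low * iqr, q3 + k_high * iqr

# Hypothetical example data
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)

# Standard symmetric cut-offs using 1.5 * IQR on both sides
minimum, maximum = iqr_fences(data)

# Skewed data with a long upper tail: smaller multiple below, larger above
skew_min, skew_max = iqr_fences(data, k_low=1.0, k_high=3.0)
```

The multiples 1.0 and 3.0 are arbitrary here; which values make sense depends on how your own data is distributed.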
Method #2: 1st and 99th percentile
Another definition you can use is the 1st and 99th percentile (or another pair of lower and upper percentiles that you feel is a good definition).
This way, you become independent of the rest of your distribution and are solely considering the edges. This also means that the minimum and maximum can have different distances from the median value so your whiskers (see below) can be different lengths.
This definition is nice because it gives you a standard to use, which only considers the most extreme values, and basically says everything outside of that standard is an outlier.
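A minimal sketch of Method #2 with numpy could look like this. The helper name `percentile_cutoffs` and the sample data are assumptions made for this example.

```python
import numpy as np

def percentile_cutoffs(data, lower=1, upper=99):
    """Return the (lower, upper) percentiles of data as minimum/maximum cut-offs."""
    minimum, maximum = np.percentile(data, [lower, upper])
    return minimum, maximum

# Hypothetical data: the numbers 0..100, so the percentiles are easy to check by eye
data = np.arange(101, dtype=float)
minimum, maximum = percentile_cutoffs(data)
```

If you’re plotting with Matplotlib, its `boxplot` function also accepts a pair of percentiles for the `whis` argument, for example `whis=(1, 99)`, so you don’t have to compute the cut-offs yourself.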
How do you decide which to use?
Both of these methods have their issues.
Again, they’re just general statistical approaches and don’t have a lot of interpretive meaning behind them.
It’s like giving someone who scores 89% on a test a B and someone who scores 90% an A. There’s effectively no difference between these two scores, yet they’re labeled differently.
Similarly, if you have two values just on opposite sides of your minimum or maximum, the only thing that makes one an outlier and the other not is a cut-off based on a general definition that doesn’t include any actual knowledge about your data.
So which definition should you use?
This depends largely on you, your preferences, and what works for the specific situation.
As I mentioned above, in these scenarios, it comes down to your subject knowledge, trial and error, and your subjective opinion in ensuring that you are creating the most informative box plot for your data.
Although I’ve just mentioned some of the downsides of using a cutoff based completely on some random statistical definition, there are also a lot of pros to it: using either of the methods above is a very easy and quick way to define the minimum and maximum, and it generally works.
That’s the thing about statistics: it’s all about the “general” stuff.
If you notice that the general definition is too large or too small, you can also tweak the values until the graph looks more suited and clear.
Outliers show you what some of your most extreme values are but they are not representative of your data set. Usually, you separate off outliers from the rest of your data so that you have a visualization that shows you where most of your data is.
Of course, if you have specific values in mind for when outliers begin, based on your knowledge of the field the data comes from, then you should use those. Otherwise, I wouldn’t worry too much about finding the “perfect” cutoffs, because it won’t change the end result of the visualization much.
3. Outliers
Outliers show non-representative values that your data can take on.
You can then show these as individual points to get a clearer picture of what the normal range of values is, and which points are outside of this normal range.
An outlier is any point that extends lower than the minimum or higher than the maximum. In box plots, they’re shown as individual data points and fully depend on how you define your minimum and maximum.
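To make that dependence concrete, here is a small sketch that splits a data set into “normal” points and outliers for whatever minimum and maximum you choose. The function `split_outliers` and the data are hypothetical, and the 1.5 * IQR cut-offs are just one possible choice.

```python
import numpy as np

def split_outliers(data, minimum, maximum):
    """Split data into (inliers, outliers) given the chosen minimum and maximum."""
    data = np.asarray(data, dtype=float)
    is_outlier = (data < minimum) | (data > maximum)
    return data[~is_outlier], data[is_outlier]

# Hypothetical data with one value far from the rest
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]

# Cut-offs from the 1.5 * IQR rule; any other definition works the same way
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
inliers, outliers = split_outliers(data, q1 - 1.5 * iqr, q3 + 1.5 * iqr)
```

Change the two cut-off values and the set of outliers changes with them, which is exactly the chicken-and-egg problem described earlier.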
4. Median
The Median splits our data into two equal halves, which shows us where the most central region of our data is, and also lets us compare the spreads in the two halves, or 4 quartiles, against each other.
You may be familiar with the conventional meaning of the word ‘average’ as the sum of all numbers divided by how many numbers there are.
But in reality, there are actually 3 different types of averages: the mean, the median, and the mode.
For box plots, we always display the median and we use it as our average of choice.
The median value is the point that lies in the middle if you sort your data from lowest to highest (or highest to lowest). It’s the point that has an equal number of data points both above and below it.
When you have an odd number of data points, like “1, 2, 3” then the median value is easier to find since only one point has an equal number of data points above and below it. In this case, the median value is 2. It has one data point below it (1) and one above it (3).
In cases where we have an even number of data points, like “1, 2, 3, 4”, the median is not so obvious. Both 2 and 3 have problems: 2 has one value below it (1) and two above it (3, 4), and similarly 3 has two values below it (1, 2) and one above it (4).
In this case, you can use either of the two middle values for the median, so 2 or 3, or you can take their average, so (2+3)/2 = 2.5. Both definitions are fine and they won’t change your box plot unless you have very few data points, at which point it may not be a good idea to do box plots anyway.
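Here is a quick sketch of the cases above in Python. numpy’s `median` takes the average of the two middle values, while the standard library’s `statistics` module also lets you pick the lower or the higher one.

```python
import statistics
import numpy as np

odd = [1, 2, 3]
even = [1, 2, 3, 4]

median_odd = np.median(odd)    # the single middle value
median_even = np.median(even)  # the average of the two middle values

# The standard library can also pick one of the two middle values instead
low_choice = statistics.median_low(even)
high_choice = statistics.median_high(even)
```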
5. Whiskers
The Whiskers indicate the region outside of the central box where we can still expect to find data.
The whiskers are the lines that go from the minimum to the end of Q1, and from the end of Q3 to the maximum.
These whiskers show you how far your data extends from the end of the box on either side until you start reaching your outliers.
Note that, depending on your definition of the minimum and maximum, your whiskers can have different lengths as the minimum and maximum can be at different distances from the end of the first quartile (for the minimum) and third quartile (for the maximum).
This is where the whiskers part of the name “box and whiskers plot” comes from, although that part is commonly left out and the plot is just referred to as a box plot.
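If you want to see the whisker positions as numbers rather than lines, Matplotlib exposes the statistics it uses for drawing through `matplotlib.cbook.boxplot_stats`. The sketch below uses made-up data. One thing to note is that Matplotlib draws each whisker to the most extreme data point that still lies within the cut-off, not to the cut-off value itself.

```python
import numpy as np
from matplotlib.cbook import boxplot_stats

# Hypothetical data with one value far from the rest
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], dtype=float)

# boxplot_stats returns one dict of statistics per data set
stats = boxplot_stats(data, whis=1.5)[0]

lower_whisker_end = stats["whislo"]  # lowest point inside Q1 - 1.5 * IQR
upper_whisker_end = stats["whishi"]  # highest point inside Q3 + 1.5 * IQR
outliers = stats["fliers"]           # everything beyond the whiskers
```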
6. (Optional) Notched box plots
The Notch indicates the uncertainty of our median, and gives the range of values our median can still take on, since we don’t have an infinite sample size.
You can also add a notch to your box plot which, if we do it to our box plot from above, looks like this.
You can see that there’s now a part of the central box that has become angled inwards towards the median on either side.
The goal of the notch is to indicate the most probable region that the median value can lie in, based on the data you have now and what you would expect if you continued gathering more data. More specifically, it shows an approximate 95% confidence interval around the median.
A notch is useful because it informs you about the uncertainty in your data, and it also makes comparing several box plots easier. Take a look at the following box plots:
Above, we have two cases of two separate box plots. In the left plots, the two notches cover a similar region, telling us that we can’t be sure which group of data has the lower/higher median value.
In the right plots, the notches clearly show that it’s very likely that one box plot’s median is higher than the other.
A more statistical way of expressing this is if two notches of two different box plots do not overlap then there’s evidence for a statistically significant difference between the median values of the two different data sets.
The length of the notches from the median is defined as “1.58*IQR/sqrt(n)”, where “n” is your sample size and “IQR” is your interquartile range. Therefore you can see that the more data we have, the more confident we become of where the median actually is.
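As a rough sketch, the notch formula and the overlap check can be written out like this. The helper names and the three groups of data are made up for illustration, and plotting libraries may use a slightly different constant than 1.58.

```python
import numpy as np

def notch_interval(data):
    """Return (low, high) of the notch: median +/- 1.58 * IQR / sqrt(n)."""
    data = np.asarray(data, dtype=float)
    q1, median, q3 = np.percentile(data, [25, 50, 75])
    half_width = 1.58 * (q3 - q1) / np.sqrt(len(data))
    return median - half_width, median + half_width

def notches_overlap(a, b):
    """True if the notch intervals of the two data sets overlap."""
    a_low, a_high = notch_interval(a)
    b_low, b_high = notch_interval(b)
    return a_low <= b_high and b_low <= a_high

group_a = np.arange(1, 10)  # hypothetical values 1..9
group_b = group_a + 1       # shifted slightly: the notches overlap
group_c = group_a + 10      # shifted a lot: the notches do not overlap

overlap_ab = notches_overlap(group_a, group_b)
overlap_ac = notches_overlap(group_a, group_c)
```

Because of the `sqrt(n)` in the denominator, quadrupling the sample size halves the notch length for the same IQR.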
How to Read & Use Box Plots
Now that we’re familiar with all the different components in a box plot, let’s go into an example so that we can properly examine a box plot.
The first thing that probably sticks out to you in a box plot is the central box.
Notable features to look for when it comes to the central box are:
- Where is the median located?
- How does the box look on either side of the median?
- How large is the box?
- How long are the notches (if it’s a notched box plot)?
Let’s apply this to an example. Let’s say there are two classes – Class A and Class B. At first glance, there are no big differences between Class A and Class B.
The following box plot shows the final test results of two school classes, Class A and Class B. The “N” in the plot means the sample size.
Take a look at the box plot, give it some thought, and see if you can use the above points to make some conclusions.
Try to answer the question: is one class scoring better than the other?
Okay, so immediately we see that Class B’s median is higher than Class A’s, and Class B’s box is, in general, higher than that of Class A.
However, we also notice that Class B has 10 fewer students than Class A, which is pretty significant for sample sizes of 20 – 30 students per class.
Class A also has two high-performing outlier students who both got close to 100 on their tests.
So this raises a question that is very common when dealing with data:
Is Class B actually doing better than Class A, or are the results due to chance?
Here are some factors to take into consideration when comparing box plots:
- Size of each box
- Position of the median
- Size and overlap of the notches
- Whether the whiskers are abnormally long in one direction
If you take a look at these factors, you may notice that Class B has a larger box and longer notches, indicating more uncertainty and spread, which is pretty typical of smaller sample sizes.
However, note how there’s no overlap between the two notches; this should indicate to us that we can be pretty certain that the difference is significant and that the result is not just due to chance.
This means that we can be quite certain Class B is doing better than Class A.
And now, we should try to find the reason why Class B is doing better than Class A.
Some example questions you can ask yourself for further investigation are:
- Is the better result of Class B due to a smaller class size?
- Is there a difference in the teaching style between the two classes?
- Is there a difference between the time of day/day of week each class took the test? If so, what effect could that have had?
Or… it could just be that Class B has more studious students and fewer troublemakers.
Of course, there are many more aspects to investigate, and what you’ll probably find is that a combination of several different reasons resulted in the difference.
One more thing I’d like to point out about the above graph is that the data I generated for the two classes actually has the same standard deviation. Here you can see the same graph as above, but I’ve upped the sample size of Class B to be equal to that of Class A.
Notice how the box lengths are almost equal now. Therefore it’s very important to be aware that you can get a lot more fluctuations from smaller sample sizes, and you have to be careful when making conclusions about data distributions when sample sizes are low.
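You can check this effect yourself with a small simulation. The sketch below repeatedly draws samples from the same gaussian distribution and measures how much the box length (the IQR) fluctuates from sample to sample. The mean, standard deviation, sample sizes, and number of trials are all assumptions picked for illustration; they are not the values used for the graphs above.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

def iqr(sample):
    """Length of the box for one sample."""
    q1, q3 = np.percentile(sample, [25, 75])
    return q3 - q1

def iqr_spread(n, trials=500, mean=70, stdev=10):
    """Standard deviation of the IQR across many random samples of size n."""
    iqrs = [iqr(rng.normal(mean, stdev, size=n)) for _ in range(trials)]
    return np.std(iqrs)

spread_small = iqr_spread(20)   # class-sized samples
spread_large = iqr_spread(200)  # ten times more data per sample

print(f"IQR fluctuation with n=20:  {spread_small:.2f}")
print(f"IQR fluctuation with n=200: {spread_large:.2f}")
```

The box length varies noticeably more between small samples than between large ones, even though every sample comes from the same distribution.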
I hope you feel more comfortable using and reading box plots, and deciding when to use them based on what conclusions you are hoping to find.
How to Make Box Plots in Python
Making box plots in Python is very easy. We’ll be doing it using a very popular data science programming library called Matplotlib.
In case you don’t have any of your own data to play with or visualize, don’t worry, we can use the library numpy to generate some random data for us.
Specifically, we’ll create data that follows a gaussian distribution because that type of data is best suited for box plots. I’ll show you why in a second.
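import numpy as np
mean = 5
stdev = 3
# draw 200 random samples from a gaussian with the given mean and standard deviation
gaussData = np.random.normal(mean, stdev, size=200)
Great. Now that we have some data, let’s plot it on a notched box plot. In its simplest form, we can do this super quickly like this:
import matplotlib.pyplot as plt
plt.boxplot(gaussData, notch=True)  # notch=True draws the notches around the median
plt.show()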
Neat, right? The default length of the whiskers in Matplotlib here is “1.5*IQR”, and we can see that anything beyond that is shown as individual circles, indicating the outliers.
Of course, you should still add titles and labels at the very least, so let’s do that now alongside also creating several box plots on one graph. We’ll use our code from above to create another set of data with a mean of 8 and a standard deviation of 5, which we’ll call “gauss2Data”.
Let’s also make the box plots horizontal instead of vertical this time, to switch things up.
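gauss2Data = np.random.normal(8, 5, size=200)  # second data set: mean 8, standard deviation 5
plt.boxplot([gaussData, gauss2Data], notch=True, vert=False)  # vert=False makes the boxes horizontal
plt.title("Horizontal notched boxplots of 2 gaussian distributions")
plt.show()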
As you can see, box plots are extremely easy to make using Python, and they’re just as easy to customize.
How Different Data Distributions Look Like as Box Plots
Okay, so we know what box plots are and we know how to make them in Python, but you may not be completely clear yet as to what different data distributions look like when you turn them into box plots.
Let’s take a look at what some different data distributions look like in box plot form.
Here are the 4 different distributions we’ll consider:
- Linear distribution
- Quadratic distribution
- Normal distribution
- Cumulative distribution
Now let’s plot each graph’s box plot next to it and see what it looks like:
Notice how hard it is to differentiate the cumulative and the linear distribution (to be fair, they do have a similar shape).
Because the average that is being used in a box plot is the median, its position can actually tell us a lot about the underlying distribution. If the median is much closer to one side than the other, that means we’ve got a lot more data on the side it’s closer to.
Since we know that 25% of our data lies between the median and each edge of the box (the box contains 50% of our data), we can also use that to infer how densely the data is distributed on either side.
For example, in both the linear and cumulative plot, the median value is located closer to the right edge of the box than the left, indicating that the data towards the right of the median is more densely packed.
For the quadratic box plot, we see the opposite: the box is very long, indicating the data is very sparse near the median value.
Note: There is a limitation with the whiskers if you use whisker lengths dependent on the interquartile range. Let’s take a look at linear distribution in more detail, and also add just one extra datapoint at x = 6 to see what happens.
The two distributions and their box plots look like this:
Hmm, the two box plots look almost identical except our right whisker has become much longer. This can deceive us into thinking the data is more spread out past the right edge than it really is.
So although box plots are really great at giving us a great overview of how our data is distributed and letting us easily compare different datasets, we need to keep these types of effects in mind.
If you move away from using a multiple of the IQR for your whisker lengths and instead use, for example, the 1st and 99th percentile to define the minimum and maximum value, this issue will be resolved. However, this means you’ll have to do extra modifications to your plots.
Making extra modifications isn’t a problem if you know what type of problems you can run into, but it’s very difficult to know what to do if you don’t, so it’s important to be aware of potential limitations and problems you may run into when using box plots.
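Here is a small numerical sketch of the effect described above, using the statistics Matplotlib computes for its box plots. The “linear” data is my own stand-in (values between 0 and 5 whose density rises linearly), so the exact numbers will differ from the plots in this post.

```python
import numpy as np
from matplotlib.cbook import boxplot_stats

# Hypothetical "linear" data on [0, 5]: the density rises linearly towards 5
linear = 5 * np.sqrt(np.linspace(0, 1, 201))
with_extra = np.append(linear, 6.0)  # the one extra data point at x = 6

# IQR-based whiskers (Matplotlib's default of 1.5 * IQR)
before = boxplot_stats(linear, whis=1.5)[0]
after = boxplot_stats(with_extra, whis=1.5)[0]

# Percentile-based whiskers: the most extreme ~1% on each side,
# including the lone point at 6, are reported as outliers instead
after_pct = boxplot_stats(with_extra, whis=(1, 99))[0]

print("upper whisker, original data:  ", before["whishi"])
print("upper whisker, one extra point:", after["whishi"])
print("upper whisker, 1st/99th pct:   ", after_pct["whishi"])
```

With the IQR-based definition, a single point is enough to stretch the upper whisker all the way out to 6, while the percentile-based whisker stays inside the bulk of the data.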
Limitations of Box Plots
As with all data visualization tools, there are times when they shine and times when they prove to be problematic.
We’ve just seen an example of this above when we added one extra data point to our distribution and how the whiskers stretched out to match it.
We’ve also seen how different distributions look on box plots, and how that limits our understanding of the data.
So keeping these issues in mind, it’s important to remember:
Box plots are best used on normal (gaussian) distributions, because there they give us a good sense of our distribution without hiding high concentrations of data past the edges of our boxes.
Specifically, they are best for distributions with one peak. Once we have more than one peak (like a bi-modal distribution), we run into problems similar to those we saw in the section above when looking at the box plot for the quadratic distribution.
For non-normal distributions or for distributions with more than one mode, you should consider using violin plots instead as they’ll help give a better understanding of these more complex distributions.
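To see the difference, here is a minimal sketch that draws a box plot and a violin plot of the same bi-modal data side by side. The data, the seed, and the output file name are all made up for this example.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line to use an interactive window
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Hypothetical bi-modal data: two well-separated gaussian peaks
bimodal = np.concatenate([rng.normal(2, 0.5, size=300),
                          rng.normal(8, 0.5, size=300)])

fig, (ax_box, ax_violin) = plt.subplots(1, 2, sharey=True)
ax_box.boxplot(bimodal)
ax_box.set_title("Box plot")
ax_violin.violinplot(bimodal, showmedians=True)
ax_violin.set_title("Violin plot")
fig.savefig("bimodal_box_vs_violin.png")  # hypothetical output file name

median = np.median(bimodal)
# Share of data points that lie within 1 unit of the median
near_median = np.mean(np.abs(bimodal - median) < 1)
```

The box plot places the median in the gap between the two peaks, where there is almost no data, while the violin plot shows both peaks directly.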
Also, generally be wary of drawing conclusions from very low sample sizes. In the “How to Read & Use Box Plots” section, we saw that, in the first graph, Class B’s box was wider than that of Class A, possibly indicating a larger variance.
However, when we made the sample sizes equal, this effect went away, and we saw that this effect came about by chance due to the very small sample sizes used.
And that’s about it! That is a basic guide and a basic look into box plots, how they’re used, what they look like, and how to make them. I hope this was useful!