Skip to what you’re interested in reading:
- What are Histograms?
- How to Read Histograms
- When to Use Histograms
- How to Use Histograms
- How to Make Histograms in Python
- Limitations of Histograms
Next up in our Deep Dive into Data Visualization series comes histograms! As mentioned in our other blog posts on scatter plots and on box plots, data visualization is an instrumental part of many careers, including data scientists, data analysts, machine learning engineers, business analysts, marketers, product analysts and so on and so forth.
They are instrumental for good reason.
I’ll be the first to admit – raw data is not the most fun thing to look at, and worse yet, raw data is almost impossible to draw conclusions or make recommendations from.
Enter, data visualization.
Humans are largely visual beings who process images and remember images much better and faster than they do text; hence, data visualization allows raw data to come to life and communicate with us in our language of choice: via pretty pictures, basically.
So without further ado, let’s get into histograms, what they are, how to read them, when and how to use them, how to make them in Python, and finally, the limitations of histograms.
What are Histograms?
Histograms normally consist of an x-axis and a y-axis, and are made up of a series of bars, also called bins.
Histograms are visualizations that allow us to see how the values of our data points are distributed. They show us which ranges contain a lot of data and which are more sparse.
Here is an example of a histogram:
The y-axis of a histogram always shows a measure of frequency. This measure of frequency can be either:
- An absolute count (literally a count of how many times it appears), or
- A relative count (a count done relative to how many data points there are in the data set).
In this case, I’ve generated some data to show a (very basic) possible distribution of daily steps taken for a set of students at university.
Let’s take a look at these bins in more detail. We’ll first give an abstract definition, and then we’ll consider the above example.
Each bin has a starting point, a width (or a size), and an associated count that’s represented by the height of the bin. The bin’s count (the y-axis) is the number of values that fall within the region that the bin covers. The region that the bin covers starts at its starting point and goes up to, but not including, the starting point plus the width of the bin.
Also note that, generally, the bins in a histogram will all have the same width (size); however, this may not always be true.
Alright, so let’s use the above graph to give a practical example of each of the above terms.
If we count the number of yellow bars or boxes, which are usually referred to as bins for histograms, we’ll count 12 bins.
Each of these bins starts at a certain location; for example, the first bin starts at 1000, the second at 1250, the third at 1500, and so on. The last bin starts at 3750.
Each bin also has a width associated with it. In this case, all the bins are of width 250. Visually, a bin goes from its starting point to its starting point + its width, which we’ll call its endpoint. For example, our first bin goes from 1000 to 1250, the second from 1250 to 1500, and so on.
Each bin also has a height, and that’s also referred to as the count associated with the bin. The count is the number of data points that lie within its starting point going up to, but not including, its ending point.
For example, let’s say we have 6 data points with values [1000; 1100; 1249; 1249.9; 1249.999; 1250], how would these get distributed into the bins above?
- Bin 1 goes from 1000 up to, but not including, 1250.
- So the points [1000; 1100; 1249; 1249.9; 1249.999] would all fall within that bin.
- Bin 2 starts at 1250 and goes up to, but not including 1500.
- So the data point 1250 would fall into the second bin.
Therefore, for the above data points:
- Bin 1 has a count of 5, and
- Bin 2 has a count of 1.
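To make the counting rule concrete, here’s a minimal sketch in plain Python. The helper name `bin_counts` is my own for illustration, and it assumes equal-width bins that include their starting point but not their endpoint:

```python
import math
from collections import Counter


def bin_counts(values, start, width):
    """Count values per bin, where bin i covers [start + i*width, start + (i+1)*width)."""
    counts = Counter()
    for value in values:
        index = math.floor((value - start) / width)  # 0 for the first bin, 1 for the second, ...
        counts[index] += 1
    return counts


data = [1000, 1100, 1249, 1249.9, 1249.999, 1250]
counts = bin_counts(data, start=1000, width=250)
print(counts[0], counts[1])  # counts for bin 1 and bin 2
```

Running this prints 5 and 1, matching the counts we worked out by hand: 1249.999 still lands in the first bin, and 1250 is the first value that belongs to the second.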
Long story short: A histogram will show us how many data points are located within specific ranges, and is visualized through a series of bins.
How to Read Histograms
So when we have a histogram, how do we read it and get value from it?
When we’re looking at histograms, we mainly look to see how our data is distributed across the range of values our data points take on.
This way, we know:
- what shape our data distribution takes on,
- what range of values are most common,
- and what extent our values stretch out to.
Let’s consider the histogram from above again, where I generated a simple distribution of steps taken by a group of university students.
Our histogram bins have a width of 250 steps, and our lowest bin starts at 1000, and the highest goes up to (but not including) 4000.
We see that most of the students walk between 2000 – 3000 steps, and very few walk more than 3500 steps or less than 1500 steps.
We also see that the bin with the highest count starts at 2250 and goes up to 2500.
What all of this means to us is that we now have a pretty solid understanding of how active our students are. From our sample, we see that most students walk 2 – 3 thousand steps, with some slightly below, and others slightly above.
Side note for those of you that are unfamiliar with steps and counting steps: My girlfriend got me a Fitbit a few years ago, and I can safely say that it makes you strangely competitive about your steps per day and you may end up checking your step count like a crazy person pacing circles in your living room in attempts to hit your 10K steps a day.
But anyway: 2 – 3 thousand steps is about the equivalent of walking around 1 mile a day, or walking a total of 20 mins a day at an average walking pace.
So we can probably say that most of these university students aren’t moving much during their day.
In the How to use Histograms section, we’ll go into more detail of how exactly we can make use of this knowledge.
But for now, we can also see from the histogram shape above that we have a distribution that is mainly grouped around a central value (somewhere between 2000 – 3000) and probably only has 1 peak; this is also called a unimodal distribution.
We can fiddle around with some mean and standard deviation values and try to fit a normal distribution to the shape of our histogram, like in the following graph.
For the above plot, I’ve used a mean of 2500 and a standard deviation of 500.
It’s not perfect, but it’s pretty good. (Actually, this is the distribution I randomly generated the data from so the mismatch here is just due to noise coming from the limited sample size.)
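If you want to try this comparison yourself, here’s a rough sketch using NumPy. The seed, the sample size of 500, and the variable names are assumptions for illustration; the mean of 2500, the standard deviation of 500, and the bins from 1000 to 4000 in steps of 250 come from the example above.

```python
import numpy as np

rng = np.random.default_rng(seed=0)      # arbitrary seed (assumption) so the run is reproducible
mean, std = 2500, 500                    # values quoted in the text
steps = rng.normal(mean, std, size=500)  # sample size is an assumption

bin_edges = np.arange(1000, 4001, 250)   # 12 bins of width 250, from 1000 to 4000
# density=True rescales the bars so that their total area is 1, like a probability density
density, _ = np.histogram(steps, bins=bin_edges, density=True)

# Evaluate the normal probability density at the middle of each bin
centers = (bin_edges[:-1] + bin_edges[1:]) / 2
normal_pdf = np.exp(-0.5 * ((centers - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

for center, observed, expected in zip(centers, density, normal_pdf):
    print(f"{center:6.0f}  histogram={observed:.5f}  normal={expected:.5f}")
```

The two columns won’t match exactly, which is the same sampling noise mentioned above; a larger sample size should bring them closer together.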
Although you’ll often find that your data follows a normal distribution, this is not always the case.
Take a look at the following distribution:
In this case, we’ve got two distributions. The one we had before centered around 2500, and a smaller set of students centered just above 10000 steps.
In this case, we have something called a bimodal distribution, where our final distribution is made up of two separate distributions.
It’s very likely that there’s some explanation that separates these two groups from each other.
(Some of these 10K steps students probably got a Fitbit for Christmas and are tapping into their inner competitive side to hit that daily 10K recommended steps.)
Bimodal distributions aren’t always so clearly separate though; they can be much closer together, like in the following image.
Tweaking bin sizes
In this case, our second distribution is around 4500, but this is much harder to see. If we decrease the bin width, it becomes a bit more obvious though.
If we change the bin size from 250 steps per bin to 100 steps per bin, it’s much easier to see the second distribution around 4500.
Note also that our counts have decreased, since the range that a bin of 250 steps used to cover is now split across several bins of 100. That means two bins which before covered 500 steps between them are now split into 5 bins, each covering a range of only 100 steps.
Splitting our data from bins of size 250 to those of size 100 means that data points that were grouped together before are now split into separate bins, so there are fewer data points in each bin, reducing the count in each bin.
Reducing the bin size isn’t always a good idea though. For example, if we make the bin size much smaller (each covering 2 steps), a lot of the otherwise obvious information disappears.
So changing the bin size is a good thing to play around with, tweaking in both directions, until you find a size that lets you best visualize how your data is distributed.
This is mainly just trial and error. You can try repeatedly cutting the bin size in half and, if you don’t see any new patterns emerging or your counts become too low, then move it back up to a size that lets you clearly see the distribution.
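To get a feel for how the bin width changes the counts, here’s a small sketch with NumPy. The data is simulated (the group sizes, spreads, and seed are my own assumptions), and `bin_summary` is a hypothetical helper, not something from a library:

```python
import numpy as np

rng = np.random.default_rng(seed=1)
# Hypothetical bimodal sample: a large group around 2500 steps, a smaller one around 4500
steps = np.concatenate([
    rng.normal(2500, 500, size=500),
    rng.normal(4500, 300, size=100),
])


def bin_summary(values, width):
    """Return the counts for equal-width bins that cover every value."""
    lo = int(np.floor(values.min() / width)) * width
    hi = int(np.ceil(values.max() / width)) * width
    edges = np.arange(lo, hi + width, width)
    counts, _ = np.histogram(values, bins=edges)
    return counts


for width in (250, 100, 2):
    counts = bin_summary(steps, width)
    print(f"width={width:>3}: {len(counts):>4} bins, largest count = {counts.max()}")
```

The total number of data points stays the same for every width; only how they are spread across the bins changes. With a width of 2, the largest count drops to a handful of points, which is why the shape becomes so hard to read.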
Splitting by another variable
Be aware though that, at some point, if your two distributions move too close together, they’ll merge in the visualization, and there’s no bin size tricks you can do to change that anymore.
For example, if our other set of students is now centered around 3500 rather than 4500, this is what the distribution looks like now.
Even if we reduce our bin size down to 50, it’s still very hard to tell there’s another distribution at around 3500. In fact, if I hadn’t told you, you probably wouldn’t have noticed (I wouldn’t have either).
That’s because at this point, our distributions are too close together for us to confidently be able to tell them apart.
If we split our data by a separate variable, such as whether or not the students use a step tracker, we could plot two histograms next to each other, like this.
Here we can once again differentiate the two distributions, but it was up to us to split the data into groups based on our knowledge of the data.
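As a rough sketch of how such a split plot could be made with Matplotlib: the two groups below are simulated (sizes, spreads, and seed are assumptions), and in real data you’d build them by filtering on your own grouping variable.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(seed=2)
# Hypothetical groups: students without and with a step tracker
no_tracker = rng.normal(2500, 500, size=500)
tracker = rng.normal(3500, 500, size=150)

# Use the same bin edges for both groups so the bars are directly comparable
width = 250
all_steps = np.concatenate([no_tracker, tracker])
lo = int(np.floor(all_steps.min() / width)) * width
hi = int(np.ceil(all_steps.max() / width)) * width
bins = np.arange(lo, hi + width, width)

# Passing a list of arrays draws the groups as side-by-side bars within each bin
counts, edges, _ = plt.hist(
    [no_tracker, tracker],
    bins=bins,
    label=["no step tracker", "step tracker"],
)
plt.xlabel("Daily steps")
plt.ylabel("Number of students")
plt.legend()
# plt.show()  # uncomment to display the figure when running as a script
```

Sharing the bin edges is the important design choice here: if each group got its own automatically chosen bins, the bar heights wouldn’t line up and the comparison would be misleading.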
In the above case, we saw different ways that we can go about identifying and visualizing a bimodal distribution. However, we’re not limited to just having a uni- or bimodal distribution; we can also have a multimodal distribution, which is basically anything with more than 2 peaks.
In the following graphic, you can see a histogram showing a distribution with 4 peaks.
The peaks here are at around 2500, 3500, 4500, and 5000 daily steps. Again, in these cases, it’s common to have to tweak the bin sizes until you get a histogram that clearly shows the distribution.
In this case, the bin sizes look pretty good.
If we made them smaller we’d just be reducing our count in each bin without seeing any new patterns, and if we make them bigger, the two peaks at 4500 and 5000 daily steps start to merge and the peak at 3500 gets absorbed by the larger distribution peaking at 2500, as you can see in the below graph.
In all of the above cases, our distributions were built from normal distributions. Although this is a very normal distribution (hehe) to see in real world data, there are also other common distributions that have a peak somewhere and then fall off when moving away from the peak.
We won’t get into all of the different types of distributions since that’s more well suited for an article on statistics, but we still need to cover another important feature you can see in these types of distributions.
Take a look at the graph below.
Like before, we have a peak and then a decay away from the peak, but the difference in this case is that the decay is much longer in one direction than in the other.
In this case, we see our graph stretch out far to the right. The sides of the distribution are called tails, and in this case, we have a long tail stretching to the right. The above shape is also called a right skewed distribution.
If our tail was extending longer to the left than to the right, it would be called a left skewed distribution.
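If you’d like a number to back up what you see in the plot, one common measure is the sample skewness (the average cubed distance from the mean, scaled by the standard deviation cubed). Here’s a minimal sketch; the helper `sample_skewness` and the simulated data are my own assumptions for illustration:

```python
import numpy as np


def sample_skewness(values):
    """Third standardized moment: positive for a long right tail, negative for a long left tail."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return (centered ** 3).mean() / values.std() ** 3


rng = np.random.default_rng(seed=3)
right_skewed = rng.exponential(scale=1000, size=2000)  # long tail to the right
left_skewed = -right_skewed                            # mirrored: long tail to the left
symmetric = rng.normal(0, 1, size=2000)                # no long tail on either side

print("right skewed:", sample_skewness(right_skewed))
print("left skewed: ", sample_skewness(left_skewed))
print("symmetric:   ", sample_skewness(symmetric))
```

A clearly positive value goes with a right skewed distribution, a clearly negative value with a left skewed one, and a value close to zero suggests the two tails are about the same length.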
Skewness can be an important characteristic to be aware of because it can extend the range of values your data can take on before you’d consider them outliers.
We’ll talk more about how to work with skewed distributions in the example we go through in the How to use Histograms section.
Another distribution that you may encounter is a uniform histogram distribution, which is pictured below.
This type of distribution may seem strange to look at or illogical to think about, but consider the counts of rolling a six-sided die. Each side has an equal chance of being rolled, which results in the following uniform distribution.
The distribution may not always be perfectly flat, but this is the general idea.
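You can simulate the die example in a few lines. This is just a sketch; the number of rolls and the seed are arbitrary choices of mine:

```python
import numpy as np

rng = np.random.default_rng(seed=4)
rolls = rng.integers(1, 7, size=6000)  # the upper bound is exclusive, so this gives 1 to 6

# One bin per face: edges 1, 2, ..., 7 (the last bin also includes its right edge)
counts, edges = np.histogram(rolls, bins=np.arange(1, 8))

for face, count in zip(edges[:-1], counts):
    print(f"face {face}: {count} rolls")
```

Each face should come up roughly 1000 times, but not exactly, which is the "not perfectly flat" effect described above.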
It’s not very common to naturally get this type of distribution, so if you see this or something close to this, you may want to check whether your data was already modified beforehand to be equally distributed.
The final type of histogram that we’ll look at is the cumulative histogram. Every histogram that we’ve seen so far has shown the number of data points that fall into each bin.
There’s actually another type of histogram that we can make that shows the total (cumulative) number of data points up to and including that bin.
In cumulative histograms, each bin shows the number of data points in the data set that are either in that bin, or in any bin below that bin.
Let’s look at our first daily steps histogram from above again:
This is what its cumulative distribution would look like:
In the left plot, we show the total number on the y-axis, and in the right plot, we’ve scaled the y-axis to be between 0 and 1. That way, we can easily read off the percentage of students included up to that point (0 being 0%, and 1 being 100%).
In both cases, the shape of the curve is the same though, and the cumulative curve allows us to see how many of our data points fall below a certain bin.
For example, we can quickly read off the right graph that about 50% of all students walk at most 2500 daily steps. From the left graph, we would instead read that about 250 of the students in our total sample walk at most 2500 daily steps.
Either way, cumulative graphs are really great for answering questions like:
- How many of our data points have a value of at most X, or
- How many of our data points lie above the value X?
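Both questions can be answered from the regular bin counts with a running total. Here’s a minimal sketch with NumPy, using simulated step data (the seed and the sample size of 500 are assumptions):

```python
import numpy as np

rng = np.random.default_rng(seed=5)
steps = rng.normal(2500, 500, size=500)

# Equal-width bins of 250 steps that cover every data point
width = 250
lo = int(np.floor(steps.min() / width)) * width
hi = int(np.ceil(steps.max() / width)) * width
edges = np.arange(lo, hi + width, width)

counts, _ = np.histogram(steps, bins=edges)
cumulative = np.cumsum(counts)      # absolute: data points in this bin or any bin below it
relative = cumulative / len(steps)  # relative: the same curve scaled to run from 0 to 1
above = len(steps) - cumulative     # data points that lie above each bin

for upper_edge, total, share, rest in zip(edges[1:], cumulative, relative, above):
    print(f"up to {upper_edge}: {total} students ({share:.0%}), {rest} above")
```

If you only need the plot, Matplotlib’s `plt.hist` also accepts `cumulative=True`, optionally together with `density=True` for the 0 to 1 scale.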
When to Use Histograms
Histograms are a great go-to graph if you’re trying to understand how data in one of your variables is distributed.
With histograms, you’re mainly trying to get an understanding about the characteristics of one variable.
- If you’re trying to compare values between two data sets, then scatter plots would be a better graph to go with. You can read more about scatter plots in my blog post here.
- In case you want to investigate one variable, but want to understand it across different categories, then box plots or bar graphs are great visualization methods to use. You can read more about box plots here.
Usually though, you’ll be using a mixture of all of the above visualization methods when investigating data.
You may want to start with looking at a histogram of each of the variables you want to investigate; this may allow you to get an understanding of how your data within each variable is distributed, and if there are any special features you can spot.
Then you can start comparing across different categories using bar or box plots. In case you only want to look at a few categories (at most around 5), you can also split your histogram into different colors like we did above in the bimodal distribution part of the How to Read Histograms section.
You’ll also make use of scatter plots to understand how two different variables behave together.
Through combining all of these different visualization techniques, you’ll get a good understanding of how your data behaves on the individual level, and then start comparing to see how your data behaves in context of the other data points you have.
How to Use Histograms
Let’s go back to our example about the daily steps taken by university students.
This distribution above shows the daily steps taken for a sample of university students. Since I generated this data, let’s assume also that this data was recorded daily over a period of one week, so that we have less random variation due to picking one specific day in the students’ schedule.
The histogram distribution gives us a very good idea of the base activity level of students, and allows us to understand the range of data we’re working with, as well as how much variation we can expect.
In this case, our data:
- Ranges from one to four thousand,
- Centers around two and a half thousand, and
- Going about one thousand in either direction will contain almost all of our data.
If you’re familiar with normal distributions, this distribution actually looks like it follows a normal distribution. This is a very common distribution to get, and it tells us that data is grouped around some standard value but can vary around it.
In our case, we can do some analysis and further investigation to understand why our data looks like this.
We find, for example, that no student falls below 1000 daily steps because all students have to go to class, and so even students who don’t leave their dorm for anything other than class still end up with some steps through their journey to class.
The two to three thousand daily steps that most students take may just be the step range required to leave their dorm room, go to class, grab a quick lunch, get home, and maybe meet up with some friends sometime during the day.
Our histogram has thus given us a numerical understanding of what kind of activity level is required based on the current lesson plan plus social life of our students.
If we were part of the health department at the university and we wanted to improve the base activity, we would now be able to make use of this new understanding to work towards a goal.
Let’s say we want to increase the average steps taken per student from around 2500 to around 4000… We talk to the administration office and just ask them to ensure that, in next semester’s schedule, students don’t have back-to-back classes in the same lecture room.
That way, we are subtly forcing them to get up and move between classrooms after each class.
I’m sure they’ll love the new initiative to walk more and be healthier… right?
If we sample our students again after this change, our new distribution may look like this:
With our new distribution, we see that the central value looks to have shifted up to around 3500 daily steps, whilst the rest of the shape remained roughly the same.
Of course, to do this properly, we’d need to keep a control group to make sure that it was the schedule change, not some other change, that caused this increase in steps. (I’ll go into this in more detail in a future blog post.)
However, as you can see, histograms can be a very useful way to look at all the data points in your data set.
You don’t just focus on one value like the mean, or a combination like mean and standard deviation, but you actually see the full shape.
A histogram allows you to understand important questions about how your data is distributed, such as:
- Does my data clump somewhere?
- If so, what region does it clump around?
- Why specifically does it clump around these values?
- What range of values does my data set cover?
- How many values are located towards the edges?
- How quickly does my distribution drop off compared to the central value, relative to the range of values the data set covers?
- Is my distribution skewed? If yes, in which direction?
And to each of these questions, you can then ask, “Okay, so what does that now tell me about my dataset?”
This way, you’re able to pick out a pattern or feature about your distribution, and then you can deep dive into the rest of your data to understand more precisely why this part of your distribution may look like that.
Working with Skewed Distributions
We’ve seen above what a skewed distribution looks like, and it’s common to have some sort of skew to your distribution.
Sometimes this skew is very minimal, and you won’t even notice it or need to really consider it, but other times, this skew can be pretty extreme and needs to be addressed if you want to use the data for further analysis or to feed into a model.
The skews that are most influential are the ones where the variable has no real upper limit.
For example, often when you look at annual salaries of a specific role or within a geographical location, your values will be clumped around a central region.
However, you’ll also have a very long right tail, since salaries can technically just always continue to go up indefinitely. This means, you’ll see the tail in the histogram stretch out to, for example, 10 times the value where most of your data clumps around.
You can see an example of this in the following plot:
When you have a skewed distribution, you may want to think of whether you can put a cut-off point somewhere, where every value above that cut-off value would be considered as that cut-off value instead.
For example, let’s say we have data on the annual salaries of everybody within a small, average town as well as the housing prices of all houses currently on the market within that same town. We’re trying to figure out which houses specifically each person can afford in this town.
The histogram above shows that our salary data is generally clumped around the $60,000 annual salary mark, and we know from our subject knowledge that houses in this cute, little town don’t ever list for more than $300,000.
If we want to understand how different types of people can afford houses, we may choose to do a cutoff at some salary to reduce the spread in our data.
So, for example, we can say that anything above $125,000 in salary, we’ll consider to be $125,000. Our reason for this is that anyone at or above that salary is basically in the same boat of being able to afford almost any place in this town if the houses in this town don’t ever list for more than $300,000.
This way, we can nicely reduce the length of our tail, and you can see the resulting distribution below.
After applying the cut-off, we can see that our tail is much shorter now, but we’ve also got another artificial peak at $125,000. This isn’t a big deal though since we know this is artificial and so we won’t make any extra conclusions about the sudden peak at $125,000.
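In code, applying a cut-off like this is a one-liner. The sketch below uses simulated salaries (the lognormal shape, its parameters, and the seed are assumptions chosen so that the bulk sits near $60,000); only the $125,000 cut-off comes from the example:

```python
import numpy as np

rng = np.random.default_rng(seed=6)
# Hypothetical right-skewed salaries with a median of about $60,000
salaries = rng.lognormal(mean=np.log(60_000), sigma=0.5, size=1000)

cutoff = 125_000
# Every salary above the cut-off is treated as the cut-off value itself
capped = np.minimum(salaries, cutoff)

print("largest salary before:", round(salaries.max()))
print("largest salary after: ", round(capped.max()))
print("salaries moved to the cut-off:", int((salaries > cutoff).sum()))
```

All the values that were above the cut-off now sit exactly at $125,000, which is what produces the artificial peak at the right edge of the histogram.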
When choosing a cut-off though, make sure that you have a proper reason for choosing that value. It doesn’t come down to the exact value, but the ballpark range is important, and these values that you choose depend entirely on subject knowledge for this particular set of data.
For example, in this instance, I could’ve just as easily chosen $120,000 or $130,000 and it probably wouldn’t make much of a difference. If I used something like $80,000 though, I may be making an assumption that isn’t correct, since some people (not all) may not feel comfortable buying very upper-end places yet.
Similarly, if I used $180,000 as the cut-off, I may have been too cautious if everyone at $140,000 already feels like everything is within their range.
Therefore, if you’re planning on using a cut-off based on seeing a skewed histogram distribution, make sure you have a more concrete reason for choosing that number that relates the cutoff value to the real world and/or your subject knowledge, rather than just choosing a random value.
How to Make a Histogram in Python
Creating histograms in Python is very straightforward, and as usual, all that we need is Matplotlib. In case you don’t have any data to visualize, you can quickly generate some using the NumPy library, like so.
import numpy as np
randomData = np.random.normal(10,2,1000)
Here, we’ve created 1000 random data points that follow a normal distribution with a mean of 10 and a standard deviation of 2. All of our data points are stored in a simple list-like format (in this case, a NumPy array).
To create a histogram from our data, we just do:
import matplotlib.pyplot as plt
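Spelled out in full, a minimal version might look like the sketch below. It assumes the `randomData` array from above (regenerated here, with a seed I’ve added, so the snippet runs on its own):

```python
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(0)  # seed added for reproducibility (assumption)
randomData = np.random.normal(10, 2, 1000)

# plt.hist returns the bin counts, the bin edges, and the drawn bars
counts, edges, patches = plt.hist(randomData)
# plt.show()  # uncomment to display the figure when running as a script
```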
That’s it! These few lines would give us the following graph.
Of course, there’s a lot of extra customization that we can do; for example, changing the colors of our bins, the colors of the bin edges, and the number of bins we have, adding labels, etc.
By default, the bin color and bin edge color are both the blue we see above, which can make it difficult to differentiate neighboring bins. Matplotlib also defaults to showing your histogram using 10 equally sized bins.
If we want to change the color of our bins to red, make our bin edges black, and use 20 equally sized bins instead of 10, this is how we do it:
import matplotlib.pyplot as plt
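A sketch of what that call could look like, using the `color`, `edgecolor`, and `bins` arguments of `plt.hist` (the data is regenerated here, with an added seed, so the snippet is self-contained):

```python
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(0)  # seed added for reproducibility (assumption)
randomData = np.random.normal(10, 2, 1000)

# Red bars with black outlines, split into 20 equally sized bins
counts, edges, patches = plt.hist(randomData, color="red", edgecolor="black", bins=20)
# plt.show()  # uncomment to display the figure when running as a script
```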
This histogram is already a lot more clear and we didn’t have to change our code much at all. Neat, right?
In case we want to use custom bins like [4, 5, 8, 10, 15, 16] then we can just replace the bins = 20 from above with these bins like so:
import matplotlib.pyplot as plt
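With a list of bin edges, the call might look like this sketch (same assumed data and styling as before):

```python
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(0)  # seed added for reproducibility (assumption)
randomData = np.random.normal(10, 2, 1000)

# Six edges define five bins of unequal width; data outside 4 to 16 is left out
custom_bins = [4, 5, 8, 10, 15, 16]
counts, edges, patches = plt.hist(
    randomData, color="red", edgecolor="black", bins=custom_bins
)
# plt.show()  # uncomment to display the figure when running as a script
```

Keep in mind that any data points below the first edge or above the last edge are simply not counted, so the bars may add up to fewer than 1000 points.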
Now you can see the bins start at the values you fed into Python. This histogram isn’t very informative though.
Personally, I like to use custom bins that start and end at specific values, but are all the same width, as that’s easier to draw conclusions from. You can easily do this using Python’s built-in range() function.
For example, if I want my bins to start at 5, go up to 16, and have a bin size of 1, I could just use range(5,17,1). This will create a range of values from 5 up to, but not including, 17, in steps of 1.
import matplotlib.pyplot as plt
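Putting range(5, 17, 1) to use as the bin edges could look like the following sketch (data and styling are the same assumptions as in the earlier snippets):

```python
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(0)  # seed added for reproducibility (assumption)
randomData = np.random.normal(10, 2, 1000)

# Edges at 5, 6, ..., 16 give eleven bins that are each 1 wide
counts, edges, patches = plt.hist(
    randomData, color="red", edgecolor="black", bins=range(5, 17, 1)
)
# plt.show()  # uncomment to display the figure when running as a script
```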
Although Python’s range() is nice, I prefer arange() from NumPy.
It uses the exact same format, but it can also take non-integer steps, which I can’t do with range(). For example, I can go in steps of 1.5 from 5 up to, but not including, 17 using np.arange(5, 17, 1.5).
import matplotlib.pyplot as plt
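Here’s a sketch with a non-integer step, using steps of 1.5 from 5 up to, but not including, 17 (data and styling are the same assumptions as in the earlier snippets):

```python
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(0)  # seed added for reproducibility (assumption)
randomData = np.random.normal(10, 2, 1000)

# Edges at 5, 6.5, 8, ..., 15.5 give seven bins that are each 1.5 wide
bin_edges = np.arange(5, 17, 1.5)
counts, edges, patches = plt.hist(
    randomData, color="red", edgecolor="black", bins=bin_edges
)
# plt.show()  # uncomment to display the figure when running as a script
```

Note that the last edge is 15.5, not 17, because arange() stops before the end value; if you want the bins to reach a specific endpoint, pick an end value slightly past it.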
As you can see, there’s a lot of customization you can do with very little code, and matplotlib doesn’t require you to do anything special to your data before you can create a histogram. It just takes the raw data and does it all for you.
Limitations of Histograms
Although histograms are great at giving you a detailed understanding of one of your variables, they’re not particularly good when you want to investigate or compare several variables at once.
Histograms are also not very useful when your data isn’t numerical, or you can’t transform your data into a numerical format. For example, if you’ve got data that’s made up of a bunch of categories, you’re better off using bar graphs than histograms.
Histograms also assume there’s some sort of order to the numbers you use. You need to have smaller and larger numbers, and it has to make sense to compare these numbers.
For example, if you’re looking at the distance of houses from the ocean, a lower number means the house is closer, and a higher number means the house is farther away. Additionally, all houses at a specific number (say 50km from the ocean) are farther away from the ocean than any house that has a smaller number (like 47km from the ocean).
So if you have numerical data where the order from high to low or low to high isn’t meaningful, or you’ve just transformed categories into numbers based on the order that they appear in, then a histogram won’t help you because the low to high order that a histogram creates for your data won’t make any sense.
So when using a histogram, make sure that the ordering of data makes sense, and use it mainly for understanding in more detail what one variable in your dataset looks like.
You can then do this for several variables, one at a time, and once you’ve gotten a better understanding of the variables you’re interested in, you can switch to other visualization methods like scatter plots or box plots to compare across the different variables.
And that’s it for our deep dive into histograms. I hope you enjoyed this article and have learned all that you could ever want to know about histograms!