Skip to what you’re interested in reading:
- What are Scatter Plots?
- When to Use Scatter Plots
- How to Create Scatter Plots in Python
- What to Use Scatter Plots For: 3 Applications of Scatter Plots
- 1. Identifying Clusters in Scatter Plots
- 2. Identifying Correlations in Scatter Plots
- 3. Using Higher Dimensional Scatter Graphs
- Limitations of Scatter Plots
There is a very logical reason behind why data visualization is becoming so trendy.
As we enter the era of big data and the endless output and storing of exabytes (1 exabyte aka 1 quintillion bytes aka a whole, whole lot) of data, being able to make data easy to understand for others is a real talent.
Humans are visual creatures and thus, making data easy often means making data visual.
And so in this new series on data visualization, we’re focusing on one of the most common graphs that you can encounter: scatter plots.
In this post, we’ll take a deeper look into scatter plots, what they’re used for, what they can tell you, as well as some of their downfalls.
So, What are Scatter Plots?
Simply put, scatter plots are graphs where you plot each data point (consisting of a “y” value and an “x” value) individually.
The following plot shows a simple example of what this can look like:
You can see your data in its rawest format, which can allow you to pick out overarching patterns. They can be used for analyzing small as well as large data sets, which makes them a great go-to method for visual data analysis.
When to Use Scatter Plots
There are many different ways we can modify our scatter plots, but all of this still boils down to when we should use them in the first place.
Scatter plots are a great go-to plot when you want to compare different variables. All you need to do is pick two of your variables that you want to compare and off you go.
Scatter plots are great for comparisons between variables because they are a very easy way to spot potential trends and patterns in your data, such as clusters and correlations, which we’ll talk about in just a second.
However, if you’re more interested in understanding how a single variable behaves, you’re better suited to go with plots like histograms, box plots, or pie charts, depending on what you want to see.
So, in a nutshell, scatter plots are best used for:
- Depicting the relationship between two numerical variables
- Allowing us to see the grand scheme aka “big picture” pattern of a specific set of data
How to Create Scatter Plots in Python
The easiest way to create a scatter plot in Python is to use Matplotlib, which is a programming library specifically designed for data visualization in Python.
If you’re not sure what programming libraries are or want to read more about the 15 best libraries to know for Data Science and Machine learning in Python, you can read all about them here.
But long story short: Matplotlib makes creating a scatter plot in Python very simple.
All you have to do is copy in the following Python code:
import matplotlib.pyplot as plt

plt.scatter(xData, yData)
plt.show()
In this code, your “xData” and “yData” are just a list of the x and y coordinates of your data points.
Tip: if you don’t have any data on hand that you want to plot, but still want to try this code out for fun, you can just generate some random data using numpy like this:
import numpy as np
xData = np.random.rand(500)
yData = np.random.rand(500)
In addition to making graphs so easy to create, Matplotlib also allows for a ton of cool, fancy customizations.
Let’s say we want to compare two sets of data, and we want to have them be different symbols and colors to easily let us differentiate between them.
In Matplotlib, all you have to do to change the colors of your points is this:
import matplotlib.pyplot as plt

# assuming two data sets: (xData1, yData1) and (xData2, yData2)
plt.scatter(xData1, yData1, c="blue", marker="o")
plt.scatter(xData2, yData2, c="red", marker="x")
plt.show()
Or… let’s kick things up a notch.
How about creating something that looks like this fancy scatter plot, where we scale the points based on how many values there are at that point and change the color based on the distance to the origin?
To do that, we’ll just quickly create some random data for this:
import numpy as np
xData = np.random.binomial(5, .5, 2000)
yData = np.random.binomial(5, .5, 2000)
Then we’ll create a new variable that contains the pair of x-y points, find the number of unique points we are going to plot and the number of times each of those points showed up in our data.
We then also calculate the distance from the origin for each pair of points to use for scaling the color.
xyCoords = np.column_stack((xData, yData))
uniquePoints, counts = np.unique(xyCoords, return_counts=True, axis=0)
dists = np.sqrt(np.power(uniquePoints[:,0], 2) + np.power(uniquePoints[:,1], 2))
Now that we have our data prepared, all we have to do is:
import matplotlib.pyplot as plt

# size each point by how often it occurred, color it by its distance to the origin
# (the factor of 5 is just a scaling choice to make the sizes easy to see)
plt.scatter(uniquePoints[:,0], uniquePoints[:,1], s=counts*5, c=dists)
plt.title("Colored and sized scatter plot", fontsize=20)
plt.show()
And ta-dah! We get this impressive lookin’ and fancy scatter plot.
Of course, plotting a random distribution of numbers is more for showing what can be done, rather than for being practical.
So let’s take a real look at how scatter plots can be used.
What to Use Scatter Plots For: 3 Applications of Scatter Plots
So now that we know what scatter plots are, when to use them and how to create them in Python, let’s take a look at some examples of what scatter plots can be used for.
1. Identifying Clusters in Scatter Plots
What are Clusters?
A cluster is a grouping of data within your dataset.
Clusters can take on many shapes and sizes, but an easy example of a cluster can be visualized like this.
Just kidding. Even though that’s a more fun way to think about clusters, this is what a cluster normally looks like in graph form rather than comic form:
This cluster is centered around 0 and stretches to about +/- 2 in every direction.
Clusters can be very important because they can point out possible groupings in your data. Not all clusters are just straight up blobs like we see above, clusters can come in all sorts of shapes and sizes, and it’s important to be able to recognize them since they can hold a lot of valuable information.
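To make this concrete, here’s a quick sketch of how you might generate a blob-shaped cluster like the one above with numpy — the data is made up purely for illustration:

```python
import numpy as np

# 500 made-up points drawn from a standard normal distribution:
# centered at 0, with most points landing within about +/- 2
np.random.seed(0)  # fixed seed so the example is reproducible
cluster = np.random.randn(500, 2)

xData, yData = cluster[:, 0], cluster[:, 1]
# plotting it works the same as before: plt.scatter(xData, yData)
```

Plotting these two columns against each other gives you exactly the kind of dense, circular blob shown above.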
How to Identify Clusters
There are many approaches that you can take to identify clusters, but they can be simplified to be either:
- Using a visualization or
- Using an algorithm.
We won’t get into the algorithms here, but I’ll provide a simple overview. Clustering algorithms basically look for groups of related data points that are closer together, while separating out different, or distant, data points.
These algorithms use a series of mathematical techniques to find general rules that can be used on any data set, and hence, become pretty intricate, which is why we won’t go into any more detail on them. There’s a whole field of unsupervised machine learning dedicated to this though, called clustering, if you’re interested.
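For the curious, here’s a very stripped-down sketch of one such algorithm, k-means (in practice you’d reach for a library like scikit-learn rather than writing it yourself; the blob positions and starting centers here are made up for illustration):

```python
import numpy as np

def kmeans(points, k, starts, iters=20):
    """Bare-bones k-means sketch: assign each point to its nearest
    center, then move each center to the mean of its assigned points."""
    centers = points[starts].astype(float)
    for _ in range(iters):
        # distance from every point to every center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return centers, labels

# two made-up, well-separated blobs
rng = np.random.default_rng(42)
data = np.vstack([rng.normal(0, 1, (100, 2)),    # blob around (0, 0)
                  rng.normal(10, 1, (100, 2))])  # blob around (10, 10)

# start with one point from each blob so this simple sketch converges reliably
centers, labels = kmeans(data, 2, starts=[0, 100])
```

Run on this made-up data, the two returned centers land near (0, 0) and (10, 10) — the algorithm recovers the two groupings without ever being told where they are.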
With visualizations, this task falls onto you; so to better understand how to identify clusters using visualization, let’s take a look at this through an example that I made up using some random data that I generated.
Imagine you’re analyzing monthly spending habits from your close friend group (let’s pretend we have this many friends), and you have a hunch that monthly spending and monthly income are related, so you plot them on a graph together and get a little something that looks like this.
A bit of an unfortunate disclaimer in the interest of transparency: nothing is ever this obvious in real-world data, because again, I’ve just made up this data.
But just for the sake of this example, let’s assume for now that this is what we see. You notice that your hunch is confirmed: monthly income and monthly spending are related, and in fact, they’re correlated (more to come on correlation later).
However, you also notice something else interesting: within this upward trend, there seem to be two groups.
Both groups look like they spend more the more they earn; however, in one group, spending increases much faster and already starts off higher.
Some of them even spend more than they earn.
For clarity, you could probably draw a line between your data to separate the two clusters in your mind, and this line could look something like this.
What we see here is an example of two clusters, but these clusters are not simply circular like our example above, but rather, are more rectangle-shaped.
If we color coded the two different clusters, they would look like this.
Now after doing some investigation and by looking into the properties of the data points in each cluster, you notice that the property that best lets you split up these clusters is…
... whether or not the person owns a credit card.
So if we add a legend to our graphs, it would look like this.
What we got from here is a property that helps us separate our data into different groups, in this case, two groups, which provides valuable information about spending behavior.
Now you may be asking, “Okay, Max. But can’t I just split up the data by every single property available to me?”
You could, but a lot of them would not provide you with any valuable information.
For example, let’s say you try to split up the above graph into three groups, aged 18-29, 30-64, and 65+, and you visualized these three groups. Your plot could look like this.
This doesn’t provide you with any extra information.
Clustering isn’t just about separating everything out based on all the different properties you can think of.
In this case, owning or not owning a credit card helped us separate the groupings, but it also doesn’t have to be just one property.
You could also have groupings, or clusters, made out of multiple conditions like:
- Is the person aged between 20-30?
- Do they live near a Starbucks?
- Do they love coffee?
My spending habits would probably definitely be positively correlated to these three factors.
Another important thing to add is that clusters don’t always have to be separated like what we saw just now. You could also have a cluster “hidden” (very mysterious) within your data that won’t become apparent until you visualize some of the properties.
For example, if we visualize the people that are working two jobs, we could see something like the following:
You’ll notice we have a separate grouping inside of our top cluster of people that own credit cards.
It seems like people with more than one job that have credit cards still spend less, probably because they’re so busy working that they don’t have a lot of free time to go out shopping.
This is a smaller cluster within our larger cluster – a sub-cluster, if you will.
Although this cluster doesn’t have many data points and you could even make the argument of not calling it a cluster because it’s too sparse, it’s important to keep in mind that it’s definitely possible to find smaller clusters within a larger cluster.
It’s also important to keep in mind that when you’re visualizing data, you often have many different data sets that you can choose to plot and you often have more than 2 dimensions that you can plot, so you may see clusters along some regions and not along others.
For example, if we instead plotted monthly income versus the distance of your friend’s house from the ocean, we could’ve gotten a graph like this, which doesn’t provide a lot of value.
How to Use Clusters
Why is this important?
Well, let’s say you’re working for a coffee company and your job is to make sure your marketing campaign is seen by the people most likely to buy your product. With this information, you can now advise your team to target individuals who own a credit card and live close to a Starbucks, because they tend to spend more money.
Alternatively, if you are the founder of a personal finance app that helps individuals spend less money, you could advise your users to ditch their credit cards or stash them at the bottom of their closet, and that they should withdraw all the money they need for a month, so that they don’t go on needless shopping sprees and are more aware of the money they’re spending.
(And that maybe they shouldn’t drop by their local coffee shop so often.)
So, clustering is one way to draw meaningful conclusions out of your data.
Clusters can have different properties; they could be thin and long, small and circular, or anything in between.
You can even have clusters within clusters. If you think something could cause a grouping, try color coding your data like we did above to see if the data points are closely grouped.
When looking for clusters, don’t be too quick to discard any patterns you see. Investigate them, and you could find something very useful hidden in your data.
2. Identifying Correlations in Scatter Plots
What are Correlations?
Correlations are revealed when one variable is related to the other in some form, and a change in one will affect the other.
Here’s an example of correlated data:
The above graph shows two curves, a yellow and a red. We can also see that as we move to the right along the x-axis, both curves correspondingly change in their y-values.
In fact, if we extended the graph to be a little bit larger, you would probably be able to guess what the curve would look like and what the “y” values would be just based on what you see here.
This is what you would expect from correlated data — that one value reacts in a predictable way if the other value changes.
Now in the above example, we see two forms of correlation; one is linear, which is the yellow line, and the other is quadratic, which is the red line.
Although a linear correlation is the easiest to test for, it’s very important to keep in mind that correlations can exist in many different ways, as you can see here:
In this graph, we can see a:
- Linear correlation
- Polynomial (quadratic, in this case) correlation
- Exponential correlation
- Logarithmic correlation
We can see that each of the lines has a different relationship between the two axes, but they’re still correlated to one another. When one changes, the other changes appropriately.
How to Identify Correlations
Just like with clusters, you can look for correlations using an algorithm, like calculating the correlation coefficient, as well as through visual analysis. It’s usually a good idea to do both.
Let’s understand what the correlation coefficient is first.
What is the Correlation Coefficient?
The correlation coefficient comes from statistics and is a value that measures the strength of a linear correlation. In other words, it is how reliably a change in one variable linearly affects the other variable.
The correlation strength is focused on assessing how much noise, or apparent randomness, there is between two variables.
When talking about a correlation coefficient, what’s usually meant is the Pearson correlation coefficient. Pearson’s correlation coefficient is shorthanded as “r”, and indicates the strength of the correlation.
This is not to be confused with the r², or R², value, which measures how much of the data’s variance is explained by the correlation. The “r” here is the “r” from Pearson’s correlation coefficient, so these two values are directly related.
The correlation coefficient, “r”, can be any value between -1 and 1, where -1 or 1 means perfectly correlated, and 0 means no correlation.
The -1 just means that when one variable goes up, the other goes down, whereas the +1 means that when one goes up, so does the other.
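As a quick sketch of what this looks like in code — numpy’s corrcoef computes Pearson’s r, and the data here is made up so the relationships are perfect:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = 2 * x + 1     # when x goes up, y goes up -> r = +1
y_down = -2 * x + 1  # when x goes up, y goes down -> r = -1

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r
r_up = np.corrcoef(x, y_up)[0, 1]
r_down = np.corrcoef(x, y_down)[0, 1]
```

Since both made-up relationships are perfectly linear, r_up comes out as 1 and r_down as -1.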
Take a look at these 4 graphs to see the correlations visually:
These graphs should give you a better understanding of what the different correlation values look like.
All of the above examples were for values between 0 and 1, but the values can also be negative, which just indicates a negative correlation (one goes up, the other down), and looks like this.
Or in comic form…
How to Visually Identify Correlations
Unfortunately, the correlation coefficient is only defined for linear correlations, but as we saw above, we can also have non-linear correlations.
A perfect quadratic correlation, for example, could have a correlation coefficient, “r”, of 0.
In this case, our data goes down before 0 and then symmetrically back up after. So it’s definitely not enough to just calculate a correlation coefficient for your variables and call it a day because you can only use the correlation coefficient to test for linear correlations.
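You can check this yourself with a small sketch: a perfectly symmetric quadratic relationship produces a Pearson’s r of (numerically) zero, even though y is fully determined by x:

```python
import numpy as np

x = np.linspace(-1, 1, 101)  # evenly spaced, symmetric around 0
y = x ** 2                   # a perfect quadratic relationship

r = np.corrcoef(x, y)[0, 1]
# r comes out as (numerically) 0: the linear test sees no correlation at all
```

This is exactly why a linear correlation test alone can completely miss a perfectly clean non-linear relationship.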
It’s always a good idea to visualize parts of your data to see if you can spot other types of correlations that your linear tests may not find.
Here are some examples of what perfect, good, and poor versions of quadratic and exponential correlations look like.
When looking at correlations and thinking of correlation strengths, remember that correlation strength focuses on how close you come to a perfect correlation.
Don’t confuse a quadratic correlation as being better than a linear one, simply because it goes up faster.
A good correlation is one that looks very clean, where the data points all lie very close to what you would imagine the perfect curve to look like. And as we’ve seen above, a curve can be a perfect quadratic correlation and a non-existent linear correlation at the same time, so don’t limit yourself to looking for only linear correlations when investigating your data.
What do correlations mean? How do you use/make use of correlations?
Even if you find a correlation between two variables, you should always be skeptical at first. It’s not uncommon for two variables to seem correlated based on how the data looks, yet end up not being related at all.
You can easily get results like this if you have 100 different variables, and you test how correlated each is to one another. This will give you almost 5,000 unique correlation values, and just out of pure randomness, you’ll probably find some correlation somewhere.
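A quick sketch makes this tangible: 100 completely unrelated random variables still produce some sizeable-looking correlations among their 4,950 unique pairs (all the numbers here are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 50))  # 100 unrelated variables, 50 observations each

r = np.corrcoef(data)                  # 100 x 100 matrix of pairwise correlations
pairs = r[np.triu_indices(100, k=1)]   # the 4950 unique off-diagonal pairs

strongest = np.abs(pairs).max()
# even with purely random data, the strongest pair looks "correlated"
```

With this many pairs, you will essentially always find at least one pair whose |r| looks respectable, despite there being no real relationship anywhere in the data.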
Sometimes, we also make mistakes when looking at data. Our brain is excellent at recognizing patterns, and sometimes, it sees things that aren’t actually there (like animal shapes in clouds), so it’s important to confirm what you think you’ve found.
Although there are many thorough tests that you can run to see how well the correlation you found holds up, like separating out part of your data for validating and another part for testing, or looking at how well this holds true for new data, the first approach you should always take is much simpler.
The first thing you should always ask yourself after you find a correlation is, “Does this make sense?” This may seem obvious, but it’s something that’s very often forgotten.
Your data is not just a set of random numbers — there’s meaning attached to each variable that you have. So when you find a correlation between the amount of cloud cover and the amount of rainfall, ask yourself: does this make sense?
“The more rainfall there is, the more cloud cover is seen” makes sense, because you can’t have rain without clouds. Similarly, “the more cloud cover there is, the more rainfall there is” also makes sense.
This is called causation, and rainfall and cloud cover are causally related.
However, not everything is causally related, and just because you have a correlation does not mean they are causally related. You’ve probably heard this in short as correlation does not equal causation, the holy grail of data science.
Although we’ve just flipped our two variables around and the causation relation still makes sense, it’s common that a causal relationship does not hold both ways.
Using the cloud example above, if I told you that it rained a lot this week, you can also safely assume that there were a lot of clouds. Similarly, if I told you that there were a lot of clouds this week, you may assume that it probably rained at some point, but you would not be as confident about this.
However, if I told you that it didn’t rain this week, you probably couldn’t make a confident guess as to whether or not the weather was sunny, cloudy, or snowy. That’s because the causal relation does not hold up here.
So how do you know if the correlation you found is true or not? This can be a very hard task, but your best approach would be to first use your subject knowledge on whatever it is that you have data on.
If you don’t know much about the field you have data on, ask someone who does know. If you can’t find someone or they’re unsure, then it’s time to do some research by yourself to understand the field better.
Once you’ve confirmed from a subject matter perspective that the correlation could also be a causal relation, it’s usually a good idea to run some extra tests on either new data or data that you withheld during your analysis, and see if the correlation still holds true.
Make sure your data set is large enough that it’s unlikely that you found it by chance in both cases.
If the tests turn out well then you can be confident enough to say that there is a causal relationship between the two variables.
So what does this mean in practice? Well, let’s say you found a causal relationship between the number of newspapers you place an advertisement in and the number of orders you get.
If you’re preparing for a new campaign and you’re tight on budget, you can use this knowledge to balance the amount of your product that you’re stocking versus the amount that you’re spending on advertising.
Identifying the correlation between these two and applying it means you have enough merchandise in stock to meet demand after your advertisements go into the papers, without having too much stock left over.
There are many other ways that you can apply causal correlations; the result that you get from a correlation allows you to predict, with some confidence, the result of something that you plan to do.
3. Using Higher Dimensional Scatter Graphs
Sometimes, if you’re dealing with more variables, a two-variable scatter plot won’t provide you with the full picture. In this case, a 3-Dimensional scatter plot can help you out.
An example of this can be seen here:
The data that we see here is the same data that we saw above from a 2D point of view. Sometimes viewing things in 3D can make things even more clear than looking at them in 2D, because we can see more of a pattern.
For example, in the image above, not only does the red curve go up, but it also comes forward a little bit towards us. This is something that we would’ve missed when looking at just one 2D plot, and we would’ve had to create several different 2D plots and look at the data from different perspectives to be able to see this.
If you have a ton of data though, looking at 3D plots can become very messy, so you can keep them available as an option, but if things get too full or confusing, it’s perfectly fine to go back to our good ol’ 2D graphs.
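If you want to try a 3D scatter plot yourself, here’s a minimal Matplotlib sketch using random made-up data (the "3d" projection shown here needs a reasonably recent Matplotlib version):

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# made-up data purely for illustration
rng = np.random.default_rng(0)
xData, yData, zData = rng.random(200), rng.random(200), rng.random(200)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")  # a 3D set of axes instead of the usual 2D
ax.scatter(xData, yData, zData)
plt.show()
```

Most 3D axes can also be rotated interactively in a notebook or window, which is how you’d spot depth effects like the curve “coming forward” described above.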
Limitations of Scatter Plots
Alright. Now that we’ve talked about the incredible benefits of scatter plots and all that they can help us achieve and understand, let’s also be fair and talk about some of their limitations.
For one, scatter plots plot each data point at the exact position where it should be, so you have to take care to identify data points that are stacked on top of each other. Otherwise, if we’re very zoomed out from the data or if we have identical data points, multiple data points could appear as just one.
This causes issues for both visual clustering as well as correlation identification.
- Visual clustering, because we wouldn’t identify distinct but very closely-packed data points as separate, and therefore may not see them as a very dense cluster.
- Correlation, because we may have a concentration of related data points within something that seems otherwise randomly distributed.
Here we can see what the blob of data we plotted above in the “What are clusters” section looks like zoomed out.
You’ll notice it’s extremely difficult to see that this is a cluster. Now, of course, in this situation you can just zoom in and take a look. But what if I had more of these small clusters?
Take a look at this graph:
You may assume that there are about 100 individual data points here, when in actuality, there are about 100 different clusters! I just took the blob from above, copied it about 100 times, and moved it to random spots on our graph.
Although this example is a bit extreme, it’s important to be aware that these things could happen. Therefore, take note of the scale sizes in your data, and also think about how to visualize stacked data points (like we did in the “How to create scatter plots in Python” section).
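One simple workaround sketch for the stacking problem, besides the count-based sizing we used earlier, is to plot with transparency so that stacked points show up darker (the data below is made up, with rounding used to force lots of identical points):

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
# rounding to one decimal forces many identical, stacked data points
xData = np.round(rng.normal(size=2000), 1)
yData = np.round(rng.normal(size=2000), 1)

fig, ax = plt.subplots()
ax.scatter(xData, yData, alpha=0.2)  # overlapping points stack into darker spots
plt.show()
```

With a low alpha, a single point is barely visible while ten stacked points appear nearly solid, so repetition becomes visible again.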
For correlations, this inability to sometimes resolve different data points can really hurt us.
Take a look at this graph:
Thinking back to our correlation section, this looks like a pretty uncorrelated data distribution if you ever saw one.
Well, although on the surface things may look random, it could be that many more data points are concentrated near a line running through the data, and a correlation test would tell you that there is a correlation, even if you can’t visually see it.
Therefore, it’s important to remember that scatterplots have resolution issues. They do a great job of showing us how our data is distributed, but a poor job of showing us data repetition.
Now that you know what scatter plots are, how to create them in Python, how to use scatter plots in practice, as well as what limitations to be aware of, I hope you feel more confident about how to use them in your analysis!
Want more free help on getting started with data science?
If becoming a data scientist sounds like something you’d like to do, and you’d like to learn more about how you can get started, check out my free “How To Get Started As A Data Scientist” Workshop.
We go through everything we’ve covered in this blog post in more detail, dispel some common misconceptions, and give you a roadmap and checklist of what you need to do to get started working as a Data Scientist.