Data science is becoming a hotter topic by the minute, and data scientists are more and more in demand at all sorts of companies. I personally like to think of data scientists as the watermelon of the fruit aisle in the summer. Everyone wants one – but there is a limit to how many there are. (I love watermelon and struggle every summer to find good watermelon at the store.)
The term “data scientist” is pretty vague though – what exactly does it mean, how do you become one, and what can you do with it?
What does a Data Scientist do?
Well, the answer is that you can do pretty much whatever you like, with whatever data you want, on anything that tickles your fancy.
The only real thing you need is data.
Be it finance, healthcare, sports, government, industry, research, entertainment, or software engineering – everyone and their mother is producing more and more data, because they’re being told it’s valuable – which is 100% true.
Data is so valuable – it tells you exactly who likes your product, when they’re buying it, how they’re buying it, how often they’re buying it. It also tells you what people want, what they love, what they need – what they desire at 1am on a Friday night (ice cream mostly, for me.)
Information is power, but data does not directly equal information, and that’s where a data scientist comes in.
A data scientist’s job is to turn simple, raw, and unprocessed data into an information gold mine.
Essentially, they take an ugly big pile of messy data and turn it into a shiny, polished conclusion that everyone can understand. Then they give recommended actions to take based on their conclusions, and this is where the real treasure lies.
So now that we’ve brushed upon what data scientists do and why they’re so cool (not biased opinion, at all) – let’s find out how to become one!
Here is my step-by-step guide on how to get started with data science and become the sought-after watermelon in the fruit aisle.
Step 1: Learn a Programming Language
The first step to becoming a data scientist is to learn a programming language.
There are many programming languages and it’s hard to know which one to choose, but let me give you a rundown of what you want in your language:
- SUPPORTIVE: You want to make sure there’s a big community behind it, which you can turn to for advice, like on Stack Exchange.
- POPULAR: You want lots of pre-written code (libraries) that you can integrate into your own code, like on GitHub. This way, for example, you don’t have to understand how to create a graph from scratch; you can just select the graph you want and feed in your data.
- EASY: You also want a language that’s easy to write in, so you don’t make little mistakes that then result in bugs you may spend hours trying to find. This means it’s very easy for you and others to review what you’ve done.
- FAST: You want to be able to write programs fast. You want to spend your time analyzing the data not writing code. The faster the programming language lets you create prototypes the better.
- POWERFUL: You want to have the option to do long and complex tasks that still run fast and that can be easily integrated into other platforms.
Considering these qualities, the most common programming languages used by data scientists are Python and R. Some other viable ones are JavaScript, C++, MATLAB, and SAS.
Personally, I use Python. It has a huge community, and its readability and user-friendliness make it very easy and fast to code in. It is still a very powerful language, though: it lets you do anything from data gathering, like web-scraping, to full-blown analysis and automatic reporting.
If you want to see why I love Python so much, you can read that article here. The whole package of ease, speed, power, and community is what makes Python so unique, and this is also why I recommend it over other languages, like R. Oh, and did I mention that it’s completely free?
Of course, if you feel more comfortable using another language, by all means do so. At the end of the day, whatever you’re fastest in and most comfortable with is what’s best for you!
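To give you a tiny taste of why Python is so quick to prototype in, here’s a minimal sketch that summarizes a week of (completely made-up) daily sales figures in just a few readable lines:

```python
# Made-up daily sales figures for one week
sales = [120, 95, 130, 87, 143, 160, 110]

# Python's built-ins already cover a lot of simple number crunching
average = sum(sales) / len(sales)
best_day = max(sales)

print(f"Average daily sales: {average:.1f}")  # Average daily sales: 120.7
print(f"Best day: {best_day}")                # Best day: 160
```

No setup, no boilerplate – you can go from “I have some numbers” to “I have an answer” in seconds, which is exactly the fast-prototyping quality described above.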
Now that you have the skills to make your computer crunch data, the next step, visualization, will set you up to do great analysis and create comprehensive reports.
Step 2: Make Graphics Your New Best Friends
Visualizations serve two purposes for a data scientist:
- They let you analyze data more easily
- They make it much easier to communicate what you’ve done with others
Visualizations play a very important role during your analysis because they let you literally see how your data behaves.
Humans have this amazing ability to detect patterns and see complex behaviour. Sometimes this lets you see things that are actually just random, such as seeing animals in the clouds, but it can also help break down complex topics.
You can use the patterns you see to direct your investigation. This means that sometimes you may follow a false lead, like the animals in the clouds, but often this takes you in the right direction.
The more you do this, the better you’ll get at differentiating true patterns (signal) from ones that are just produced through chance (noise).
The nice thing about all of this is that others can also see these patterns on visualizations, especially if you point them out to them.
So, as a data scientist, you’ll be creating the visualizations both to help guide your analysis as well as to visualize results. Once you’ve completed your analysis, if you have to create a report or presentation, you can then pick out the ones that actually say something valuable.
That way, when you’re in a board or investor meeting, you can pull up some pretty graphs, point to them, and say the graph nicely shows that people like eating watermelons. Then everyone listening to the presentation will nod their heads at each other and say “watermelons, who would’ve thought?”. And then everyone will stock more watermelons and we all (or I) win.
This example is, of course, overly simplified, but I’m sure you get the idea of how valuable visualizations are.
So, what are some cool graphs that you should know of? Being able to read and create the following graphs will cover you in almost all situations:
- Line graphs
- Scatter plots
- Bar graphs
- Histograms
- Pie charts
- Box-and-whisker plots
So how can you practice creating and reading these types of graphs? Matplotlib is an amazing library for Python, so I’d highly recommend learning to use it to start making visualizations.
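Here’s a minimal Matplotlib sketch, assuming you have the library installed (`pip install matplotlib`), that draws two of the chart types from the list above using made-up watermelon sales numbers:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line to pop up a window instead
import matplotlib.pyplot as plt

# Made-up weekly watermelon sales, in keeping with the theme
weeks = [1, 2, 3, 4, 5]
sales = [30, 45, 60, 80, 95]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Line graph: good for showing a trend over time
ax1.plot(weeks, sales, marker="o")
ax1.set_title("Sales trend")
ax1.set_xlabel("Week")
ax1.set_ylabel("Watermelons sold")

# Bar graph: good for comparing values side by side
ax2.bar(weeks, sales)
ax2.set_title("Sales per week")
ax2.set_xlabel("Week")

fig.savefig("sales.png")
```

Swapping `plot` for `scatter`, `hist`, `pie`, or `boxplot` gets you the rest of the list – the pattern of “feed in your data, pick the chart type” stays the same.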
A very good idea is to go out and read articles that have graphs in them. Read whatever you are interested in, and see what the authors have to say about the graphs, and how they use the graphs to support their points.
Sure, sometimes the authors may get it all wrong, but most of the time, just reading interpretations can help you understand their train of thought, and allow you to get a better insight into how graphs can be used to show info.
Step 3: Learn How to Analyze Data
A good thing to learn alongside creating and reading the above types of graphs is how to analyze data.
The only way to properly analyze data is to be able to filter, group, drop, aggregate, or manipulate it in other ways. Otherwise you won’t be able to correctly control and contextualize your analysis, or have the ability to zoom in when answering very specific questions.
This part of the analysis is done with the help of a program, and I’m not talking about some spreadsheet software here because, let’s face it, they get cranky after about 1000 rows and a couple columns of data. To do this properly you need to do your analysis using code.
Fortunately, Python also has an amazing library for data analysis, called Pandas, that you can just freely download and then use in Python. No biggie. You can probably start to see why I like Python so much.
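As a quick sketch of what those operations look like in practice, here’s Pandas filtering, grouping, and aggregating a small made-up purchase table (assuming Pandas is installed via `pip install pandas`):

```python
import pandas as pd

# Made-up purchase data to practice on
df = pd.DataFrame({
    "product":  ["watermelon", "apple", "watermelon", "banana", "watermelon", "apple"],
    "price":    [6.0, 1.2, 5.5, 0.8, 6.5, 1.0],
    "quantity": [1, 4, 2, 6, 1, 3],
})

# Filter: zoom in on just the watermelon purchases
melons = df[df["product"] == "watermelon"]

# Aggregate: compute revenue per row, then group and sum it per product
df["revenue"] = df["price"] * df["quantity"]
revenue = df.groupby("product")["revenue"].sum()

print(revenue)
```

The same one-liner style covers dropping columns (`df.drop`), sorting (`df.sort_values`), and most of the other manipulation you’ll need when contextualizing an analysis.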
There’s also an important second component to data analysis which is, and you may have guessed this already, the data scientist.
Just like when visualizing data, your job when analyzing data is to steer the analysis. You know what questions you want answers to and you know what tests you want to do. Your job is to learn to correctly contextualize and interpret data, as well as to recommend actions based on your findings.
Statistics can be tricky, and it will sometimes throw you off with results that look good but are actually just there by chance. Therefore it’s also important to be able to differentiate good results (signal) from results that come from “noise”, where jumping straight to conclusions can mislead you.
Having just mentioned “noise”, there’s something you should be aware of. I hate to break it to you, but not all data is clean. Shocker, I know.
Most of the time, data is pretty noisy.
What is noisy data? Noisy just means there’s randomness in the data. You’ll probably never get a perfectly straight line from measured data (if you see perfect lines, there’s a high chance that someone has forged some data); rather, you get some “pretty good” curves with your points lying above and below them. That’s normal, and it has to do with the accuracy with which we can measure things.
You should be wary, though: if your data is too noisy, it will be hard to draw meaningful conclusions. If your sample size is too small and/or your measurement accuracy is very low, it’ll be very hard to deduce anything significant. You may even end up with a false positive: seeing a pattern where there actually isn’t one.
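You can see this small-sample danger for yourself with a quick simulation (assuming NumPy is installed). We correlate pairs of pure random noise many times and count how often the correlation looks “strong” – with tiny samples, impressive-looking patterns show up by chance alone:

```python
import numpy as np

rng = np.random.default_rng(42)

def strong_corr_rate(sample_size, trials=1000):
    """Correlate pairs of pure noise and return how often |r| > 0.8 by chance."""
    strong = 0
    for _ in range(trials):
        x = rng.normal(size=sample_size)
        y = rng.normal(size=sample_size)
        r = np.corrcoef(x, y)[0, 1]
        if abs(r) > 0.8:
            strong += 1
    return strong / trials

# Tiny samples: "strong" correlations appear by chance fairly often
print(strong_corr_rate(5))
# Larger samples: essentially never
print(strong_corr_rate(100))
```

With only 5 points, roughly one trial in ten produces a correlation that would look convincing on a scatter plot, even though both variables are pure noise. That’s the false positive in action.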
Step 4: Learn How To Interact With A Database
As a data scientist you’re going to need data. That data is usually stored in a database. Therefore, you’ll need to learn how to interact with a database.
The most common database type you’ll encounter is an SQL-based database. There are many different databases based on SQL, such as PostgreSQL, BigQuery, or MySQL.
These are all very popular databases and you’ll probably encounter one (or several) of them pretty early on in your data science career.
Getting your data out isn’t the only perk, though: knowing how to interact with databases also lets you run part (or all) of your analysis directly in your queries.
Doing your processing, formatting, or even part of your analysis inside the query, rather than pulling out the raw data and doing everything afterwards, can speed things up a lot.
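Here’s a minimal sketch of that idea using Python’s built-in `sqlite3` module and a toy in-memory table. With PostgreSQL, MySQL, or BigQuery you’d connect to a real server instead, but the SQL itself looks almost identical:

```python
import sqlite3

# A toy in-memory database standing in for a real one
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (product TEXT, price REAL, quantity INTEGER)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [("watermelon", 6.0, 1), ("apple", 1.2, 4), ("watermelon", 5.5, 2)],
)

# Push part of the analysis into the query itself:
# aggregate revenue per product instead of pulling out raw rows
rows = conn.execute(
    """
    SELECT product, SUM(price * quantity) AS revenue
    FROM purchases
    GROUP BY product
    ORDER BY revenue DESC
    """
).fetchall()

for product, revenue in rows:
    print(product, revenue)
```

The database does the grouping and summing for you, so only the small, already-aggregated result crosses the wire – that’s the speed-up described above.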
Step 5: Learn How To Gather Your Own Data
Whether you want to start your own project or you’re working with a company that has a huge database, odds are you’ll still need some extra data. This is why it’s so important to be able to gather your own data from anywhere online.
You can gather this data either through using APIs or through web-scraping. To decide on which to use, the first thing you’ll want to check is ‘Where are the resources/websites that I can get this data from’?
Then you want to check if any of those have an API. If they do, I’d almost always recommend going with the API over a web-scrape.
Not sure what an API is? No worries. Check out our beginner’s guide to APIs here!
APIs are a lot easier to deal with because they provide all the services for you. Whoever set up the API works on maintaining the data, they also do a lot of pre-processing and cleaning for you, and when you get data over an API, it usually arrives already well formatted. This means there’s a lot less work for you, because someone else has already done it, which is always awesome.
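To show what “already well formatted” means in practice, here’s a sketch using a canned JSON string standing in for what a hypothetical weather API might return (in real usage you’d fetch it over the network with `urllib.request` or the `requests` library):

```python
import json

# Canned response text standing in for a (hypothetical) weather API's reply;
# the structure and field names here are made up for illustration
response_text = (
    '{"city": "Berlin", '
    '"readings": [{"hour": 1, "temp_c": 14.2}, {"hour": 2, "temp_c": 13.8}]}'
)

# API data arrives as structured JSON, so one call turns it into Python objects
data = json.loads(response_text)

temps = [reading["temp_c"] for reading in data["readings"]]
print(data["city"], sum(temps) / len(temps))
```

Compare that to scraping, where you’d first have to dig the same numbers out of raw HTML yourself – the API has done the formatting work for you.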
Although there are a great many APIs out there (some free, some not), there’s still a bunch of information that’s available on a website but not over an API. In these cases your only option is to scrape the website.
You’ll encounter two types of ways that data is presented through a website:
- Static
- Dynamic
Static data is when the data is directly written into the page that you’re seeing, as is the case for the text written in this blog post.
Dynamic data is when the data is visible as the result of some script that’s running. This is often the case when you see live tickers or other values that are changing or updating.
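For static pages, scraping boils down to pulling values out of the HTML. Here’s a minimal sketch using only Python’s built-in `html.parser`, with a canned page standing in for one you’d download with `urllib` or `requests` (in practice, many people reach for the third-party BeautifulSoup library, which makes this much more convenient):

```python
from html.parser import HTMLParser

# A canned static page standing in for a real site; the class name
# "price" is made up for this example
PAGE = """
<html><body>
  <ul>
    <li class="price">6.00</li>
    <li class="price">5.50</li>
  </ul>
</body></html>
"""

class PriceScraper(HTMLParser):
    """Collects the text of every <li class="price"> element as a float."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        if tag == "li" and ("class", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price and data.strip():
            self.prices.append(float(data.strip()))
            self.in_price = False

scraper = PriceScraper()
scraper.feed(PAGE)
print(scraper.prices)
```

Dynamic pages are harder: the data only exists after a script runs, so you typically need a tool that drives a real browser (such as Selenium) rather than a plain HTML parser.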
And drum roll…
Step 6: Build Your Confidence
Finally, the last step and the ultimate secret sauce to getting started with data science is confidence. You need to build confidence in your work and abilities. You need to be so sure of your work that you would scream it from the rooftops at random strangers.
You’ve gone out and done the work and it’s time to reap the rewards. Own it! Feel good and confident about your skills as a data scientist and always look to improve.
There are many other Data Science techniques, methods and intricacies out there for you to learn about, but at this point, you already know so much more than most others trying to get into this field.
**Tip**
A great website that you can use to learn about data science is Kaggle. It’s a website that has a bunch of data sets that you, and other data scientists, can try analyzing. You can also share your results as well as read what other people have done; that way you can get feedback on your own analyses and also learn how others approached a similar problem.
I highly recommend Kaggle as a resource for all data scientists, whether beginner or advanced.
After some practice, it will be time to start applying.
Pick out the positions that are most interesting to you, and take all the interviews you can get. Every interview will give you practice and show you what things you feel comfortable talking about, and what things you should research a little more.
After that, it’s all about learning on the job. Every company and every position is going to expect different things from you and rather than trying to learn everything beforehand, this is often stuff you pick up on the job.