In this blog post, let’s talk web-scraping: what it is, what you can scrape, and a step-by-step walkthrough of how to set up your own web-scraping project!
What is web-scraping?
Web-scraping is the extraction of data from websites. (Vague, right? Don’t worry – we’ll get into details in a bit.)
The internet is your oyster when it comes to web scraping. Almost every website you can find online displays data that you could scrape – just check a site’s terms of service and robots.txt before you do.
What can you scrape?
IMDB movie rankings, search engine results for SEO, events, email addresses, social media trends, Netflix movie titles, government statistics, stock market data, job listings, dating profiles, apartment listings, Reddit posts, luxury vacation deals, and more. The possibilities are endless.
I’m just here to teach you the simple steps – you’re the one that’s going to get crazy creative and gather all the data you can with it.
So let’s get you there with these simple steps:
How to Set Up a Web Scrape
Part 1: General Idea – What do you want to do?
Answer the following questions first:
- What do I want to know?
- What information can give me the answer?
As a first step, it’s important to clearly identify the problem you’re looking to solve through web-scraping.
Problem: I love tropical fruits. But they’re so crazy expensive.
Web-scraping solution: I keep track of the prices of mangoes, papayas, and dragonfruit so I can find out when – and, if you look at multiple websites, where – they’re the cheapest.
If you personally don’t have a problem, think about a problem others might be having that you could solve (maybe your friend Jim is bad with the ladies and needs to filter through dating profiles faster).
Okay. Good. Now you have your problem and your solution – let’s build the path from that problem to the solution!
Part 2: Identify the websites where your data is
Now that you roughly know what you want to solve and how you’re going to solve it, you’ll need to find the source of the data: aka the website where your data is.
>> Find a good source that has quite a lot of information, and specifically the key information you’re looking for. <<
My love of tropical fruits and my quest to find the cheapest options mean I need a website that lists both the prices and my favorite tropical fruits (mangoes and papayas, in case you were wondering).
For your friend Jim’s great quest for love, you would need a website that has a whole load of dating profiles and specific criteria he could filter for (hair color, interests, location, occupation).
Part 3: Get Your Data – Parsing the data from the HTML
Now we’re really getting into web-scraping. Once you have your problem and your website, the key to getting started is to go into the HTML of the website and find your data.
What is HTML? HTML is the language that defines the structure for the content of a website. Every website has all its content structure written in its HTML code.
You can find the HTML code of every website by (I’m so accommodating):
- Chrome: Customize → More Tools → Developer Tools
- Safari: Right Click + Show Page Source
- Firefox: Press Alt + Tools → Web Developer → Page Source
- Microsoft Edge: More Icon → Developer Tools
- Internet Explorer: Press Alt + View → Source
Go ahead and give that a try to find the HTML code of any website you fancy. Done? Awesome.
Now that we see all the HTML code, we’re probably a little, tiny bit overwhelmed. Because, seriously, there is just so much more stuff on a website than you’d expect – just open up Netflix’s page source and see for yourself.
What is parsing? The point of parsing is essentially to pick out the really good stuff – the useful, valuable data – and leave behind all of the unnecessary, boring rest.
For our love of tropical fruit, we would specifically need to parse out the prices of the fruit and the names of the fruit associated with those prices.
For the love life of Jim, we would need to parse out the names and contact information of girls who match his dream description (brunette, loves puzzles, pizza, and cats).
Parsing data can consist of a lot of different methods depending on what the page source looks like and how it is set up.
>> The ultimate idea is that you identify specific patterns in the HTML in order to pull out the relevant data. But seeing as the pattern of code is different for all websites, it’s a very case-by-case technique. <<
I parse out the information by using Python, but you can use other languages as well (real talk though: Python is the best.)
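To make that concrete, here’s a minimal Python sketch. In a real project you’d fetch the page first (with something like `requests` or `urllib`) and probably reach for a parsing library like BeautifulSoup, but to keep this self-contained it uses only the standard library and a made-up snippet of grocery-store HTML – the class names `fruit-name` and `fruit-price` are invented for illustration:

```python
from html.parser import HTMLParser

# A tiny sample of what a grocery site's HTML might look like.
# The class names are hypothetical -- every real site will differ.
SAMPLE_HTML = """
<ul>
  <li><span class="fruit-name">Mango</span><span class="fruit-price">$2.50</span></li>
  <li><span class="fruit-name">Papaya</span><span class="fruit-price">$3.10</span></li>
  <li><span class="fruit-name">Dragonfruit</span><span class="fruit-price">$5.99</span></li>
</ul>
"""

class FruitPriceParser(HTMLParser):
    """Collects fruit names and prices by watching for the two span classes."""
    def __init__(self):
        super().__init__()
        self.current = None  # which field we're currently inside, if any
        self.names, self.prices = [], []

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class", "")
        if tag == "span" and css_class in ("fruit-name", "fruit-price"):
            self.current = css_class

    def handle_data(self, data):
        if self.current == "fruit-name":
            self.names.append(data.strip())
        elif self.current == "fruit-price":
            self.prices.append(float(data.strip().lstrip("$")))
        self.current = None

parser = FruitPriceParser()
parser.feed(SAMPLE_HTML)
prices = dict(zip(parser.names, parser.prices))
print(prices)  # {'Mango': 2.5, 'Papaya': 3.1, 'Dragonfruit': 5.99}
```

The pattern you’re exploiting here is that the site wraps every name and price in a predictably-named tag – that’s exactly the case-by-case detective work from the pull-quote above.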
Part 4: Set a Reference value (Optional)
Now that we have parsed out the data, we need to set up a system to be alerted every time we find something we’re looking for.
Setting a reference value is an optional step, but I like to do it just to make sure there is something to compare the scraped values against.
There are two types of reference values:
- Static: always the same value – e.g. the average price of mangoes across the whole year
- Dynamic: changes throughout the year – e.g. a store price average for the current season
Specifically, when I’m looking for tropical fruit, I want a reference value that is adjusted based on the season we’re in.
Obviously, tropical fruit is cheaper in the summer and more expensive in the winter; so if I want to have dragonfruit year-round, I need to be able to know when it’s the cheapest in the winter and when it’s the cheapest in the summer. Therefore, I need reference values that fluctuate based on the seasons.
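Here’s a rough sketch of what a dynamic, season-aware reference value could look like in Python. The price history and the summer/winter split are made up for illustration – in practice the history would be whatever your scraper has collected over time:

```python
from statistics import mean

# Hypothetical (month, price) observations scraped over the past year.
price_history = [
    (1, 4.20), (2, 4.05), (6, 2.90), (7, 2.75), (8, 3.10), (12, 4.40),
]

def seasonal_reference(history, month):
    """Dynamic reference value: the average of past prices from the same season."""
    summer = {6, 7, 8}  # treat Jun-Aug as 'summer', everything else as 'winter'
    same_season = [price for m, price in history if (m in summer) == (month in summer)]
    return round(mean(same_season), 2)

print(seasonal_reference(price_history, 7))  # summer reference: 2.92
print(seasonal_reference(price_history, 1))  # winter reference: 4.22
```

Because the reference is recomputed from the history, it drifts along with the market on its own – which matters in Part 5.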
Part 5: Set a Trigger Value
Now comes the important part: setting a trigger value. A trigger value is a value that you choose that gets ‘activated’ whenever something specific happens.
If you have a reference value, your trigger value is determined based on the reference value that you have set. It can be a percentage of the reference number (like 30% less) or an absolute number (like $3).
If my reference value in the summer is $3, my trigger value might be $2. That means every time dragonfruit dips below $2 in price in the summer, I get notified. In the winter, however, maybe my reference value is $4, so my trigger value is $3.
If there is a bad harvest one summer, the trigger value will automatically be adjusted because the reference value will account for the bad harvest by itself.
>> It might sound a little complicated, but it’s really not. It’s just a way for you to customize what you want out of the data. <<
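In code, that customization can be as small as two tiny functions. This sketch uses the numbers from the example above – a reference of $3 and a 30% discount threshold:

```python
def trigger_value(reference, discount=0.30):
    """The price at which the alert fires: `discount` below the reference."""
    return round(reference * (1 - discount), 2)

def should_notify(price, reference, discount=0.30):
    """True when the current price has dipped to or below the trigger value."""
    return price <= trigger_value(reference, discount)

# Summer: reference $3.00 -> trigger $2.10, so a $2.00 dragonfruit fires the alert.
print(trigger_value(3.00))        # 2.1
print(should_notify(2.00, 3.00))  # True
print(should_notify(2.50, 3.00))  # False
```

Swap the percentage for a flat amount (say, reference minus $1) and nothing else changes – the trigger is just a rule applied to the reference.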
Part 6: Notifications
Now that you have the data and the trigger values, you need a way to be notified. You will need to figure out a medium that will be able to send the information to you so you know that something has happened!
There’s a huge range of examples depending on what you prefer: phone call, email, text, or you can get real creative and have it post automatically on Twitter.
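For example, an email alert can be sent with Python’s standard-library `smtplib`. Everything below – the addresses, the SMTP host, the password – is a placeholder; swap in your own email provider’s details:

```python
import smtplib
from email.message import EmailMessage

def build_alert(fruit, price, trigger):
    """Build the notification email for a price drop."""
    msg = EmailMessage()
    msg["Subject"] = f"Price alert: {fruit} is ${price:.2f} (trigger ${trigger:.2f})"
    msg["From"] = "scraper@example.com"  # placeholder sender
    msg["To"] = "me@example.com"         # placeholder recipient
    msg.set_content(f"{fruit} just dropped to ${price:.2f} -- time to buy!")
    return msg

def send_alert(msg):
    # Placeholder SMTP settings -- use your provider's host, port, and login.
    with smtplib.SMTP_SSL("smtp.example.com", 465) as server:
        server.login("scraper@example.com", "app-password")
        server.send_message(msg)

msg = build_alert("Dragonfruit", 1.95, 2.10)
print(msg["Subject"])  # Price alert: Dragonfruit is $1.95 (trigger $2.10)
```

Text messages, phone calls, or auto-tweets work the same way conceptually – build the message, hand it to whatever service delivers it.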
Part 7: Timer on Your Script
After you have most of everything set up, it’s time to let the program know how often you want it to run. There are also lots of options here depending on what exactly you’re looking for. If you’re looking for tropical fruit prices, it might not be very useful to run it 24/7, just because prices aren’t updated by the hour.
For tropical fruit prices, I would probably run the code once a day to make sure I am getting enough updates but not too many useless updates when no prices have changed.
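A bare-bones way to do that in Python is a loop that scrapes, then sleeps until the next check. The `max_runs` parameter here is purely a convenience I’ve added so the loop can be stopped in a demo – a real scraper would just run forever:

```python
import time

CHECK_INTERVAL = 24 * 60 * 60  # once a day, in seconds

def run_forever(check_prices, interval=CHECK_INTERVAL, max_runs=None):
    """Call check_prices() every `interval` seconds; `max_runs` caps it for demos."""
    runs = 0
    while max_runs is None or runs < max_runs:
        check_prices()  # your scrape + parse + trigger logic goes here
        runs += 1
        if max_runs is None or runs < max_runs:
            time.sleep(interval)
    return runs

# Demo: a stand-in scrape function, run 3 times with no delay.
calls = []
print(run_forever(lambda: calls.append("checked"), interval=0, max_runs=3))  # 3
```

A sleep loop is the simplest option; the next part covers handing the scheduling over to your operating system instead.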
Part 8: Set Up Script to Run Automatically
Finally, the last step would be to set up the code so it runs by itself. You don’t want to have to go on your computer every day and click ‘Run’ on your program. That is a bit of a waste of time.
To set it up to run automatically, you can, for example, schedule the program to run every time the computer boots up, or always at 11 AM (on Mac and Linux, cron does exactly this; on Windows, Task Scheduler).
>> That’s basically it! You did it! <<
Those are the step-by-step instructions of how I set up a web-scraping program. It is definitely simpler than it sounds – I always feel like web-scraping sounds super serious and technical, but it’s actually not when it’s broken down step-by-step.
TL;DR of this blog post, basically: how to get notified when mangoes are on sale, and how to find an awesome girlfriend for Jim.