Learn to scrape the web by creating your own quotes website 🤖 🤔
There's so much information on the web, but so few APIs 😕
Web scraping lets you get content from any webpage by picking out information with HTML selectors. This is a super simple guide to help you scrape the web with Node.js in less than 20 minutes 🕒
We'll learn to use developer tools to see HTML selectors, extract the content in Node.js with x-ray, and use pug to render the quotes we get!
🛠️ Setup
Um, there's none really.
Just load up this repl - repl.it/@jajoosam/quote-scraper-starter ✨
We'll talk about all the tools and dependencies we use as we continue!
📊 Understanding Structure
Quotesondesign.com has some nice quotes - and is an easy introduction to scraping stuff on the web. Another great part - it loads up a random quote each time!
Open the website up, right click on the quote, and then hit Inspect 👇
You can now see a view of the entire document 📜 Go ahead and click all the small triangles to expand this view.
You'll be able to see that the quote itself is inside a paragraph, in a div with id quote-content, while the author's name has an id of quote-title.
⬇️ Getting the quotes
To scrape information from a web page, you generally request its HTML, and then extract the content you want with selectors, like ids, classes, and HTML tags.
When we inspected Quotesondesign.com, we saw that the quote itself was in a <p> (paragraph element), nested inside a div with id=quote-content - while the author's name was inside an element with id=quote-title.
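To make the selector idea concrete, here's a tiny sketch that runs with nothing but Node itself. The HTML string is made up to mirror the structure described above (it is not the site's real markup), and the helpers `innerHtmlById` and `textOf` are hypothetical. A regex like this is far too fragile for real pages, which is exactly why we reach for a library with a proper selector engine.

```javascript
// Made-up HTML mirroring the structure described above (NOT the site's real markup).
const sampleHtml = `
<div id="quote-content">
  <p>An example quote.</p>
</div>
<h2 id="quote-title">Example Author</h2>
`;

// Hypothetical helper: inner HTML of the first element with the given id.
// Naive on purpose - it breaks on nested tags of the same name, single quotes, etc.
function innerHtmlById(html, id) {
  const pattern = new RegExp(`<(\\w+)[^>]*\\bid="${id}"[^>]*>([\\s\\S]*?)</\\1>`);
  const match = html.match(pattern);
  return match ? match[2] : null;
}

// Hypothetical helper: drop any remaining tags (like the <p>) and trim whitespace.
function textOf(fragment) {
  return fragment.replace(/<[^>]+>/g, '').trim();
}

const quote = textOf(innerHtmlById(sampleHtml, 'quote-content'));
const author = textOf(innerHtmlById(sampleHtml, 'quote-title'));

console.log(quote);  // An example quote.
console.log(author); // Example Author
```

The two ids do all the work here: once you know them, pulling the text out is just a lookup.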
Having this information makes scraping super easy 🙌 - we'll use a library called x-ray - which makes our job very straightforward! It's already installed in the repl you're using.
Try adding this code to your index.js file 👇
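// Assumes `x` already exists in the starter, e.g. const x = require('x-ray')();
x('https://quotesondesign.com', {
  quote: "#quote-content p", // the <p> inside the div with id quote-content
  author: "#quote-title"     // the element with id quote-title
})(function (err, result) {
  if (err) return console.error(err); // log the failure instead of printing undefined
  console.log(result);
});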
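The callback above follows Node's usual `(err, result)` convention, so you can wrap it in a Promise if you prefer `async`/`await`. This is only a sketch: `fakeX` is a hypothetical stand-in that mimics the `x(url, selectors)(callback)` call shape with canned data, so the snippet runs without a network connection, and `scrape` is a helper name invented here. In the repl you'd pass the real `x` instead.

```javascript
// Hypothetical stand-in for x-ray: same call shape, x(url, selectors)(callback),
// but it returns canned data instead of making a request.
function fakeX(url, selectors) {
  return function (callback) {
    if (!url.startsWith('http')) {
      return callback(new Error(`Cannot scrape "${url}"`));
    }
    callback(null, { quote: 'An example quote.', author: 'Example Author' });
  };
}

// Wrap the callback style in a Promise so callers can use async/await.
function scrape(x, url, selectors) {
  return new Promise((resolve, reject) => {
    x(url, selectors)((err, result) => (err ? reject(err) : resolve(result)));
  });
}

const selectors = { quote: '#quote-content p', author: '#quote-title' };

async function main() {
  try {
    const result = await scrape(fakeX, 'https://quotesondesign.com', selectors);
    console.log(result.quote, '-', result.author);
  } catch (err) {
    console.error('Scrape failed:', err.message);
  }
}

main();
```

Keeping the scraper as a parameter (`x`) makes it easy to swap the fake for the real thing, or the other way round when you're testing.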
Run your code, and the console on the bottom right will look something like this 👀
x-ray gives you a plain JavaScript object (it looks just like JSON), which you can now render!
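Scraped text often comes with stray newlines or padding left over from the page's markup, so a small clean-up step before rendering can help. In this sketch the `raw` object is invented to show the shape (the same `quote` and `author` keys we used as selectors), not real output from the site, and `tidy` is a hypothetical helper.

```javascript
// Hypothetical helper: collapse runs of whitespace and trim every string value.
function tidy(result) {
  const clean = {};
  for (const [key, value] of Object.entries(result)) {
    clean[key] = typeof value === 'string' ? value.replace(/\s+/g, ' ').trim() : value;
  }
  return clean;
}

// Invented sample with messy whitespace, shaped like the scraper's result.
const raw = { quote: '  An example\n  quote.\n', author: 'Example Author ' };

console.log(JSON.stringify(tidy(raw), null, 2));
```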
You've now successfully scraped the web 🎉
📜 Render those quotes
If you look at the file tree in the sidebar, you'll see quote.pug - a template which can render quotes passed to it. We're using the pug templating engine to do this - which we've initialized on line 6.
One thing to note is that pug is whitespace sensitive: HTML tags are nested inside each other with tabs ⌨️
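If "whitespace sensitive" sounds abstract, this toy renderer might help. It is not how pug actually works inside - just a minimal sketch, with a made-up `renderIndented` function, showing how indentation alone can decide which tag ends up inside which. Each line is `tag` or `tag some text`, and a deeper indent means "child of the line above".

```javascript
// Toy renderer (NOT pug's real parser): indentation decides nesting.
function renderIndented(source) {
  const open = []; // stack of { depth, tag } for tags that are not closed yet
  let html = '';
  for (const line of source.split('\n')) {
    if (!line.trim()) continue; // skip blank lines
    const depth = line.length - line.trimStart().length;
    // Close every open tag that sits at the same depth or deeper.
    while (open.length && open[open.length - 1].depth >= depth) {
      html += `</${open.pop().tag}>`;
    }
    const [tag, ...words] = line.trim().split(' ');
    html += `<${tag}>${words.join(' ')}`;
    open.push({ depth, tag });
  }
  // Close whatever is still open at the end.
  while (open.length) html += `</${open.pop().tag}>`;
  return html;
}

const template = [
  'div',
  '\tp An example quote.',
  '\th2 Example Author',
].join('\n');

console.log(renderIndented(template));
// <div><p>An example quote.</p><h2>Example Author</h2></div>

// Drop the tab and the <p> is no longer inside the <div>:
console.log(renderIndented('div\np text'));
// <div></div><p>text</p>
```

That second call is the classic pug mistake: one missing tab and your element lands somewhere you didn't expect.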
All we have to do now is pass the quote we get from x-ray to pug! This is very easy to do on our express server, just change your app.get block to this 👨‍💻
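app.get('/', (req, res) => {
  x('https://quotesondesign.com', {
    quote: "#quote-content p",
    author: "#quote-title"
  })(function (err, result) {
    // If the scrape fails, send an error instead of leaving the request hanging
    if (err) return res.status(500).send('Could not fetch a quote');
    console.log(result);
    res.render('quote', result); // passes { quote, author } to the quote.pug template
  });
});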
Run your repl, and this is what you'll see 😮
Pretty neat, huh?
We haven't written any CSS of our own; the page looks readable just because of the simple sakura.css library. Remove line 4 in quote.pug and get ready for ugly 🤮
⚡ Putting your skill to use
There is a lot you can do with web scraping - and this guide has given you all the basic knowledge you need. I'm excited to see what you do with this 😄
Here are a few cool things to try 👇
- Scrape different data points - weather, latest news, bitcoin price 😛 - and make a dashboard for yourself
- Scrape IMDb to get a list of all movies currently in theatres 🎦
- Scrape Repl Talk and make an API for it 👨‍💻
Whatever you build, be sure to share it in the comments 💬
Here's what the final code looks like, feel free to refer to it 👇