
How to scrape HTML tags from a webpage (synchronously)
minermaniac447 (248)


This is a tutorial about how to scrape HTML for data from specific tags, synchronously. The code is easy to adapt to other pages. For this example, I'm going to fetch the results of a YouTube search.
Now, this is a bit simpler than the async version, but you wouldn't want to use it in something like a Discord bot (my original use case), where a blocking network call would stall the event loop. First we need to import our essential packages:

import requests
import re

requests will let us download webpages to get the tags, and re will let us search the downloaded HTML for them. Now we need to create a function; let's call it ytsearch. It will take two parameters: num (how many links to print) and query (the search terms).

def ytsearch(num, query):

Inside this function, the first thing we should do is to use requests.get on the url. To allow us to search for anything, we're going to format the url with the query. So, this is what that line should look like:

page = requests.get("https://youtube.com/results?search_query={}".format(query))

This gives us a Response object. Its .content attribute holds the raw bytes of the page, which we convert to a string so re can search it:

html = str(page.content)

Next, we need to find every video link (the href="/watch?v=... attribute) and scrape the 11-character video ID from each. I've found that this occasionally gives duplicates of the same ID, so I've had to create two lists to remove duplicates. However, you probably won't need to do this for every use case.

dresults = re.findall('href=\"/watch\?v=(.{11})', html)
results = []
for result in dresults:
	if result not in results:
		results.append(result)

Let's walk through what this does. The first line uses regular expressions (through the re module) to find all occurrences of href="/watch?v=, and gets the next 11 characters after that. It saves all the IDs on the page to the list dresults (for results with duplicates). Then we have our final results list, which starts empty. The rest of the code copies each item from dresults to results if the value isn't already in there (removing duplicates).
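As an aside, the duplicate-removal loop above can be collapsed into one line with dict.fromkeys, which keeps the first occurrence of each ID in order on Python 3.7+ (the IDs below are just placeholders):

```python
# Dicts preserve insertion order in Python 3.7+, so dict.fromkeys
# keeps the first occurrence of each ID and silently drops repeats.
dresults = ["dQw4w9WgXcQ", "jNQXAC9IVRw", "dQw4w9WgXcQ"]  # placeholder IDs
results = list(dict.fromkeys(dresults))
print(results)  # ['dQw4w9WgXcQ', 'jNQXAC9IVRw']
```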
Now, you may remember we got num as an argument for our function. The problem with a user-specified number of items to retrieve is that if they enter a number that's too high, it will throw an error. Therefore, we need to include code to limit num to the length of results:

if num > len(results):
	num = len(results)
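The same clamp can also be written with the built-in min(), which reads a little more directly:

```python
results = ["id1", "id2", "id3"]  # placeholder list of video IDs
num = 10
num = min(num, len(results))  # never ask for more links than we found
print(num)  # 3
```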

Finally, we just need to print out each full YouTube link!

for x in range(num):
	link = "https://youtu.be/" + results[x]
	print(link)

This will iterate through the results list num times, printing a youtu.be link each time.
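Putting all the pieces together, the whole function might look like this (a sketch of the steps above; the commented-out call at the bottom needs network access, so it isn't run here):

```python
import re
import requests

def ytsearch(num, query):
    """Print up to num youtu.be links for a YouTube search query."""
    page = requests.get("https://youtube.com/results?search_query={}".format(query))
    html = str(page.content)
    # Capture the 11 characters after every href="/watch?v= (the video ID).
    dresults = re.findall(r'href="/watch\?v=(.{11})', html)
    results = []
    for result in dresults:
        if result not in results:  # skip duplicate IDs
            results.append(result)
    if num > len(results):        # clamp num to what we actually found
        num = len(results)
    for x in range(num):
        print("https://youtu.be/" + results[x])

# ytsearch(5, "python tutorial")  # uncomment to try it (requires network)
```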

How to edit

To edit this code, all you need to do is change the search pattern in re.findall(). You can change how many characters are captured by changing the number inside the curly braces from 11 to something else.
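For instance, here's a hypothetical variant that pulls image paths instead of video IDs. Since src values vary in length, a non-greedy group that stops at the closing quote replaces the fixed {11} count:

```python
import re

html = '<img src="/logo.png"> <img src="/banner.jpg">'  # stand-in HTML
# (.*?) captures as little as possible, stopping at the closing quote.
srcs = re.findall(r'<img src="(.*?)"', html)
print(srcs)  # ['/logo.png', '/banner.jpg']
```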

And that's it for the synchronous version! Stay tuned for the async version, coming soon!
