A Webscraper Written in Python
a5rocks

This webscraper makes use of queue objects and requests-html. Let me explain both, and why I am using them.

queue: A Queue is a data structure that works like a to-do list. You add to it using Queue.put(). For example, Queue.put('https://rickroll.here') adds 'https://rickroll.here' to the queue. To take something out, you use Queue.get(), which returns your item and then removes it from the queue. This way, you don't need to worry about removing links you already scraped.
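
For instance, here is a minimal sketch of that to-do-list behavior (the URLs are just placeholders):

from queue import Queue

links = Queue()
links.put('https://rickroll.here')  # add a URL to the to-do list
links.put('https://crouton.net')    # add another one
to_get = links.get()                # returns 'https://rickroll.here' and removes it from the queue
print(links.qsize())                # 1 link still waiting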

requests-html: requests-html is basically a wrapper around requests and beautifulsoup4. It provides a lot of functionality for pulling information out of web pages. For example, here is how you get a webpage's links (assuming session is requests_html.HTMLSession()):

r = session.get('https://crouton.net')  # fetch the page
links = r.html.links                    # set of URLs found on the page
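
r.html.absolute_links does the same thing, but resolves every link to a full URL; that is the property the scraper below relies on.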

All in all, this project was just me checking out requests-html and queues in a project I actually finished.

Comments
a5rocks

Please note that I complicated the project somewhat by making it not fail the moment something raises an error.

Here is the project in a nutshell (the differences compared to the version above are error reporting and a tiny bit of optimization):

from requests_html import HTMLSession
from queue import Queue

# page to start crawling from
start_page = 'https://html.python-requests.org/'

links_queue = Queue()
session = HTMLSession()

# seed the to-do list with the start page
links_queue.put(start_page)

while True:
  # block=False raises queue.Empty instead of waiting if the queue is ever empty
  to_get = links_queue.get(block=False)
  r = session.get(to_get)

  # queue up every absolute link found on the page for later scraping
  for link in r.html.absolute_links:
    links_queue.put(link)

  print(to_get)  # report the page that was just scraped
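
The error reporting and optimization mentioned above are not shown in that snippet. As a rough idea of what they could look like, here is a minimal sketch, assuming the optimization is simply skipping pages that were already scraped; the visited set and the try/except blocks are my guesses, not necessarily how the real repl does it:

from requests_html import HTMLSession
from queue import Queue, Empty

start_page = 'https://html.python-requests.org/'

links_queue = Queue()
session = HTMLSession()
visited = set()  # guessed optimization: remember pages we already scraped

links_queue.put(start_page)

while True:
  try:
    to_get = links_queue.get(block=False)
  except Empty:
    break  # nothing left to scrape

  if to_get in visited:
    continue
  visited.add(to_get)

  try:
    r = session.get(to_get)
  except Exception as error:
    print('failed:', to_get, error)  # report the error and keep going
    continue

  for link in r.html.absolute_links:
    links_queue.put(link)

  print(to_get)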