This webscraper makes use of queue objects and requests-html. Let me explain both, and why I am using them.
queue: Queues are a data type that works basically like a to-do list. You add to one using Queue.put(); for example, Queue.put('https://rickroll.here') adds 'https://rickroll.here' to the queue. To get something back out, you use Queue.get(), which returns your item and removes it from the queue. This way, you don't need to worry about removing links you already scraped.
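A minimal sketch of that to-do-list behavior (the URLs are just placeholders):

```python
from queue import Queue

links_queue = Queue()
links_queue.put('https://rickroll.here')   # add an item to the to-do list
links_queue.put('https://crouton.net')

item = links_queue.get()    # returns the oldest item (FIFO)...
print(item)                 # https://rickroll.here
print(links_queue.qsize())  # ...and removes it, so 1 item remains
```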
requests-html builds on requests plus HTML-parsing libraries such as BeautifulSoup. It provides a lot of functionality for getting information from the Internet. For example, here is how you get a webpage's links (assuming session is an HTMLSession):

```python
r = session.get('https://crouton.net')
links = r.html.links
```
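One detail worth knowing: r.html.links returns hrefs exactly as they appear in the page, while r.html.absolute_links resolves them against the page's URL, much like urllib.parse.urljoin does. A rough sketch with hypothetical URLs:

```python
from urllib.parse import urljoin

page_url = 'https://crouton.net/index.html'  # hypothetical page
# hrefs as r.html.links might report them, relative and absolute mixed:
hrefs = ['/about', 'croutons.html', 'https://example.com/x']

# absolute_links-style resolution: join each href against the page URL
absolute = sorted(urljoin(page_url, h) for h in hrefs)
print(absolute)
```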
All in all, this project was just me checking out queues in a project I actually finished.
Please note that I complicated the project somewhat by making it keep going instead of failing the second something raises an error.
Here is the project in a nutshell (the only differences compared to the version above are error reporting and a tiny bit of optimization):
```python
from requests_html import HTMLSession
from queue import Queue

start_page = 'https://html.python-requests.org/'
links_queue = Queue()
session = HTMLSession()

links_queue.put(start_page)

while True:
    to_get = links_queue.get(block=False)  # raises queue.Empty when the list runs dry
    r = session.get(to_get)
    for link in r.html.absolute_links:     # queue up every link found on the page
        links_queue.put(link)
    print(to_get)
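The "keep going on errors" behavior mentioned above can be sketched like this. Here fetch_links is a hypothetical stand-in for session.get(url).html.absolute_links, so the loop runs without the network; the seen set is the kind of small optimization (skipping already-scraped URLs) a real crawl would want. This is a sketch under those assumptions, not the project's actual code:

```python
from queue import Queue, Empty

def fetch_links(url):
    # Hypothetical stand-in for session.get(url).html.absolute_links;
    # raises for one URL to simulate a failed request.
    fake_web = {
        'https://html.python-requests.org/': ['https://a.example/', 'https://broken.example/'],
        'https://a.example/': [],
    }
    if url not in fake_web:
        raise ValueError('request failed: ' + url)
    return fake_web[url]

links_queue = Queue()
links_queue.put('https://html.python-requests.org/')
seen = set()

while True:
    try:
        to_get = links_queue.get(block=False)
    except Empty:
        break                      # to-do list is empty: we're done
    if to_get in seen:
        continue                   # tiny optimization: skip repeats
    seen.add(to_get)
    try:
        for link in fetch_links(to_get):
            links_queue.put(link)
        print(to_get)
    except ValueError as exc:
        print('skipped:', exc)     # report the error and keep scraping
```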