Information is abundant and readily available on the internet. However, the sheer amount of data can be overwhelming and time-consuming to navigate through. That's where web scraping comes in - a powerful tool used to extract data from websites and turn it into a usable format.
In this tutorial, we will explore the basics of web scraping and how to implement it using Scrapy (a Python framework). Whether you are a data analyst, programmer, or researcher, this tutorial will equip you with the fundamental skills needed to create your own web scraper and extract valuable information from websites.
During the tutorial, we will learn web scraping concepts by using Scrapy to solve exercises against a fake website whose challenges closely resemble those found on real websites.
To follow along more easily, please install Scrapy on your machine beforehand. Setup instructions are available at https://github.com/rennerocha/europython-2023-gathering-data-tutorial#before-the-tutorial
(presentation) Web scraping fundamentals
- What web scraping is, why we would want to do it, and real use cases for it
(presentation) Scrapy basic concepts
- What Scrapy is, its advantages over other tools, its basic building blocks (Spider, Request, Response, parse callbacks), and how to create a very simple web crawler
(presentation + exercise) Scraping a basic HTML page
- Gathering data from http://quotes.toscrape.com/, a plain, well-structured HTML page
- Gathering data from http://quotes.toscrape.com/scroll, where the data is loaded through an API call
(presentation + exercise) Scraping pages with forms and ViewState
- Gathering data from http://quotes.toscrape.com/search.aspx, where we need to handle form submission and ViewState
(presentation) Proxies and headless browsers
- What other tools we can turn to when simple requests are not enough
(presentation) Being polite and not gathering data you shouldn't gather
- How to avoid interfering with your target website, and some restrictions on what data you can and cannot gather
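
Part of being polite can be expressed in a Scrapy project's `settings.py`. The setting names below are real Scrapy settings, but the values and the contact URL are illustrative assumptions rather than recommendations from the tutorial.

```python
# settings.py (sketch; values are illustrative, tune them per target site)

# Respect the rules the site publishes in its robots.txt.
ROBOTSTXT_OBEY = True

# Wait between requests to the same site and limit parallelism.
DOWNLOAD_DELAY = 1.0
CONCURRENT_REQUESTS_PER_DOMAIN = 2

# Let Scrapy adapt the delay to how quickly the server responds.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Identify the crawler and give site owners a way to reach you
# (hypothetical contact URL).
USER_AGENT = "tutorial-bot (+https://example.com/contact)"
```

Settings like these reduce the load on the target site; whether gathering a given piece of data is acceptable at all remains a separate legal and ethical question.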