Behind AI #5: Web Scraping in Python with Beautiful Soup
Getting started with web scraping in Python (Part 1).
Hi!
I prepared two tutorials to help you get started with Beautiful Soup and Selenium. In these tutorials, we’ll scrape a simple website from scratch so that you can see for yourself how the two libraries differ.
In this article, we’ll extract football data from all the FIFA World Cups played from 1930 to 2022. That’s around one thousand games.
Here are six of these games (we’ll extract some of this data).
To extract this data, we’ll scrape Wikipedia using Python and Beautiful Soup. The data we want is spread across multiple Wikipedia pages, so we’ll start by extracting data from one page and then write a for loop to extract data from all the pages.
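To make the "one page first, then a loop" plan concrete, here is a minimal sketch of how the list of pages could be built. It assumes each tournament has its own Wikipedia page at a URL of the form shown in BASE_URL; that pattern and the variable names are illustrative assumptions, not something fixed by this tutorial.

```python
# World Cup years: every four years from 1930 to 2022,
# except 1942 and 1946, when no tournament was held.
years = [year for year in range(1930, 2023, 4) if year not in (1942, 1946)]

# Assumed URL pattern for each tournament's Wikipedia page.
BASE_URL = "https://en.wikipedia.org/wiki/{year}_FIFA_World_Cup"
urls = [BASE_URL.format(year=year) for year in years]

# Preview the first few pages the loop would visit.
for url in urls[:3]:
    print(url)
```

Building the list up front keeps the scraping logic for a single page separate from the loop that repeats it for every tournament.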
Let’s install the libraries.
Installing the libraries
In this tutorial, we’ll use bs4 to scrape websites, lxml to parse HTML documents, and requests to send requests to the target website.
Here are the commands you need to run in the terminal to install these libraries.
pip install bs4
pip install lxml
pip install requests
In addition to the previous libraries, we’ll install pandas to better manage the data we’re going to extract.
pip install pandas
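If you want to confirm that everything installed correctly before moving on, a quick check like the one below may help. It is only a sketch: the helper name missing_packages is made up for this example, and it uses Python’s standard importlib to look up each library by its import name.

```python
from importlib.util import find_spec

# Import names of the libraries installed above.
REQUIRED = ["bs4", "lxml", "requests", "pandas"]


def missing_packages(names):
    """Return the names that Python cannot find in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("All libraries are installed.")
```

If a library shows up as missing, rerun its pip install command in the same environment you use to run your scripts.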
Now let’s start coding!