Web Scraping in Python: Beautiful Soup, Selenium, or Scrapy?
Find the best web scraping tool for your project.
Hi!
In the previous post we saw the easiest way to web scrape with Python, and now it's time to look at more traditional and robust tools. There are many options and, in this article, we'll see which one is the most convenient for your project.
I’m preparing some tutorials with these web scraping libraries. These tutorials aren’t included in my course, so stay tuned!
The internet is full of data available for you to start a project. Obtaining that data could be as simple as copying and pasting it, but when it comes to large data, web scraping is the best solution.
If you google "how to web scrape with Python," you'll get many tutorials using different Python libraries and frameworks. In this guide, we'll analyze the 3 most popular web scraping tools in Python, so you can choose the one that best suits your project.
Beautiful Soup
Beautiful Soup can pull data out of HTML and XML files. On top of that, it’s the easiest to learn among the 3 options.
However, Beautiful Soup has some dependencies: it needs the requests
library to fetch pages from a website and a parser (such as Python's built-in html.parser or the external lxml) to process the HTML or XML. These dependencies can make it harder to transfer code between projects.
Let’s see how scraping sites with Beautiful Soup looks.
import requests
from bs4 import BeautifulSoup
# sending request and parsing
website = requests.get('https://example.com').text
soup = BeautifulSoup(website, 'html.parser')
# extracting data
headlines = soup.find_all('span', class_='class-example')
data = [headline.text for headline in headlines]
As we can see, only a few lines of code are needed to extract data with Beautiful Soup, but we need to import requests
to fetch the website and choose a parser (here, html.parser
) to process the content.
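Since Beautiful Soup also accepts raw HTML strings, the same extraction pattern can be tried without any network request. Here's a minimal, self-contained sketch (the HTML snippet and the class-example class name are made up for illustration):

```python
from bs4 import BeautifulSoup

# a tiny, made-up HTML snippet standing in for a downloaded page
html = """
<html><body>
  <span class="class-example">First headline</span>
  <span class="class-example">Second headline</span>
</body></html>
"""

# 'html.parser' ships with Python; 'lxml' is a faster external alternative
soup = BeautifulSoup(html, 'html.parser')
headlines = soup.find_all('span', class_='class-example')
data = [headline.text for headline in headlines]
# data == ['First headline', 'Second headline']
```

Swapping 'html.parser' for 'lxml' in the BeautifulSoup call changes the parser without touching the extraction code, which is handy when you need more speed or leniency.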
Selenium
Selenium wasn't originally designed for web scraping. It's a WebDriver-based tool built to render web pages for automated testing of web applications.
That's exactly what makes Selenium useful for web scraping: many websites rely on JavaScript to create content dynamically, and since Selenium drives a real browser, it can see that content. Tools like Beautiful Soup only parse the HTML the server returns, which limits what they can extract from such websites.
Selenium is not as easy to learn as Beautiful Soup; however, it’s still a friendly tool since it allows code to mimic human behavior such as clicking on a button, selecting dropdown menus, maximizing windows, etc.
Let’s see an example of web scraping with Selenium.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service

path = '/Users/.../chromedriver'  # path of your driver file
driver = webdriver.Chrome(service=Service(path))
driver.get("http://www.example.com")
# mimic human behavior: click a button, then collect the elements
driver.find_element(By.XPATH, '//*[@id="accept-button"]').click()
elements = driver.find_elements(By.CLASS_NAME, 'class-example')
data = [element.text for element in elements]
driver.quit()
One of the disadvantages of Selenium is speed. Because it drives a real browser and executes every script on the page, web scraping with Selenium is slower than sending plain HTTP requests. However, if speed isn't a top priority, Selenium is a good option.
Scrapy
Scrapy is a framework built especially for web scraping and written entirely in Python. It's built on top of Twisted, an asynchronous networking framework, which allows applications to respond to many network connections without using traditional threading models.
One of the biggest advantages of Scrapy is speed. Since it's asynchronous, Scrapy spiders don't wait for one request to finish before making the next; requests run concurrently. This makes Scrapy more memory- and CPU-efficient than the web scraping tools analyzed above.
One drawback of Scrapy is that it doesn't handle JavaScript by default; it relies on a companion tool such as Splash to render it. Also, Scrapy's learning curve is steeper than that of tools like Beautiful Soup, and the installation and setup can be a bit complicated.
Which is the best web scraping tool?
After analyzing each scraping tool, let’s see which one excels in different scenarios.
Size of your project
When it comes to large-scale projects, Scrapy is the best option because of its architecture and functionalities. Its standardized project structure also makes code easier to maintain and move between projects, which benefits large codebases.
Beautiful Soup would fit better for small and simple projects, while Selenium would be in the middle between these two options since it can extract data from websites that use JavaScript, but the approach it uses for web scraping isn’t efficient.
Performance
Scrapy offers the best speed since it's asynchronous and was built specifically for web scraping. Beautiful Soup and Selenium, by contrast, become inefficient when scraping large amounts of data.
Ease of use
Beautiful Soup is the easiest option for web scraping. Its simplicity and straightforward approach help beginners learn web scraping fast. Selenium is reasonably approachable too, while Scrapy tends to be the most complex of the three.
Summary
Below you'll find a table with a summary of each web scraping tool.

Tool             Ease of use      Speed                 JavaScript support   Best for
Beautiful Soup   Easiest          Slow at large scale   No                   Small, simple projects
Selenium         Moderate         Slowest               Yes (real browser)   JavaScript-heavy sites
Scrapy           Steepest curve   Fastest (async)       Via Splash           Large-scale projects
In a nutshell, Scrapy is best when dealing with large projects where efficiency and speed are top priorities. Selenium excels in dealing with core JavaScript-based web applications and it’s good for projects where speed isn’t relevant. Finally, Beautiful Soup is better for beginners who want to start simple web scraping projects.
That said, Selenium's learning curve is not trivial; beginners should expect to spend some time getting comfortable with its API.