Web-Scraping Using Python-Based Scraping Tools

Ask a data engineer and they will likely confirm that they have found themselves hoping a given website has a built-in API for their regular data-collection needs. When it doesn’t – and often sites don’t – they must rely on web-scraping tools.

Python-based web-scraping tools offer a good number of benefits, and each has its own nuances. Below is a quick preview of a few I like to use and what they can help you with when it’s time to scrape a site!

Scrapy

Scrapy is an open-source and collaborative web-crawling framework written entirely in Python. According to Scrapy’s official documentation: “It can extract structured data which can be used for a wide range of useful applications, like data mining, information processing, or historical archiving.” One of Scrapy’s main advantages over other web-scraping tools is that it’s built on top of Twisted, an asynchronous networking framework. In other words, requests are dispatched concurrently, so the crawler doesn’t sit idle waiting for each response before issuing the next request. Here’s an analogy for asynchronous requesting:

Let’s say you need to call 100 different people. Working synchronously, you would dial the first number on the list, wait for an answer (or for the line to ring out), and only then move on to the next, wasting most of your time waiting. Working asynchronously, you can dial the first 20-30 numbers at once and spend time only on the calls where someone on the other end actually picks up.
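To make that concrete, here is a minimal sketch of a Scrapy spider that crawls Scrapy’s own demo site, quotes.toscrape.com; the spider name and output fields are just illustrative choices:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    """Minimal spider: scrape quotes and follow pagination links."""
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Each quote on the page lives in a <div class="quote"> block.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Queue the "Next" page; Scrapy (via Twisted) keeps many such
        # requests in flight at once instead of waiting on each response.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

Saved as quotes_spider.py, this runs without a full project scaffold via `scrapy runspider quotes_spider.py -o quotes.json`.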

urllib2 and Requests

urllib2 is a Python 2 module (split into urllib.request and urllib.error in Python 3) that defines functions and classes for opening URLs in a complex world (basic and digest authentication, redirections, cookies, etc.). urllib2’s biggest advantage is that it’s part of the Python standard library, which means that as long as you have Python installed, you are good to go. The module was the go-to choice until the launch of a similar tool called Requests. Compared with urllib2, Requests has more comprehensive official documentation and, in its own words, lets the user send “organic, grass-fed” HTTP/1.1 requests without the need for manual labor. Nowadays, Requests is arguably the most popular third-party module for Python, period. Unlike urllib2, Requests doesn’t come pre-installed with Python, meaning the user will have to install it before using it.
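As a rough side-by-side, here is the same GET request with both libraries; https://httpbin.org/get is just a public echo endpoint standing in for whatever site you’re scraping:

```python
import urllib.request

import requests  # third-party: pip install requests

URL = "https://httpbin.org/get"

# Standard library: ships with Python, no install needed, but more verbose.
with urllib.request.urlopen(URL) as resp:
    print(resp.status, resp.read(80))

# Requests: must be installed first, but the same call is shorter
# and friendlier (query params, JSON decoding, etc. built in).
resp = requests.get(URL, params={"q": "scraping"})
print(resp.status_code, resp.json()["args"])
```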

Beautiful Soup and lxml

Beautiful Soup is designed for quick turnaround projects like screen-scraping and extracting data points from loaded pages. It provides three main features:

  1. A few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: essentially a toolkit for dissecting a document and extracting what you need
  2. Automatically converts incoming documents to Unicode and outgoing documents to UTF-8
  3. Sits on top of popular Python parsers like lxml and html5lib, allowing you to try out different parsing strategies or to trade speed for flexibility if you need it
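Here is a small, self-contained sketch of that toolkit in action; the HTML snippet and class names are invented for illustration:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = """
<html><body>
  <h1>Best Sellers</h1>
  <ul>
    <li class="book"><a href="/b/1">Dune</a></li>
    <li class="book"><a href="/b/2">Neuromancer</a></li>
  </ul>
</body></html>
"""

# "html.parser" is the stdlib parser; pass "lxml" or "html5lib" instead
# to swap parsing strategies without changing the rest of the code.
soup = BeautifulSoup(html, "html.parser")

for li in soup.find_all("li", class_="book"):
    link = li.find("a")
    print(link.get_text(), "->", link["href"])
```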

lxml is similar to Beautiful Soup in that it handles scraping data from XML and HTML. According to lxml’s official documentation: “lxml is the most feature-rich and easy-to-use library for processing XML and HTML in the Python language.” However, lxml’s documentation is not as beginner-friendly as Beautiful Soup’s.
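For comparison, here is a minimal lxml sketch that uses its XPath support, a feature Beautiful Soup does not offer natively; the markup is again invented for illustration:

```python
from lxml import html  # pip install lxml

doc = html.fromstring("""
<html><body>
  <div id="prices">
    <span class="price">19.99</span>
    <span class="price">4.50</span>
  </div>
</body></html>
""")

# One XPath expression pulls the text of every <span class="price">.
prices = [float(p) for p in doc.xpath('//span[@class="price"]/text()')]
print(prices)  # [19.99, 4.5]
```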

Selenium

Selenium’s Python bindings provide a simple API for writing tests of browser-based web apps with Selenium WebDriver, and the same API makes for an easy data-extraction path on JavaScript-heavy sites, which are becoming increasingly common. Because a real browser executes the page’s scripts before you read the DOM, content that only appears after JavaScript runs is still reachable. During the scraping process, we frequently use Scrapy selectors to grab the HTML that Selenium produces. The bindings give convenient access to Selenium WebDrivers like Firefox, IE, Chrome, Remote, and more. Currently supported Python versions are 2.7, 3.5 and above.
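Here is a rough sketch against the JavaScript-rendered variant of the same demo site, written with the modern Selenium 4 API (older bindings used helpers like find_elements_by_css_selector instead); it assumes a Chrome driver is available locally, which Selenium 4.6+ can fetch automatically:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # swap in webdriver.Firefox(), etc. as needed
try:
    # This page renders its quotes with JavaScript, so a plain HTTP
    # request would see an empty shell; the browser runs the JS first.
    driver.get("https://quotes.toscrape.com/js/")
    for quote in driver.find_elements(By.CSS_SELECTOR, "div.quote span.text"):
        print(quote.text)
finally:
    driver.quit()
```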

What are some other scraping tools (Python-based or not) you might have used successfully? 
