What is Web Scraping?

One of the more popular uses of Python, web scraping is a powerful tool that you can use to play with data found on the Internet. Also known as web harvesting, web scraping lets a program read through HTML pages and retrieve useful information, whether for data processing or simply for sharing.

Before You Learn to Web Scrape...

To understand how web scraping is done, you need a basic grasp of HTML fundamentals and syntax. Being able to read and understand how an HTML page is structured is good enough. Check out this resource if front-end languages seem foreign to you, or if you just need a bit of a refresher.

Modules Required

Web scraping revolves around breaking down the HTML content of web pages and extracting what you want. The third-party BeautifulSoup package (installed as beautifulsoup4 and imported from bs4) parses HTML into a format that you can work with. To download the pages themselves, you can use the third-party requests library, which the steps below rely on, or urllib.request from Python's standard library.
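If you would rather stay within the standard library, a minimal sketch of fetching a page with urllib.request might look like this. The function name fetch_html, the timeout value, and the UTF-8 fallback are illustrative choices of mine, not part of the original walkthrough.

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Return the body of `url` decoded to text (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Use the charset declared in the Content-Type header, if any.
        # Falling back to UTF-8 is an assumption, not a guarantee.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch_html() with a page address returns the page's HTML as one string, ready to be handed to a parser.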

Approach

Web scraping can be done in many different ways, but the main approach is as follows:

  1. Use the requests library to pull the HTML from the webpage
  2. Use the BeautifulSoup library to traverse the HTML and select the relevant portions
  3. Feed the extracted data into your main program or file

STEP 1: REQUESTS LIBRARY

The requests library allows you to retrieve the HTML source code of a web page.

Printing the response shows <Response [200]>, which indicates a successful web request

Here, myrequest is the Response object returned by requests.get(), and you can interact with it through its attributes and methods. For example, myrequest.text holds the retrieved HTML as a string, while myrequest.encoding gives the text encoding used to decode it. This is a very brief description of the requests library, so make sure you take a closer look at the module’s quick-start docs!
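As a rough sketch of this step, the snippet below wraps the request in a small function and gathers the attributes mentioned above. The helper names fetch_page and describe, and the 10-second timeout, are assumptions made for illustration only.

```python
import requests


def fetch_page(url, timeout=10):
    """Send a GET request and return the Response object (hypothetical helper)."""
    myrequest = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx answers instead of parsing an error page.
    myrequest.raise_for_status()
    return myrequest


def describe(myrequest):
    """Collect the attributes discussed above into a small dictionary."""
    return {
        "status": myrequest.status_code,     # 200 means the request succeeded
        "encoding": myrequest.encoding,      # encoding used to decode .text
        "html_length": len(myrequest.text),  # size of the decoded HTML string
    }
```

You could then call describe(fetch_page(url)) with the address of any practice page to check what came back before moving on to parsing.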

STEP 2: BEAUTIFUL SOUP LIBRARY

Next, the BeautifulSoup library allows you to parse the messy HTML code into a BeautifulSoup object. What’s so special about the BeautifulSoup object? Its methods let you reach very specific portions of the HTML code to read and use. To understand what you can really do with BeautifulSoup, take a look at its quick-start docs!

BeautifulSoup objects can be accessed based on HTML tags

This is where basic knowledge of HTML comes in handy. BeautifulSoup objects allow for easy access to specific information based on HTML tags. As such, API methods such as soup.find_all() can be used to select what is needed and ignore the rest.
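To make the tag-based access concrete, here is a small sketch that runs soup.find_all() over a hard-coded HTML string. The sample markup, its class names, and the extract_quotes helper are invented for this example; a real page will use its own tags and classes, which you would find by inspecting its source.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for this illustration.
SAMPLE_HTML = """
<html><body>
  <div class="quote">
    <span class="text">First quote.</span>
    <small class="author">Author A</small>
  </div>
  <div class="quote">
    <span class="text">Second quote.</span>
    <small class="author">Author B</small>
  </div>
</body></html>
"""


def extract_quotes(html):
    """Return (text, author) pairs from every <div class="quote"> block."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for block in soup.find_all("div", class_="quote"):
        text = block.find("span", class_="text")
        author = block.find("small", class_="author")
        if text is None or author is None:
            continue  # skip blocks that do not match the expected layout
        results.append((text.get_text(strip=True), author.get_text(strip=True)))
    return results
```

Everything outside the selected tags is simply ignored, which is what makes this approach convenient on cluttered pages.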

STEP 3: PUTTING IT INTO THE MAIN PROGRAM

Now that you can access the specific data you need, you have the flexibility to work with it however you like! You can do many things, from pulling up a list of items on sale on an e-commerce website to comparing timetables between multiple students. This method can easily be added to bigger programs or scripts. Pro tip: BeautifulSoup can also locate the images on a web page, which you can then download with a separate request!
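Continuing the image tip, one possible sketch collects the address of every image on a page. BeautifulSoup only finds the img tags; downloading each file would still need a separate request. The helper name find_image_urls and the base_url parameter are assumptions for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def find_image_urls(html, base_url):
    """Return absolute URLs for every <img> tag that has a src attribute."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for img in soup.find_all("img"):
        src = img.get("src")
        if src:
            # Relative paths such as "/a.png" are resolved against the page address.
            urls.append(urljoin(base_url, src))
    return urls
```

A function like this slots into a larger script the same way as any other helper: fetch the page, pass its HTML in, and loop over the returned list.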

One thing to note: not all websites can be scraped. Some websites are protected from web scraping for legal reasons, while others simply have complex HTML formatting that requires more complex scraping code. For practice, you can use toscrape.com, which is what I used for the demonstrations in this post.
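One common signal of what a site allows is its robots.txt file, which Python's standard library can read. The sketch below assumes you already have the file's contents as a string; the is_allowed helper and the sample rules are hypothetical, and passing this check is not a substitute for reading a site's terms of use.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt rules let `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Invented rules for demonstration purposes only.
SAMPLE_RULES = "User-agent: *\nDisallow: /private/\n"
```

With these sample rules, a page under /private/ is reported as off limits while other paths are permitted.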

Recommended Resources

There are many resources available to learn web scraping, depending on the type of learning style you prefer. Here are some resources we found to be most reliable and effective in learning the basics.


Python for Data Science Essential Training – Web Scrape in Practice

Lynda.com is freely available to the NTU Community via NTULearn.

To access LyndaCampus, log in to NTULearn, go to “Self-paced Learning” and click on the LyndaCampus link provided, or simply log in at this link. Then, search for the course title given, or click on the image below.

 

Here’s an example of web scraping being used to extract random quotes from the TV series “How I Met Your Mother”.

Try making use of web scraping with the next application that you develop with Python, and share with us in the comments below. 

Have fun scraping!

For more Python programming resources, check these other posts out.
Be sure to follow us on Twitter @NTUsgLibrary, and our hashtag #NTUsgLibraryDS