Web Scraping Course



Contents

  • What is Beautiful Soup?
  • Application: Extracting names and URLs from an HTML page
  • But wait! What if I want ALL of the data?

Version: Python 3.6 and BeautifulSoup 4.


This tutorial assumes basic knowledge of HTML, CSS, and the Document Object Model. It also assumes some knowledge of Python. For a more basic introduction to Python, see Working with Text Files.

Most of the work is done in the terminal. For an introduction to using the terminal, see the Scholar’s Lab Command Line Bootcamp tutorial.


What is Beautiful Soup?


Overview

“You didn’t write that awful page. You’re just trying to get some data out of it. Beautiful Soup is here to help.” (Opening lines of Beautiful Soup)


Beautiful Soup is a Python library for getting data out of HTML, XML, and other markup languages. Say you’ve found some webpages that display data relevant to your research, such as date or address information, but that do not provide any way of downloading the data directly. Beautiful Soup helps you pull particular content from a webpage, remove the HTML markup, and save the information. It is a tool for web scraping that helps you clean up and parse the documents you have pulled down from the web.

The Beautiful Soup documentation will give you a sense of the variety of things that the Beautiful Soup library will help with, from isolating titles and links, to extracting all of the text from the HTML tags, to altering the HTML within the document you’re working with.

Installing Beautiful Soup

Installing Beautiful Soup is easiest if you have pip or another Python installer already in place. If you don’t have pip, run through a quick tutorial on installing Python modules to get it running. Once you have pip installed, run the following command in the terminal to install Beautiful Soup:
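    pip install beautifulsoup4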

You may need to preface this line with “sudo”, which gives your computerpermission to write to your root directories and requires you tore-enter your password. This is the same logic behind you being promptedto enter your password when you install a new program.

With sudo, the command is:
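    sudo pip install beautifulsoup4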

Additionally, you will need to install a “parser” for interpreting the HTML. To do so, run in the terminal:
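(The original commands are missing from this copy of the tutorial. The Beautiful Soup documentation recommends the lxml and html5lib parsers, which are installed with:)

    pip install lxml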

or
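    pip install html5lib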

Finally, so that this code works with either Python 2 or Python 3, you will need one helper library. Run in the terminal:
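(The exact command has been lost from this copy. “future” and “six” are the two libraries most commonly used for Python 2/3 compatibility, so the following commands are an assumption rather than the original instruction:)

    pip install future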

or
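    pip install six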

Application: Extracting names and URLs from an HTML page

Preview: Where we are going

Because I like to see where the finish line is before starting, I will begin with a view of what we are trying to create. We are attempting to go from a search results page where the HTML looks like this:
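(The image that belongs here is missing from this copy. In outline, each member appears as a table row whose name cell contains a link; a simplified, illustrative fragment rather than the literal markup:)

    <table>
      <tr>
        <td><a href="http://bioguide.congress.gov/...">LASTNAME, Firstname</a></td>
        <td>1820-1890</td>
        <td>Representative</td>
        ...
      </tr>
    </table>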

to a CSV file with names and urls that looks like this:
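(Again, the image is missing; these are illustrative placeholder rows, not the real output:)

    Name,Link
    "LASTNAME, Firstname",http://bioguide.congress.gov/...
    "LASTNAME, Firstname",http://bioguide.congress.gov/...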

using a Python script like this:
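(The script below is reconstructed from the steps described in the rest of this section; the filenames and the lxml parser choice follow the examples used later, so treat it as a sketch rather than the author’s exact code.)

    from bs4 import BeautifulSoup
    import csv

    # Open the downloaded page and parse it
    soup = BeautifulSoup(open("43rd-congress.html"), features="lxml")

    # Remove the extra search link that sits in a <p> tag outside the table
    final_link = soup.p.a
    final_link.decompose()

    # Create the CSV file and write the column headers
    f = csv.writer(open("43rd_Congress.csv", "w"))
    f.writerow(["Name", "Link"])

    # Write one row per <a> tag: the link text and the href value
    links = soup.find_all("a")
    for link in links:
        names = link.contents[0]
        fullLink = link.get("href")
        f.writerow([names, fullLink])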

This tutorial explains how to assemble the final code.

Get a webpage to scrape

The first step is getting a copy of the HTML page(s) you want to scrape. You can combine Beautiful Soup with urllib3 to work directly with pages on the web. This tutorial, however, focuses on using Beautiful Soup with local (downloaded) copies of HTML files.

The Congressional database that we’re using is not an easy one to scrape because the URL for the search results remains the same regardless of what you’re searching for. While this can be bypassed programmatically, it is easier for our purposes to go to http://bioguide.congress.gov/biosearch/biosearch.asp, search for Congress number 43, and save a copy of the results page.

Selecting “File” and “Save Page As …” from your browser window will accomplish this (life will be easier if you avoid using spaces in your filename). I have used “43rd-congress.html”. Move the file into the folder you want to work in.

(To learn how to automate the downloading of HTML pages using Python, see Automated Downloading with Wget and Downloading Multiple Records Using Query Strings.)

Identify content

One of the first things Beautiful Soup can help us with is locating content that is buried within the HTML structure. Beautiful Soup allows you to select content based upon tags (example: soup.body.p.b finds the first bold item inside a paragraph tag inside the body tag in the document). To get a good view of how the tags are nested in the document, we can use the method “prettify” on our soup object.

Create a new text file called “soupexample.py” in the same location as your downloaded HTML file. This file will contain the Python script that we will be developing over the course of the tutorial.

To begin, import the Beautiful Soup library, open the HTML file and pass it to Beautiful Soup, and then print the “pretty” version in the terminal.
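A minimal version of that script, assuming the lxml parser installed earlier:

    from bs4 import BeautifulSoup

    # Parse the downloaded HTML file
    soup = BeautifulSoup(open("43rd-congress.html"), features="lxml")

    # Print an indented, human-readable version of the document
    print(soup.prettify())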

Save “soupexample.py” in the folder with your HTML file and go to the command line. Navigate (use ‘cd’) to the folder you’re working in and execute the following:
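    python soupexample.py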

You should see your terminal window fill up with a nicely indented version of the original HTML text (see Figure 3). This is a visual representation of how the various tags relate to one another.

Using BeautifulSoup to select particular content

Remember that we are interested in only the names and URLs of the various members of the 43rd Congress. Looking at the “pretty” version of the file, the first thing to notice is that the data we want is not too deeply embedded in the HTML structure.

Both the names and the URLs are, most fortunately, embedded in “<a>” tags. So, we need to isolate out all of the “<a>” tags. We can do this by updating the code in “soupexample.py” to the following:
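A sketch of the updated script:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(open("43rd-congress.html"), features="lxml")

    # print(soup.prettify())

    # Find every anchor tag in the document and print it
    links = soup.find_all("a")
    for link in links:
        print(link)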


Note that we added a “#” to the beginning of the print(soup.prettify()) line. The hash or pound sign “comments out” the code, or turns a line of code into a comment. This tells the computer to skip over the line when executing the program. Commenting out code that is no longer in use is one way to keep track of what we have done in the past.

Save and run the script again to see all of the anchor tags in the document.

One thing to notice is that there is an additional link in our file – the link for an additional search.

We can get rid of this with just a few lines of code. Going back to the pretty version, notice that this last “<a>” tag is not within the table but is within a “<p>” tag.

Because Beautiful Soup allows us to modify the HTML, we can remove the “<a>” that is under the “<p>” before searching for all the “<a>” tags.

To do this, we can use the “decompose” method, which removes the specified content from the “soup”. Do be careful when using “decompose”: you are deleting both the HTML tag and all of the data inside of that tag. If you have not correctly isolated the data, you may be deleting information that you wanted to extract. Update the file as below and run again.
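A sketch of the updated script, assuming the stray link is the first anchor inside the first paragraph tag:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(open("43rd-congress.html"), features="lxml")

    # Remove the search link that lives in a <p> tag outside the table
    final_link = soup.p.a
    final_link.decompose()

    links = soup.find_all("a")
    for link in links:
        print(link)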

Success! We have isolated out all of the links we want and none of the links we don’t!

Stripping Tags and Writing Content to a CSV file

But, we are not done yet! There are still HTML tags surrounding the URL data that we want. And we need to save the data into a file in order to use it for other projects.

In order to clean up the HTML tags and split the URLs from the names, we need to isolate the information from the anchor tags. To do this, we will use two powerful, and commonly used, Beautiful Soup methods: contents and get.

Where before we told the computer to print each link, we now want the computer to separate the link into its parts and print those separately. For the names, we can use link.contents. The “contents” method isolates out the text from within HTML tags. For example, if you started with
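    <h2>This is my Header text</h2>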

you would be left with “This is my Header text” after applying the contents method. In this case, we want the contents inside the first tag in “link”. (There is only one tag in “link”, but since the computer doesn’t realize that, we must tell it to use the first tag.)

For the URL, however, “contents” does not work because the URL is part of the HTML tag. Instead, we will use “get”, which allows us to pull the text associated with (on the other side of the “=” of) the “href” element.
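Putting both together, a sketch of the updated loop:

    links = soup.find_all("a")
    for link in links:
        names = link.contents[0]      # the text inside the <a> tag
        fullLink = link.get("href")   # the URL stored in the href attribute
        print(names)
        print(fullLink)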


Finally, we want to use the CSV library to write the file. First, we need to import the CSV library into the script with “import csv”. Next, we create the new CSV file when we “open” it using “csv.writer”. The “w” tells the computer to “write” to the file. And to keep everything organized, let’s write some column headers. Finally, as each line is processed, the name and URL information is written to our CSV file.
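A sketch of the full script at this stage (the output filename is an assumption):

    from bs4 import BeautifulSoup
    import csv

    soup = BeautifulSoup(open("43rd-congress.html"), features="lxml")

    final_link = soup.p.a
    final_link.decompose()

    # Open the output file for writing and add column headers
    f = csv.writer(open("43rd_Congress.csv", "w"))
    f.writerow(["Name", "Link"])

    links = soup.find_all("a")
    for link in links:
        names = link.contents[0]
        fullLink = link.get("href")
        f.writerow([names, fullLink])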


When executed, this gives us a clean CSV file that we can then use for other purposes.

We have solved our puzzle and have extracted names and URLs from the HTML file.

But wait! What if I want ALL of the data?

Let’s extend our project to capture all of the data from the webpage. We know all of our data can be found inside a table, so let’s use “<tr>” to isolate the content that we want.
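A sketch of the change, searching for table rows instead of anchors:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(open("43rd-congress.html"), features="lxml")

    final_link = soup.p.a
    final_link.decompose()

    # Find every table row and print it
    trs = soup.find_all("tr")
    for tr in trs:
        print(tr)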

Looking at the printout in the terminal, you can see we have selected a lot more content than when we searched for “<a>” tags. Now we need to sort through all of these lines to separate out the different types of data.

Extracting the Data

We can extract the data in two moves. First, we will isolate the link information; then, we will parse the rest of the table row data.

For the first, let’s create a loop to search for all of the anchor tags and “get” the data associated with “href”.
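Inside the row loop, that looks something like:

    for tr in trs:
        # Pull the URL from any anchor tag within this row
        for link in tr.find_all("a"):
            fulllink = link.get("href")
            print(fulllink)  # print to verify the results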


We then need to run a search for the table data within the table rows. (The “print” here allows us to verify that the code is working but is not necessary.)
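Extending the loop:

    for tr in trs:
        for link in tr.find_all("a"):
            fulllink = link.get("href")
            print(fulllink)

        # Collect the table data cells for this row
        tds = tr.find_all("td")
        print(tds)  # verification only; not needed in the final script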

Next, we need to extract the data we want. We know that everything we want for our CSV file lives within table data (“td”) tags. We also know that these items appear in the same order within the row. Because we are dealing with lists, we can identify information by its position within the list. This means that the first data item in the row is identified by [0], the second by [1], etc.


Because not all of the rows contain the same number of data items, we need to build in a way to tell the script to move on if it encounters an error. This is the logic of the “try” and “except” block. If a particular line fails, the script will continue on to the next line.
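A sketch of the row-parsing block; the variable names other than “years” are assumptions based on the columns in the results table:

    for tr in trs:
        for link in tr.find_all("a"):
            fulllink = link.get("href")

        tds = tr.find_all("td")
        try:
            names = str(tds[0].get_text())      # member name
            years = str(tds[1].get_text())      # birth-death years
            positions = str(tds[2].get_text())  # position held (assumed column)
            parties = str(tds[3].get_text())    # party affiliation (assumed column)
            states = str(tds[4].get_text())     # state represented (assumed column)
            congress = tds[5].get_text()        # Congress number (assumed column)
        except IndexError:
            # Header rows and malformed rows lack these cells; skip them
            continue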

Within this we are using the following structure:
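    years = str(tds[1].get_text())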

We are applying the “get_text” method to the 2nd element in the row (because computers count beginning with 0) and creating a string from the result. This we assign to the variable “years”, which we will use to create the CSV file. We repeat this for every item in the table that we want to capture in our file.

Writing the CSV file

The last step in this file is to create the CSV file. Here we are using the same process as we did in Part I, just with more variables.
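A sketch of the complete script; the output filename and the extra column variables are assumptions consistent with the steps above:

    from bs4 import BeautifulSoup
    import csv

    soup = BeautifulSoup(open("43rd-congress.html"), features="lxml")

    final_link = soup.p.a
    final_link.decompose()

    # Open the output file and write the column headers
    f = csv.writer(open("43rd_Congress_all.csv", "w"))
    f.writerow(["Name", "Years", "Position", "Party", "State", "Congress", "Link"])

    trs = soup.find_all("tr")
    for tr in trs:
        for link in tr.find_all("a"):
            fulllink = link.get("href")

        tds = tr.find_all("td")
        try:
            names = str(tds[0].get_text())
            years = str(tds[1].get_text())
            positions = str(tds[2].get_text())
            parties = str(tds[3].get_text())
            states = str(tds[4].get_text())
            congress = tds[5].get_text()
        except IndexError:
            continue

        # Write one complete row per member
        f.writerow([names, years, positions, parties, states, congress, fulllink])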

As a result, our file will look like:
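(Placeholder rows; the real file will hold one row per member of the 43rd Congress:)

    Name,Years,Position,Party,State,Congress,Link
    "LASTNAME, Firstname",1820-1890,Representative,Republican,OH,43(1873-1875),http://bioguide.congress.gov/...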

You’ve done it! You have created a CSV file from all of the data in the table, creating useful data from the confusion of the HTML page.