Search-Script-Scrape

Due date: Tuesday, June 9

Sample Repo

A sample repo showing how things should look can be found here:

https://github.com/compjour/search-script-scrape

(It comes with free answers, so it's worth checking out.)

Submission

  1. In your compjour-hw repo, make a folder named search-script-scrape.
  2. For each problem, create a new Python script named after the number of the problem.

e.g.

compjour-hw/
  |_____ search-script-scrape/
      |______ 1.py
      |______ 2.py

Or, if you created a separate repo for it:

search-script-scrape/
      |______ 1.py
      |______ 2.py

Each script should have a print() statement that outputs the answer.
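
For example, here's a bare-bones sketch of what a script like 1.py could look like. The URL and the "computation" are just placeholders, not the answer to any actual problem; the freebies below show real versions:

import requests
# hypothetical URL -- each problem needs its own URL and extraction logic
response = requests.get('http://www.example.gov/somepage')
answer = len(response.text)   # placeholder computation
print(answer)                 # the required print() statement that outputs the answer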

You only have to do 50 of these exercises.

Go ahead and copy the spreadsheet and start making notes on each of the tasks.

Purpose

Search-Script-Scrape is an ongoing homework project, in which students will have to write 101 programs to scrape 101 different bits of data from government sites.

The programs themselves are relatively simple; each can be written in 5 to 10 lines or fewer, and none requires even a for-loop to complete. So the ulterior purposes of Search-Script-Scrape are:

  1. Make the concept of web-scraping as ordinary as possible.
  2. Serve as a real-world basic Python exercise.
  3. Showcase the patchwork world of government institutions and the messy nature of data in general.

A Google Spreadsheet of the tasks can be found here.

Make your own spreadsheet

Students will be expected to copy this spreadsheet and use it as a baseline for creating a sort of "data-gathering diary". In the first weeks of class, many of the scraping tasks may seem impossible. That's OK; triage the things you can do now, and keep track of the things you still need to learn how to do.

SSS is less an exercise in programming than in how to strategically break down an otherwise overwhelming task.

Guidelines

Freebies

Freebie #1

Problem 1 is: Number of datasets currently listed on data.gov

That number is displayed at the top of http://www.data.gov/. You can use the Chrome DevTools to find the exact path to that element.

Sample Python 2.x/3.x script, using the requests and lxml libraries:

Note: I previously showed an example with the BeautifulSoup library, but BeautifulSoup seems to be malfunctioning for Anaconda. An alternative is the lxml library. Here's an example:

from lxml import html
import requests
response = requests.get('http://www.data.gov/')
doc = html.fromstring(response.text)
link = doc.cssselect('small a')[0]
print(link.text)

This is that same scrape with BeautifulSoup:

Note: Do NOT use BeautifulSoup if you can help it. I've left this up as an example, but use lxml as shown above and in Freebie #30 below.

import bs4
import requests
response = requests.get('http://www.data.gov/')
soup = bs4.BeautifulSoup(response.text, 'html.parser')  # specify a parser to avoid bs4's warning
link = soup.select("small a")[0]
print(link.text)
# Reminder: Do NOT use bs4; use lxml, as in the example above and in Freebie #30 below

Freebie #30

Here's another example with lxml (Problem #30):

Problem 30 is: The total number of inmates executed by Florida since 1976

import requests
from lxml import html
response = requests.get('http://www.dc.state.fl.us/oth/deathrow/execlist.html')
doc = html.fromstring(response.text) 
last_row = doc.cssselect('tr')[-1]
td = last_row.cssselect('td')[0]
print(td.text)

The lxml documentation on HTML parsing can be found here. You can find more information about HTML scraping via The Hitchhiker's Guide to Python chapter on HTML scraping (docs.python-guide.org). Remember that lxml is for parsing HTML; use the json module for JSON files, the csv module for CSV files, and so on.
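
For instance, here's a rough sketch of handling a JSON response. The URL is just a placeholder, not part of any actual problem; any endpoint that returns JSON works the same way:

import json
import requests
# hypothetical JSON endpoint -- substitute the actual URL for your task
response = requests.get('http://www.example.gov/api/data.json')
records = json.loads(response.text)   # or equivalently: response.json()
print(len(records))                   # e.g. count how many records came back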

The Web Inspector is immensely useful for finding out how to target HTML elements.

Freebie #101

Speaking of CSV, here's an example of how to use that library. You'll find the csv.DictReader class, which turns each row of a CSV file into a dictionary, most helpful:

Problem 101 is: The number of women currently serving in the U.S. Congress, according to Sunlight Foundation data

import csv
import requests
CSVURL = 'http://unitedstates.sunlightfoundation.com/legislators/legislators.csv'
response = requests.get(CSVURL)
f = open("legislators.csv", "w")
f.write(response.text)
f.close()
# re-open legislators.csv
data = csv.DictReader(open("legislators.csv"))
rows = list(data)
print(len([i for i in rows if i['gender'] == 'F' and i['in_office'] == '1']))

Note: The csv library won't just let us turn a string into a data structure, so in the example above, I save the file to disk and re-open it with csv.DictReader(open(fname)). There's a more graceful way to do this with io.StringIO, but the effect is the same:

import csv
import requests
from io import StringIO
CSVURL = 'http://unitedstates.sunlightfoundation.com/legislators/legislators.csv'
response = requests.get(CSVURL)
data = csv.DictReader(StringIO(response.text))
rows = list(data)
print(len([i for i in rows if i['gender'] == 'F' and i['in_office'] == '1']))

Of course, you can always use pandas, which has a built-in CSV reading function, but loading pandas is kind of overkill for this:

import pandas as pd
CSVURL = 'http://unitedstates.sunlightfoundation.com/legislators/legislators.csv'
df = pd.read_csv(CSVURL)
print(len(df[(df['gender'] == 'F') & (df['in_office'] == 1)]))