Due date: Tuesday, June 9
A sample repo showing how things should look can be found here:
(It comes with free answers, so it's worth checking out)
- In your `compjour-hw` repo, make a folder named `search-script-scrape`
- For each problem, create a new Python script named after the number of the problem
```
compjour-hw/
|_____ search-script-scrape/
       |______ 1.py
       |______ 2.py
```
Or, if you created a separate repo for it:
```
search-script-scrape/
|______ 1.py
|______ 2.py
```
Each script should have a print() statement that outputs the answer.
You only have to do 50 of these exercises.
Go ahead and copy the spreadsheet and start making notes on each of the tasks.
Search-Script-Scrape is an ongoing homework project, in which students will have to write 101 programs to scrape 101 different bits of data from government sites.
The programs themselves are relatively simple; each is fewer than 10 lines, and none requires even a `for`-loop to complete. So the ulterior purposes of Search-Script-Scrape are:
- Make the concept of web-scraping as ordinary as possible.
- Serve as a real-world basic Python exercise.
- Showcase the patchwork world of government institutions and the messy nature of data in general.
A Google Spreadsheet of the tasks can be found here.
Make your own spreadsheet
Students will be expected to copy this spreadsheet and use it as the baseline for a sort of "data-gathering diary". In the first weeks of class, many of the scraping tasks may seem impossible. That's OK; triage the ones you can do now, and keep track of the ones you still need to learn how to do.
SSS is less an exercise about programming and more about how to strategically break down an otherwise overwhelming task.
Problem 1 is: Number of datasets currently listed on data.gov
That number is displayed at the top of http://www.data.gov/. You can use the Chrome DevTools to find the exact path to that element.
Sample Python 2.x/3.x script, using the requests and lxml libraries:
Note: I previously showed an example with the BeautifulSoup library, but BeautifulSoup seems to be malfunctioning under Anaconda. An alternative is the lxml library. Here's an example:
```python
from lxml import html
import requests

response = requests.get('http://www.data.gov/')
doc = html.fromstring(response.text)
# cssselect() returns a list; the dataset count is in the first <small><a> element
link = doc.cssselect('small a')[0]
print(link.text)
```
This is that same scrape with BeautifulSoup:
Note: Do NOT use BeautifulSoup if you can help it. I've left this up as an example, but use lxml, as shown in Problem #30 below.
```python
import bs4
import requests

response = requests.get('http://www.data.gov/')
soup = bs4.BeautifulSoup(response.text, 'html.parser')
# select() returns a list; grab the first matching element
link = soup.select("small a")[0]
print(link.text)
# Reminder: do NOT use bs4; use lxml, as in the example below
```
Here's another example with lxml (Problem #30):
- The total number of inmates executed by Florida since 1976
```python
import requests
from lxml import html

response = requests.get('http://www.dc.state.fl.us/oth/deathrow/execlist.html')
doc = html.fromstring(response.text)
# the last row of the execution list is the most recent execution;
# its first cell holds the running count
last_row = doc.cssselect('tr')[-1]
td = last_row.cssselect('td')[0]
print(td.text)
```
The lxml documentation on HTML parsing can be found here. You can find more information about HTML scraping via The Hitchhiker's Guide to Python chapter on HTML scraping (docs.python-guide.org). Remember that lxml only applies to HTML (and XML); use the json library for JSON files, the csv library for CSV files, etc.
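For instance, here's a minimal sketch of the JSON case. The URL and the 'count' key below are made up for illustration; substitute the real endpoint and keys for your task:

```python
import json
import requests

# Hypothetical endpoint and key name, just to show the json workflow
response = requests.get('https://api.example.gov/datasets.json')
data = json.loads(response.text)  # requests also offers response.json()
print(data['count'])
```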
The Web Inspector is immensely useful for finding out how to target HTML elements.
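In fact, Chrome's inspector will copy an element's XPath for you (right-click the element and choose "Copy XPath"), and lxml can evaluate that path directly. A quick sketch, reusing the data.gov example; the path your browser copies may look longer and more specific than the one below:

```python
from lxml import html
import requests

response = requests.get('http://www.data.gov/')
doc = html.fromstring(response.text)
# '//small/a' is one XPath that matches the dataset-count link;
# xpath() returns a list, like cssselect()
link = doc.xpath('//small/a')[0]
print(link.text)
```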
Speaking of CSV, here's an example of how to use that library. You'll find the `csv.DictReader` class, which turns each row of a CSV file into a dictionary, most helpful:
- The number of women currently serving in the U.S. Congress, according to Sunlight Foundation data
```python
import csv
import requests

CSVURL = 'http://unitedstates.sunlightfoundation.com/legislators/legislators.csv'
response = requests.get(CSVURL)
# save the downloaded text to disk...
f = open("legislators.csv", "w")
f.write(response.text)
f.close()
# ...then re-open legislators.csv so csv can read it
data = csv.DictReader(open("legislators.csv"))
rows = list(data)
print(len([i for i in rows if i['gender'] == 'F' and i['in_office'] == '1']))
```
Note: The csv library won't just let us turn a string into a data structure. So in the example above, I save the file to disk, then re-open it with `csv.DictReader(open(fname))`. There's a more graceful way to do this with `io.StringIO`, but the effect is the same:
```python
import csv
import requests
from io import StringIO

CSVURL = 'http://unitedstates.sunlightfoundation.com/legislators/legislators.csv'
response = requests.get(CSVURL)
# StringIO wraps the string in a file-like object, so no temp file is needed
data = csv.DictReader(StringIO(response.text))
rows = list(data)
print(len([i for i in rows if i['gender'] == 'F' and i['in_office'] == '1']))
```
Of course, you can always use pandas, which has a built-in CSV reading function, but loading pandas is kind of overkill for this:
```python
import pandas as pd

CSVURL = 'http://unitedstates.sunlightfoundation.com/legislators/legislators.csv'
# read_csv accepts a URL directly
df = pd.read_csv(CSVURL)
# pandas parses in_office as an integer, hence == 1 rather than == '1'
print(len(df[(df['gender'] == 'F') & (df['in_office'] == 1)]))
```