An introduction to data serialization and Python Requests

This is a hybrid primer that covers:

Basic usage of the Python Requests package to download files from the web and, in the case of JSON text files, decode them into Python data structures.
Why we serialize data as JSON text files in the first place.

The Python Requests package
Working with JSON, the old-fashioned way
Questions and caveats

The Python Requests package

Given that contacting the Web is such a frequent task for programs, one of the most popular Python packages is the Requests library, which bills itself as "an elegant and simple HTTP library for Python, built for human beings."

Describing the Hypertext Transfer Protocol (HTTP) is beyond the scope of this article. For now, we only care about one HTTP method: GET, which is what a web browser typically uses to download webpages, including Wikipedia pages.

A standard GET call in Requests

This is the code needed to download a page with Requests:

# Bring in the library
import requests
# make the GET call
resp = requests.get("http://www.example.com")

The Response object

The result of the get method (i.e. the web server's response, which we assigned to the resp variable above) is a Response object. You can print out its type to verify this:

print(type(resp))  # => <class 'requests.models.Response'>

According to the Requests documentation, we can access the content of the response (i.e. the raw HTML in this example) with the text attribute:

txt = resp.text
# print the size of the webpage's content:
print(len(txt))  # => 1270

There are other useful attributes, too:

print(resp.ok)           # => True
print(resp.status_code)  # => 200
print(resp.headers['content-type'])   # => text/html

The headers attribute contains the response headers as a Python dictionary. The full object looks like this:

{
  "content-length": "1270",
  "content-type": "text/html",
  "etag": '"359670651"',
  "cache-control": "max-age=604800",
  "server": "ECS (cpm/F9D5)",
  "date": "Mon, 20 Apr 2015 12:16:24 GMT",
  "x-cache": "HIT",
  "x-ec-custom-error": "1",
  "accept-ranges": "bytes",
  "last-modified": "Fri, 09 Aug 2013 23:54:35 GMT",
  "expires": "Mon, 27 Apr 2015 12:16:24 GMT"
}
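One detail that is easy to miss: every header value arrives as a string, even when it looks like a number or a date. The sketch below copies a few values from the headers shown above into an ordinary dictionary, so that it runs without a network connection (a real resp.headers also ignores the case of its keys). It then converts them using only the standard library. The variable names are made up for illustration.

```python
from email.utils import parsedate_to_datetime

# A few header values copied from the example response above.
headers = {
    "content-length": "1270",
    "cache-control": "max-age=604800",
    "date": "Mon, 20 Apr 2015 12:16:24 GMT",
    "expires": "Mon, 27 Apr 2015 12:16:24 GMT",
}

# Header values are strings, so convert before doing arithmetic.
content_length = int(headers["content-length"])

# HTTP dates can be parsed into timezone-aware datetime objects.
served_at = parsedate_to_datetime(headers["date"])
expires_at = parsedate_to_datetime(headers["expires"])
cache_seconds = int((expires_at - served_at).total_seconds())

print(content_length)  # => 1270
print(cache_seconds)   # => 604800, matching the max-age value
```

The seven days between the date and expires headers work out to exactly the 604800 seconds advertised in cache-control.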

Working with JSON and Requests

The Requests Response object has a convenience method named json, which decodes (i.e. converts) the text of a JSON file into a Python data structure (e.g. a dictionary).

Visit the URL of https://status.github.com/api/status.json to see the JSON used by Github to represent its current system status:

{
  "status": "good",
  "last_updated": "2015-04-20T11:50:51Z"
}

Accessing that with Requests and its json method:

resp = requests.get("https://status.github.com/api/status.json")
x = resp.json()
print("Github's status is currently:", x['status'], 'as of:', x['last_updated'])
# Github's status is currently: good as of: 2015-04-20T11:54:13Z

Working with JSON, the old-fashioned way

The json decoding method of the Requests Response object is very handy. But if you're still new to Python data structures, it's important to understand the "magic" under the hood and, more importantly, the fundamental concept of serializing data objects: taking data from inside your program and converting it into text that can be saved to a file and read in by other programs.

The main takeaway is: JSON files are just text. The Response's json method does the work of converting that text into a data object (i.e. a dictionary) that can be used inside a Python program. But no matter if you're accessing status.github.com/api/status.json from the browser, the Requests library, or any other URL-fetching code, its content is just text.
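To make the "just text" point concrete, here is a small sketch that needs no network access. The string raw_text is a hand-typed stand-in for what resp.text would hold; it is an assumption for illustration, not a live response. The decoding step uses the standard library's json module, which is covered in more detail further below.

```python
import json

# Hand-typed stand-in for resp.text (not fetched from the live API).
raw_text = '{"status": "good", "last_updated": "2015-04-20T11:50:51Z"}'

print(type(raw_text))  # => <class 'str'>
print(raw_text[0])     # => {   (just the first character of the text)

# Text cannot be looked up by key; it has to be decoded first.
try:
    raw_text["status"]
except TypeError:
    print("A string has no keys")

decoded = json.loads(raw_text)
print(type(decoded))      # => <class 'dict'>
print(decoded["status"])  # => good
```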

Why we serialize

When we visit the Github Site Status API URL and receive a JSON text file, we can surmise that the Github webserver has some kind of program that assesses Github's current "state" and, when things are "good", returns the value of "good" along with the exact time ("2015-04-20T11:54:13Z") of the check.

There's no reason why Github, instead of this dry JSON file:

{"status":"good","last_updated":"2015-04-20T11:50:51Z"}

couldn't produce a nice, to-the-point page in the flavor of "Is the L Train F**cked?".

In fact, Github has a nice status page at status.github.com, and the data we see there originates from the Github Site Status API.

But imagine trying to write a program that reads that pretty status page and simply tries to find out whether things are "f--ked" or not. Pretend that parsing HTML was something you just knew how to do: what would your code look for? The phrase "All systems operational"? Or text in a light-green color? Either way, each of those is harder to write a program for. Interpreting the JSON format, by contrast, is easy:

resp = requests.get("https://status.github.com/api/status.json")
x = resp.json()
print("As of:", x['last_updated'])
if x['status'] == 'good':
    print("Github is not f--ked")
else:
    print("Github may be f--ked")

By making an API that responds in JSON, Github makes it easy for their own web developers to create a status webpage and, more importantly, makes it easy for anyone else to make their own status webpage, including an "Is Github F--ked?" page, if they so wished.

Serializing/Decoding JSON in Python

Let's revisit the json method of the Requests Response object:

import requests
resp = requests.get("https://status.github.com/api/status.json")
obj = resp.json()
print(type(obj))
# <class 'dict'>
print(obj)
# {'last_updated': '2015-04-20T12:36:05Z', 'status': 'good'}

The standard Python library contains its own json module, with a loads function (short for "load string") that does the same decoding:

import requests
import json
resp = requests.get("https://status.github.com/api/status.json")
# --- the next two lines do roughly what resp.json() does for us ---
txt = resp.text
obj = json.loads(txt)
# --- end of the resp.json() equivalent ---
print(type(obj))
# <class 'dict'>
print(obj)
# {'last_updated': '2015-04-20T12:36:05Z', 'status': 'good'}

Nothing complicated here, but it's worth reiterating that converting JSON text to Python data objects is not something exclusive to the Requests library. Indeed, the reason why JSON is so popular is that virtually every modern programming language has a standard way to serialize and decode it.
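As a rough illustration of what that "standard way" looks like in Python's case, the sketch below decodes a made-up JSON document and prints the Python type that each JSON value becomes. Keys such as checks, uptime and regions are invented for this example and are not part of Github's API.

```python
import json
from datetime import datetime, timezone

# Made-up JSON text that exercises each basic JSON type.
text = '''
{
  "status": "good",
  "checks": 3,
  "uptime": 99.9,
  "degraded": false,
  "message": null,
  "regions": ["us", "eu"],
  "last_updated": "2015-04-20T11:50:51Z"
}
'''

obj = json.loads(text)
for key, value in obj.items():
    print(key, "->", type(value).__name__)
# status -> str
# checks -> int
# uptime -> float
# degraded -> bool
# message -> NoneType
# regions -> list
# last_updated -> str

# JSON has no date type, so the timestamp is still a string and needs
# one more parsing step before we can compare or subtract times.
stamp = datetime.strptime(obj["last_updated"], "%Y-%m-%dT%H:%M:%SZ")
stamp = stamp.replace(tzinfo=timezone.utc)
print(stamp.year, stamp.hour)  # => 2015 11
```

Note that the timestamp survives the trip only as text: deciding that a string "means" a date is left to the program that reads the JSON.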

Serializing objects into JSON

A quick sidetrack: So we know how to turn JSON text into Python data objects. But how do we turn data objects into JSON text (maybe at some point, we want to create an API of our own that responds with JSON)?

This is not something that the Requests library has to explicitly care about, since it only cares about requesting data from websites (although we'll see later one example of how Requests serializes a Python dictionary into something that is not JSON).

But here's how to serialize a Python data object as JSON, using Python's json module and its dumps method:

import json
mydata = {"name": "Apple", "quantity": 42, "date": "2014-02-27"}
serializeddata = json.dumps(mydata)
# to write to a text file named "whatever"
# (the with block closes the file for us, even if the write fails):
with open("whatever", "w") as f:
    f.write(serializeddata)
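To check that nothing is lost along the way, here is a minimal round-trip sketch: serialize the same dictionary, write it to a file, read it back, and compare. It writes to a temporary directory rather than the current folder, and uses json.dump and json.load, the file-oriented siblings of dumps and loads. The file name whatever.json is just a placeholder.

```python
import json
import os
import tempfile

mydata = {"name": "Apple", "quantity": 42, "date": "2014-02-27"}

# json.dumps returns a str; indent is optional and only affects layout.
serialized = json.dumps(mydata, indent=2)
print(type(serialized))  # => <class 'str'>

# Write to a throwaway file, then read it back.
path = os.path.join(tempfile.mkdtemp(), "whatever.json")
with open(path, "w") as f:
    json.dump(mydata, f)

with open(path) as f:
    restored = json.load(f)

print(restored == mydata)  # => True
print(restored is mydata)  # => False (an equal copy, not the same object)
```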

Questions and caveats

Before moving on, make sure you know offhand the answers to these questions. Or, at least know the motivation behind these questions (and how to find the answers):

Given this code:

import requests
url = 'https://status.github.com/api/status.json'
resp = requests.get(url)
o = resp.json()

What type of object does the resp variable contain?
How do we access the textual content of the response?
In this case, the textual content of the response is formatted as JSON. If you know that the data object it represents contains a last_updated key, how would you print the value of that key?
For the given URL and response, printing out its text attribute versus printing out the result of its json method (i.e. doing print(resp.text) and print(resp.json())) results in nearly the same output. Why is that?
If url pointed instead to http://www.example.com, the command resp.json() would result in an error. Why?

Attributes versus methods

Sidenote: You may have noticed in my notes above that, given a Requests Response object, I refer to it as having status_code and text attributes, and a method named json. What's the difference? Well, syntax-wise, note how methods require the inclusion of parentheses in order to get an actual result:

import requests
url = 'https://status.github.com/api/status.json'
resp = requests.get(url)
print(resp.status_code)
# => 200
print(resp.json())
# => {'status': 'good', 'last_updated': '2015-04-20T13:08:28Z'}
print(resp.json()['status'])
# => good

Forgetting the parentheses will result in this:

print(resp.json)
# <bound method Response.json of <Response [200]>>

Why are some things attributes (e.g. text, status_code) and other things methods? A fuller explanation would require getting into the principles of object-oriented programming and data design. For now, it's enough to think of a method as something that has to be invoked/executed, and that the syntax for doing so is to include the pair of parentheses right after the method name. Without the parentheses, the Python interpreter assumes you want to print the method object itself rather than the result of calling it.

And it kind of makes sense that json is something that has to be executed: when we looked at how to decode JSON using the Python standard library, we had to call the json.loads() function. Anyway, in lieu of going into greater detail here, just be able to recognize the syntactical difference between an attribute and a method.
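For readers who want to see the distinction without a network call, here is a toy sketch. FakeResponse is a hypothetical class written only for this illustration; it is not how the Requests library is actually implemented, but it has the same outward shape: two stored attributes and one method that does work when called.

```python
import json


class FakeResponse:
    """Hypothetical stand-in for a response object, for illustration only."""

    def __init__(self, status_code, text):
        # Plain attributes: values stored on the object, read without parentheses.
        self.status_code = status_code
        self.text = text

    def json(self):
        # A method: work that runs only when called with parentheses.
        return json.loads(self.text)


resp = FakeResponse(200, '{"status": "good"}')

print(resp.status_code)       # => 200
print(callable(resp.text))    # => False
print(callable(resp.json))    # => True
print(resp.json()["status"])  # => good
```

The built-in callable function is a quick way to check which kind of thing you are holding when the documentation is not at hand.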