Exploring the Wikipedia JSON API

Note: As you can tell, I drafted this page but didn't finish the formatting. What can I say, the MediaWiki API documentation can be pretty dense. However, I have the two related pages in working order:

The MediaWiki API Docs

The documentation for the MediaWiki API (MediaWiki is the software Wikipedia runs on) is here: http://www.mediawiki.org/wiki/API:Main_page. The following is an abbreviated introduction with Python examples.

Our objective is to get data about any given Wikipedia page, such as Stanford University, in a more machine-readable format than HTML. The things we want to know: what's on the page, how big it is, and who has been editing it, when, and why.

The Query Module

The MediaWiki API has many functions and endpoints beyond reading data; for example, many Wikipedia bots use it to automate article edits. But we only care about reading, that is, getting the data. For that, there's the Query Module, which "allows you to get most of the data stored in a wiki, such as the wikitext of a particular article."

Sidenote: Here's a fun example of a semi-autonomous bot, in which the owner queries for a particular phrase and then has to hand-edit it himself: One Man’s Quest to Rid Wikipedia of Exactly One Grammatical Mistake

The Query module itself has an extensive list of submodules. For now, we care about the property-type queries, which allow us to "get various data about a list of pages specified with either the titles=, pageids=, or revids= parameters."

And here are the kinds of property queries that are interesting to us right now:

  • info - basic information about a page, including its unique ID, canonical title, last revision, size, and number of "watchers."
  • revisions - a list of changes/edits made to a page.

Testing out the API via Browser

http://en.wikipedia.org/w/api.php?action=query&prop=info&titles=Stanford_University

Getting the raw data:

http://en.wikipedia.org/w/api.php?action=query&prop=info&format=json&titles=Stanford_University
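The same request can be assembled in Python. The following is a minimal sketch using only the standard library; the helper names build_api_url and fetch_json are made up for this page, and the User-Agent value is a placeholder assumption rather than anything the API documentation prescribes. The network call is left commented out so the snippet runs offline.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_ENDPOINT = "http://en.wikipedia.org/w/api.php"


def build_api_url(params):
    """Return the full API URL for a dict of query parameters."""
    return API_ENDPOINT + "?" + urlencode(params)


def fetch_json(url):
    """Download the URL and parse the response body as JSON."""
    # Placeholder User-Agent (an assumption); replace with your own identifier.
    request = Request(url, headers={"User-Agent": "wiki-api-tutorial-sketch/0.1"})
    with urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


info_url = build_api_url({
    "action": "query",
    "prop": "info",
    "format": "json",
    "titles": "Stanford_University",
})
print(info_url)
# data = fetch_json(info_url)  # uncomment to make the actual network request
```

Building the URL from a dict keeps each parameter visible on its own line and lets urlencode handle any characters that need escaping.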


The components of an HTTP call:

The base endpoint

 http://en.wikipedia.org/w/api.php?

The parameters

These are required for the JSON info query above:

  • action=query
  • prop=info
  • format=json

These are conditional (the pages can be specified with titles=, pageids=, or revids=):

titles=Stanford_University

Getting more than one title

titles=Stanford_University|Harvard_University
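In Python, a list of titles can be joined with the pipe character before it goes into the query string. This is a small sketch with the standard library; the variable names are illustrative only. Note that urlencode percent-encodes the pipe as %7C, which is the escaped form of the same character.

```python
from urllib.parse import urlencode

titles = ["Stanford_University", "Harvard_University"]
params = {
    "action": "query",
    "prop": "info",
    "format": "json",
    # Multiple values for one parameter are separated by "|".
    "titles": "|".join(titles),
}
query_string = urlencode(params)
print(query_string)
```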

Getting extra attributes with inprop

  • protection: The protection level
  • watchers: The number of watchers

Multiple values are separated with the pipe character:

http://en.wikipedia.org/w/api.php?action=query&prop=info&titles=Stanford_University&inprop=protection|watchers

Extracts

prop=extracts


The revisions property

The revision history for the Stanford University Wikipedia page, in reverse chronological order:

http://en.wikipedia.org/w/index.php?title=Stanford_University&action=history

To see the last 500:

http://en.wikipedia.org/w/index.php?title=Stanford_University&offset=&limit=500&action=history

As JSON

Setting the prop attribute to revisions:

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Stanford%20University

{
    "query": {
        "pages": {
            "26977": {
                "pageid": 26977,
                "ns": 0,
                "title": "Stanford University",
                "revisions": [
                    {
                        "revid": 657035496,
                        "parentid": 656920908,
                        "user": "Kuru",
                        "timestamp": "2015-04-18T13:33:37Z",
                        "comment": "rmv non-[[WP:RS]]"
                    }
                ]
            }
        }
    }
}
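The nesting in this response takes a moment to get used to: the pages object is keyed by page ID, and each page carries a list of revisions. Here is a minimal sketch of walking that structure in Python, using the sample response above pasted in as a string (no network request is made; the variable names are illustrative).

```python
import json

sample_response = """
{
    "query": {
        "pages": {
            "26977": {
                "pageid": 26977,
                "ns": 0,
                "title": "Stanford University",
                "revisions": [
                    {
                        "revid": 657035496,
                        "parentid": 656920908,
                        "user": "Kuru",
                        "timestamp": "2015-04-18T13:33:37Z",
                        "comment": "rmv non-[[WP:RS]]"
                    }
                ]
            }
        }
    }
}
"""

data = json.loads(sample_response)

rows = []
# "pages" is a dict keyed by page ID, so iterate over its values.
for page in data["query"]["pages"].values():
    for rev in page["revisions"]:
        rows.append((page["title"], rev["revid"], rev["user"], rev["timestamp"]))

for row in rows:
    print(row)
```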

Let's add a few things: we want size, tags, flags, and comment:

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Stanford%20University&rvprop=timestamp|user|size|tags|flags|comment|ids

{
    "query": {
        "pages": {
            "26977": {
                "pageid": 26977,
                "ns": 0,
                "title": "Stanford University",
                "revisions": [
                    {
                        "revid": 657035496,
                        "parentid": 656920908,
                        "user": "Kuru",
                        "timestamp": "2015-04-18T13:33:37Z",
                        "size": 155502,
                        "comment": "rmv non-[[WP:RS]]",
                        "tags": []
                    }
                ]
            }
        }
    }
}
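Once a revision is in hand, its fields are plain JSON values: the timestamp is a string and the size is a number of bytes. As a small sketch, here is one way to turn the timestamp from the sample above into a Python datetime; the format string is an assumption based on how the timestamp looks in this response.

```python
from datetime import datetime, timezone

# One revision copied from the sample response above.
revision = {
    "revid": 657035496,
    "parentid": 656920908,
    "user": "Kuru",
    "timestamp": "2015-04-18T13:33:37Z",
    "size": 155502,
    "comment": "rmv non-[[WP:RS]]",
    "tags": [],
}

# The trailing "Z" marks UTC, so attach the UTC timezone after parsing.
edited_at = datetime.strptime(
    revision["timestamp"], "%Y-%m-%dT%H:%M:%SZ"
).replace(tzinfo=timezone.utc)

size_kb = revision["size"] / 1024

print(edited_at.isoformat(), round(size_kb, 1))
```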

To get more than one revision, we use rvlimit:

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Stanford%20University&rvprop=timestamp|user|size|tags|flags|comment|ids&rvlimit=5

Pagination

To get the next batch of revisions, pass a continuation value as rvcontinue:

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Stanford%20University&rvprop=timestamp|user|size|tags|flags|parsedcomment|ids&rvlimit=5&rvcontinue=656306460
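Following the pagination by hand gets tedious, so here is a rough sketch of a loop that does it. It assumes each response that has more results includes a "continue" object whose values (such as rvcontinue) are merged into the next request; check the API documentation for the exact shape your wiki returns. The names iter_revisions and fake_fetch are hypothetical, and fake_fetch serves made-up batches in place of a real HTTP call so the example runs offline.

```python
def iter_revisions(fetch, params):
    """Yield revision dicts across batches, following 'continue' values.

    fetch is any callable that takes a params dict and returns the
    parsed JSON response as a dict.
    """
    request_params = dict(params)
    while True:
        data = fetch(request_params)
        for page in data.get("query", {}).get("pages", {}).values():
            for rev in page.get("revisions", []):
                yield rev
        if "continue" not in data:
            break
        # Assumed behavior: merge continuation values into the next request.
        request_params = dict(params)
        request_params.update(data["continue"])


def fake_fetch(request_params):
    """Stand-in for a real HTTP call: serves two hard-coded batches."""
    if "rvcontinue" not in request_params:
        return {
            "continue": {"rvcontinue": "656920908"},
            "query": {"pages": {"26977": {"revisions": [{"revid": 657035496}]}}},
        }
    return {"query": {"pages": {"26977": {"revisions": [{"revid": 656920908}]}}}}


base_params = {
    "action": "query",
    "prop": "revisions",
    "format": "json",
    "titles": "Stanford University",
    "rvlimit": 1,
}
revids = [rev["revid"] for rev in iter_revisions(fake_fetch, base_params)]
print(revids)
```

To use this against the real API, swap fake_fetch for a function that builds the URL from the params dict, downloads it, and returns the parsed JSON.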

Getting the contents of the diff

http://en.wikipedia.org/w/api.php?action=query&rvcontentformat=text/plain&prop=revisions&rvdiffto=prev&titles=Stanford%20University&rvprop=timestamp|user|size|tags|flags|parsedcomment|ids&offset=12&rvlimit=1&rvcontinue=656306460