- The Mediawiki API Docs
- The Query Module
- Testing out the API via Browser
- The components of an HTTP Call:
- The revisions property
Note: As you can tell, I drafted out this page, and didn't finish the formatting part. What can I say, the Mediawiki API documentation can be pretty dense. However, I have the two related pages in working order:
The Mediawiki API Docs
The documentation for the MediaWiki API (MediaWiki is the software that Wikipedia runs on) is here: http://www.mediawiki.org/wiki/API:Main_page. The following is an abbreviated introduction with Python examples.
Our objective is to get data about any given Wikipedia page, such as Stanford University, in a more machine-readable format than HTML. Things we are interested in knowing are: what's on the page, how big it is, and the who/what/when/where/why of users' edits to it.
The Query Module
The MediaWiki API has many functions and endpoints beyond just reading data; for example, many Wikipedia bots use it to automate article edits. But we only care about reading, or getting, the data. And for that, there's the Query Module, which "allows you to get most of the data stored in a wiki, such as the wikitext of a particular article."
Sidenote: Here's a fun example of a semi-autonomous bot, in which the owner queries for a particular phrase and then has to hand-edit it himself: One Man’s Quest to Rid Wikipedia of Exactly One Grammatical Mistake
The Query module itself has an extensive list of submodules. For now, we care about the property type of queries, which allow us to "get various data about a list of pages specified with either the titles=, pageids=, or revids= parameters."
And here are the kinds of property queries that are interesting to us right now:
- info - basic information about a page, including its unique ID, canonical title, last revision, size, and number of "watchers."
- revisions - a list of changes/edits made to a page.
Testing out the API via Browser
http://en.wikipedia.org/w/api.php?action=query&prop=info&titles=Stanford_University
Getting the raw data:
http://en.wikipedia.org/w/api.php?action=query&prop=info&format=json&titles=Stanford_University
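The same request can be made from Python instead of the browser. Here is a minimal sketch using only the standard library. The helper names `make_request` and `fetch_json` and the User-Agent string are placeholders of my own, and the sketch assumes the endpoint answers with a JSON body when format=json is set.

```python
import json
from urllib.request import Request, urlopen

# The "raw data" URL from above, unchanged.
INFO_URL = (
    "http://en.wikipedia.org/w/api.php"
    "?action=query&prop=info&format=json&titles=Stanford_University"
)


def make_request(url):
    """Wrap the URL in a Request that identifies the script to the server."""
    # Hypothetical User-Agent value; replace it with one describing your script.
    return Request(url, headers={"User-Agent": "mediawiki-api-tutorial/0.1"})


def fetch_json(url):
    """Download the URL and decode the JSON body (needs network access)."""
    with urlopen(make_request(url)) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not contact the server yet.
request = make_request(INFO_URL)
print(request.full_url)
```

Nothing is downloaded until `fetch_json(INFO_URL)` is actually called, at which point it should hand back a plain Python dictionary.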
The components of an HTTP Call:
The base endpoint
http://en.wikipedia.org/w/api.php?
The parameters
These are required for :
action=query
prop=info
format=json
These are conditional:
titles=Stanford_University
Getting more than one title
titles=Stanford_University|Harvard_University
Getting extra attributes with inprop
- protection: The protection level
- watchers: The number of watchers
http://en.wikipedia.org/w/api.php?action=query&prop=info&titles=Stanford_University&inprop=protection|watchers
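Rather than gluing these URLs together by hand, the base endpoint and the parameters can be assembled in Python. This is a rough sketch: `build_info_url` is a name I made up, and it only covers the parameters discussed so far (action, prop, format, titles, and the optional inprop).

```python
from urllib.parse import urlencode

BASE_ENDPOINT = "http://en.wikipedia.org/w/api.php"


def build_info_url(titles, inprop=()):
    """Build a prop=info query URL for one or more page titles."""
    params = {
        "action": "query",
        "prop": "info",
        "format": "json",
        # Several values for one parameter are joined with a pipe.
        "titles": "|".join(titles),
    }
    if inprop:
        params["inprop"] = "|".join(inprop)
    # urlencode percent-encodes each pipe as %7C in the query string.
    return BASE_ENDPOINT + "?" + urlencode(params)


url = build_info_url(
    ["Stanford_University", "Harvard_University"],
    inprop=["protection", "watchers"],
)
print(url)
```

Keeping the parameters in a dictionary makes it easy to add or drop one (such as inprop) without worrying about where the ampersands go.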
Extracts
The revisions property
The revision history for the Stanford University Wikipedia page, in reverse chronological order:
http://en.wikipedia.org/w/index.php?title=Stanford_University&action=history
To see the last 500:
http://en.wikipedia.org/w/index.php?title=Stanford_University&offset=&limit=500&action=history
As JSON
Setting the prop attribute to revisions:
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Stanford%20University
{
  "query": {
    "pages": {
      "26977": {
        "pageid": 26977,
        "ns": 0,
        "title": "Stanford University",
        "revisions": [
          {
            "revid": 657035496,
            "parentid": 656920908,
            "user": "Kuru",
            "timestamp": "2015-04-18T13:33:37Z",
            "comment": "rmv non-[[WP:RS]]"
          }
        ]
      }
    }
  }
}
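One thing that trips people up with this response is that the entries under "pages" are keyed by page ID, which you usually don't know in advance. A small sketch of how the response above could be read in Python, with the JSON pasted in as a string so that it runs without a network connection:

```python
import json

# The response shown above, pasted in as a string.
RESPONSE_TEXT = """
{
  "query": {
    "pages": {
      "26977": {
        "pageid": 26977,
        "ns": 0,
        "title": "Stanford University",
        "revisions": [
          {
            "revid": 657035496,
            "parentid": 656920908,
            "user": "Kuru",
            "timestamp": "2015-04-18T13:33:37Z",
            "comment": "rmv non-[[WP:RS]]"
          }
        ]
      }
    }
  }
}
"""

data = json.loads(RESPONSE_TEXT)

# Loop over the values of "pages" instead of hard-coding the key "26977".
for page in data["query"]["pages"].values():
    latest = page["revisions"][0]
    print(page["title"], latest["revid"], latest["user"], latest["timestamp"])
```

Looping over `.values()` also keeps working when several titles are requested at once.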
Let's add a few things: we want size, tags, flags, and comment:
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Stanford%20University&rvprop=timestamp|user|size|tags|flags|comment|ids
{
  "query": {
    "pages": {
      "26977": {
        "pageid": 26977,
        "ns": 0,
        "title": "Stanford University",
        "revisions": [
          {
            "revid": 657035496,
            "parentid": 656920908,
            "user": "Kuru",
            "timestamp": "2015-04-18T13:33:37Z",
            "size": 155502,
            "comment": "rmv non-[[WP:RS]]",
            "tags": []
          }
        ]
      }
    }
  }
}
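Once the extra fields are present, it helps to turn them into proper Python values before doing any analysis. A sketch, using the revision entry from the response above: `parse_timestamp` is a helper name of my own, and I'm assuming the timestamp always arrives in this exact UTC format and that size is a byte count.

```python
from datetime import datetime, timezone

# One revision entry, copied from the response above.
revision = {
    "revid": 657035496,
    "parentid": 656920908,
    "user": "Kuru",
    "timestamp": "2015-04-18T13:33:37Z",
    "size": 155502,
    "comment": "rmv non-[[WP:RS]]",
    "tags": [],
}


def parse_timestamp(text):
    """Convert a timestamp like 2015-04-18T13:33:37Z into an aware UTC datetime."""
    parsed = datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


edited_at = parse_timestamp(revision["timestamp"])
# Assumption: size is measured in bytes, so divide by 1024 for kilobytes.
size_kb = revision["size"] / 1024
print(edited_at.isoformat(), revision["user"], round(size_kb, 1), "KB")
```

With real datetime objects, revisions can be sorted, bucketed by day, or compared against each other without any string juggling.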
To get more than one revision (and therefore more than one comment), we use rvlimit:
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Stanford%20University&rvprop=timestamp|user|size|tags|flags|comment|ids&rvlimit=5
Pagination
To get the next batch of revisions, pass the continuation value from the previous response as rvcontinue:
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Stanford%20University&rvprop=timestamp|user|size|tags|flags|parsedcomment|ids&rvlimit=5&rvcontinue=656306460
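Paging through a long history means repeating the request with a new rvcontinue value each time, which is a natural fit for a loop. The sketch below is an assumption-heavy illustration: `iter_revisions` and `fake_fetch` are names I invented, the loop assumes each response carries a top-level "continue" object while more results remain, and the two batches served by `fake_fetch` are made-up placeholder data standing in for real HTTP calls.

```python
def iter_revisions(fetch, params):
    """Yield revisions batch by batch, following continuation values.

    `fetch` is any callable that takes a dict of query parameters and
    returns the decoded JSON response as a dict.
    """
    params = dict(params)  # copy, so the caller's dict is left untouched
    while True:
        data = fetch(params)
        for page in data["query"]["pages"].values():
            for revision in page.get("revisions", []):
                yield revision
        # Assumption: a "continue" object (e.g. {"rvcontinue": ...}) is
        # present only when there are more results to request.
        if "continue" not in data:
            break
        params.update(data["continue"])


def fake_fetch(params):
    """Stand-in for a real HTTP call: serves two made-up batches."""
    if "rvcontinue" not in params:
        return {
            "continue": {"rvcontinue": "batch-2"},
            "query": {"pages": {"1": {"revisions": [{"revid": 3}, {"revid": 2}]}}},
        }
    return {"query": {"pages": {"1": {"revisions": [{"revid": 1}]}}}}


base_params = {
    "action": "query",
    "prop": "revisions",
    "format": "json",
    "titles": "Stanford University",
    "rvlimit": 2,
}
revids = [rev["revid"] for rev in iter_revisions(fake_fetch, base_params)]
print(revids)  # [3, 2, 1]
```

Passing the fetching function in as an argument keeps the paging logic testable offline; in real use it would be swapped for a function that performs the HTTP request and decodes the JSON.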
Getting the contents of the diff
http://en.wikipedia.org/w/api.php?action=query&rvcontentformat=text/plain&prop=revisions&rvdiffto=prev&titles=Stanford%20University&rvprop=timestamp|user|size|tags|flags|parsedcomment|ids&offset=12&rvlimit=1&rvcontinue=656306460