OK, so I was recently asked to help a newbie to Wikipedia (I’m going to call him Alex, as an arbitrary name) who was having a few issues accessing Wikipedia articles from the “Dictionary” application, produced by Apple for their Mac OS X operating system. When I asked Alex what the problem was, he told me that ever since the “new features” had been enabled on Wikipedia, it hadn’t worked for him. Apparently, he’d tried the fixes suggested by various articles (e.g. http://reviews.cnet.com/8301-13727_7-20006290-263.html ), most of which involved disabling the new features and/or reverting to the older “monobook” skin.
At this point, I grew a little concerned. Could this be a MAJOR bug in the new features that had actually broken the Wikipedia API (Application Programming Interface), the interface specifically designed for external programs and websites to use to interact with Wikipedia more easily?
Well, apparently not. Alex didn’t appear to be technically capable enough to give me the information to prove or disprove my theory, so I did a little digging and thinking. The new “features” actually added zero functionality to MediaWiki, the software that powers sites like Wikipedia; they’re purely usability improvements. The API was never touched, because there’s no need to make the API user-friendly: it’s only of real interest to programmers and developers, who should be able to understand the reasonably straightforward documentation available on the API itself, along with the linked documentation on mediawiki.org. It’s really easy to get the raw wikitext of a page via a simple URL such as http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&titles=Application%20programming%20interface. Appending &format=xml to the end even gets you it in XML format (alternatives include json, php, wddx and yaml, as well as HTML representations of those to aid debugging). You can get the parsed HTML of the page content even more easily with this URL: http://en.wikipedia.org/w/api.php?action=parse&page=Application%20programming%20interface. Both of these examples request the Application programming interface article. It’s kinda easy, huh?
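To show just how little code this takes, here’s a minimal sketch (Python, standard library only) that builds those two request URLs programmatically. The parameter names come straight from the URLs above; the helper function names are my own invention, not part of any library:

```python
# Sketch: constructing the two API request URLs described above.
# Only the standard library is used; wikitext_url/parsed_html_url are
# hypothetical helper names, not part of MediaWiki or any client library.
from urllib.parse import urlencode

API = "http://en.wikipedia.org/w/api.php"

def wikitext_url(title, fmt="xml"):
    """URL that returns the raw wikitext of a page's latest revision."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "titles": title,
        "format": fmt,  # json, php, wddx and yaml also work
    }
    return API + "?" + urlencode(params)

def parsed_html_url(title):
    """URL that returns the parsed HTML of the page content."""
    return API + "?" + urlencode({"action": "parse", "page": title})

print(wikitext_url("Application programming interface"))
print(parsed_html_url("Application programming interface"))
```

Feed either URL to any HTTP client and you get the article content back, no HTML parsing required.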
There are another two ways of getting the article content (the API above being the first):
2) Grab it directly out of the MySQL database behind the software, which is possible for some tools running on the Wikimedia Toolserver, which has a read-only copy of the actual database behind Wikipedia. However, this is impossible for the vast majority of users, so we’ll forget about this one.
3) Screen-scrape the content of the live articles.
I’ll admit, screen scraping is sometimes necessary to extract data from websites built only with older technologies. But most modern sites that developers would care about accessing provide some means of getting at their data, such as an API (Wikipedia, Facebook, and Twitter are among these), or RSS feeds for changing content (such as bug trackers, news sites, etc). On these sites, there’s no need to screen scrape, and doing so means you rely on the user interface staying exactly as it is, forever. Any small change to the design of the site, and your code will likely stop working. An API or RSS feed, by contrast, is designed to be stable, so breaking changes are avoided unless absolutely necessary. Screen scraping also means adding a metric ton of code just to figure out what you need to parse, where it is, what any special characters mean, and so on, and it generally takes a long time and a lot of effort (and hence money) to write. A documented API which rarely changes should be quick and easy to write code for, and most of the time it is.
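The contrast is easy to see on the client side. With the API in JSON format, extracting the wikitext is a few lines of dictionary lookups; no HTML parsing, no guessing at page layout. A minimal sketch, where the sample response below is a trimmed, illustrative imitation of the shape the query returns (real responses carry more fields, and the page ID here is made up):

```python
# Sketch: pulling wikitext out of a JSON-format API response.
# "sample" is a hand-trimmed, illustrative response; real ones are larger.
import json

sample = json.loads("""
{"query": {"pages": {"2262": {
    "pageid": 2262,
    "title": "Application programming interface",
    "revisions": [{"*": "An '''API''' is ..."}]
}}}}
""")

def extract_wikitext(response):
    # The "pages" object is keyed by page ID, so iterate over its values
    # rather than hard-coding any ID.
    pages = response["query"]["pages"]
    return [rev["*"] for page in pages.values() for rev in page["revisions"]]

print(extract_wikitext(sample))
```

Compare that with a scraper, which has to locate the content inside whatever HTML the skin of the day happens to emit. One CSS class renamed and the scraper breaks; the JSON structure above is the part that’s meant to stay put.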
I’ve not even mentioned the server side of it either… the API (should) contain none of the user-interface code, so none of the fancy page-rendering machinery even runs. That means the API returns a result to the client faster, and saves bandwidth and server CPU time, because the CSS/JS and most of the HTML never have to be transmitted alongside the information actually requested.
I guess what I’m trying to say is: USING AN API FOR A WEB APP IS BETTER FOR EVERYONE!
Which nicely leads me back to my original point: Apple’s app broke when a user interface changed. A user interface that’s never actually involved in the API. They’re screen scraping, and their app suffered quite a major failure when the folks on the Usability Initiative team at Wikimedia decided to roll out their UI changes. It looked like Wikipedia had broken, and all the Apple fanatics decided it must be Wikipedia’s fault. However, the fault actually lies with the developers at Apple, for not using the API that was provided.
So, that’s where I stood until a few hours ago. However, thanks to the weekly Wikipedia newsletter known as the Signpost, I learnt about bug 23602 on Wikimedia’s bugtracker. Apparently, Apple have fixed this breakage now, but the question is, have they learnt their lesson about screen scraping and fixed it properly by using the API?
Nope. They’re forcing the skin back to monobook, and continuing to screen scrape.
To all Apple users – the programming quality here appears to be ridiculously low, for software you paid huge sums of money for. Apparently, Apple’s developers don’t understand the basics of programming against web services such as Wikipedia. And rather than learning from their mistakes, they’re patching their code with nasty hacks to make it work again instead of fixing it properly.
In case you were wondering, I do have a problem with Apple generally. But even if I try to look at this in a positive light, a company with Apple’s reputation that seriously can’t deal with a simple problem like this properly… I fear for the future.