Deals have been signed, commitments made, and Big Data Week is just around the corner – so perhaps it’s time for another review of data.gov.my, the official Open Data portal of the Malaysian Federal Government.
Unlike the previous review, which found little change, this one turned up a lot of activity.
In total there are a whopping 44 new data assets (comparing the MYGovDataSet scrape from 25th October 2014 with a new scrape from 7th April 2015). That’s 44 on top of the existing 121, a 36% increase.
Picking one of the new entries at random, let’s look at data set 133: Ministry of Transport, a state-by-state breakdown of annual traffic accident counts from 2003 to 2013. It is provided as an Excel file, which, while not ideal, is a huge step up from PDF files. A glance at a couple of Excel files from the Ministry of Transport shows that they are at least formatted consistently.
My scrape also indicates that since October 2014, 4 datasets have been removed from the site: IDs 114 to 117. Unfortunately I did not save what those assets were, and obviously they are inaccessible now (I haven’t tried looking at the Internet Archive or any other caching service).
There could be any number of reasons why these data sets have apparently been removed. Perhaps they were simply moved to new IDs, or perhaps they were erroneous entries to begin with. Hopefully it is something benign like that, rather than some data provider having second thoughts and requesting that something be removed.
Unlike with previous scrapes, there are now some holes in the ID space (e.g. after dataset 141, the next available dataset is 144; IDs 142 and 143 seem to be unused), so it seems there’s been a change in how the site is administered. It could be how they deal with upload errors: simply creating a new entry and deleting the old one, instead of fixing the old one.
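For the curious, the scrape comparison described above boils down to a few set operations. Here is a minimal sketch in Python, assuming each scrape has been reduced to a list of integer dataset IDs. The function name `diff_scrapes` and the toy ID lists are made up for illustration; they are not the real MYGovDataSet data.

```python
def diff_scrapes(old_ids, new_ids):
    """Compare two scrapes, each given as an iterable of integer dataset IDs."""
    old, new = set(old_ids), set(new_ids)
    added = sorted(new - old)      # IDs only in the newer scrape
    removed = sorted(old - new)    # IDs that vanished since the older scrape
    # Holes: IDs inside the range of the newer scrape that appear in neither scrape.
    gaps = sorted(set(range(min(new), max(new) + 1)) - new - old)
    # Growth measured the same way as in the post: new assets over the old total.
    pct_increase = 100 * len(added) / len(old)
    return added, removed, gaps, pct_increase


# Hypothetical toy data: 10 old IDs, two of them removed, three new ones added.
old_scrape = range(1, 11)
new_scrape = [1, 2, 3, 6, 7, 8, 9, 10, 11, 12, 15]

added, removed, gaps, pct = diff_scrapes(old_scrape, new_scrape)
print(added, removed, gaps, pct)
```

Note that a hole found this way only says an ID is missing from both scrapes; an entry that was created and deleted between the two scrape dates looks exactly the same as an ID that was never used.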
Here’s something that may be more a criticism of how the Malaysian government functions than of its open data practice per se. Where would you expect to find data on vehicle licensing? If you guessed that it would all come from the Ministry of Transport, you’d be wrong.
Here we have the Prime Minister’s Department providing data that they call “Jumlah Lesen Terkumpul”, which translates to “Number of Licenses Collected”. Their description of the data set is the completely unhelpful “Jumlah Lesen Terkumpul Mengikut Kelas Lesen dan Negeri (31 Disember 2014)” which translates to “Number of licenses collected according to license class and state (31 December 2014)”. What is a “collected” license? And what sort of license are we talking about anyway?
Looking at the Excel file, we can guess that this is about licenses for public transport, perhaps a count of license revocations. It isn’t clear what it means. A bit of metadata and a better description would certainly help.
Trying out the portal’s search functionality, the term ‘lesen’ does turn up this asset, but ‘pengangkutan’, ‘teksi’, and ‘bas’ did not turn up anything, so search apparently covers only the page descriptions, which makes it critical to have a good description.
Machines, start your reading!
I’ve saved the best for last. When manually inspecting some of the newly provided data sets, something interesting stood out: I kept seeing Excel and CSV files. In previous scrapes there weren’t any CSV files at all, and PDFs used to account for over 80% of the assets available. Now the story is very different.
PDFs now account for only 38% of the data sets available. Combined, Excel and CSV files make up 43% of the mix, which is a huge boon for machine readability. Admittedly, Excel is not an open format, but at least there are modules to work with it, and if nothing else the files are certainly easier to work with (through manual intervention if need be) than PDFs.
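A format breakdown like this can be tallied straight from the download links. The following is a rough sketch, assuming the scrape yields one file URL per asset and that the file extension is a reliable indicator of format. The `format_shares` helper and the example.org URLs are hypothetical and only there to show the counting.

```python
from collections import Counter
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Treat both Excel extensions as a single "excel" bucket.
ALIASES = {"xls": "excel", "xlsx": "excel"}


def format_shares(urls):
    """Return the percentage share of each file format, keyed by format name."""
    counts = Counter()
    for url in urls:
        ext = PurePosixPath(urlparse(url).path).suffix.lower().lstrip(".")
        counts[ALIASES.get(ext, ext or "unknown")] += 1
    total = sum(counts.values())
    return {fmt: round(100 * n / total, 1) for fmt, n in counts.items()}


# Hypothetical links, not taken from data.gov.my.
sample_urls = (
    [f"http://example.org/files/report-{i}.pdf" for i in range(4)]
    + [f"http://example.org/files/table-{i}.xlsx" for i in range(3)]
    + ["http://example.org/files/legacy.XLS"]
    + [f"http://example.org/files/data-{i}.csv" for i in range(2)]
)

shares = format_shares(sample_urls)
print(shares)
```

Assets whose links carry no extension land in an "unknown" bucket, which is worth checking by hand before trusting the percentages.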
It’s clear that data.gov.my has gotten some love recently, and here’s hoping it keeps going.