Early-days analysis of data.gov.my

In a recent post I made some quick observations on data.gov.my, Malaysia’s brand spanking new Open Data portal. Since data loves to be meta, I went ahead and scraped the Open Data portal so that we can have some Open Data data (so meta; very data). The scraper/analyser lives at MYGovDataSet, though to obtain the results below I had to massage data and tweak the scripts in some ways that didn’t get checked in.

data.gov.my in all its glory

Screencap 16 July 2014

I was particularly interested in the following questions:

  1. Which datasets seem to be popular?
  2. What does the asset class mix look like? Put differently, how much of it is machine readable and how much of it is PDF muck?

Counting on a view

Unfortunately, it doesn’t look like data.gov.my has been terribly popular so far – the main portal page has just over 3k hits, if its own hit counter is to be believed. Nevertheless, in terms of the view count reported on each dataset page, some datasets seem to stand out more than others.

Viewcount of individual dataset pages on data.gov.my, 17 Jul 2014

Unfortunately, I don’t think we can draw any conclusions from that. Of the top five datasets by viewcount (rank 1, rank 2, rank 3, rank 4, rank 5), four are from the Ministry of Youth and Sports. The ministry might take pride in that, but it may simply be because their datasets are (at the time of writing) featured in the “Latest Datasets” widget on the main page. The most popular page also happens to be the latest, and hence sits at the top of that widget. The one bucking the trend is the dataset with ID 1 (the second most popular by viewcount), which leads me to wonder if there is some other source driving traffic in ascending ID order.

The scraper increments the viewcount (so clearly that’s not a count of actual data downloads; it’s a page hit counter). Unfortunately that means I’ve gone and skewed the first 20 dataset viewcounts because that’s the subset I used while testing, though based on the results it doesn’t seem like that made a big impact. It’s a bit worrying to see this viewcount measure receive prominence. A far better measure of the value of a dataset would be the number of times anyone actually bothered to click through to access the dataset, rather than visiting the portal page.

The asset mix

The portal allows multiple asset types to be uploaded per dataset (so each dataset may have both a PDF and an XLS representation, for example). There are 17 datasets in total that take advantage of this. Some of these, e.g. the Ministry of Communication and Multimedia’s dataset 61, upload two PDFs: an English version and a Malay version. It comes as a relief that all of the Prime Minister’s Department’s datasets are offered with XLS files as well as PDF files. I would guess that’s thanks to the Economic Planning Unit (EPU).

So kudos to the Prime Minister’s Department, right? Well, maybe not. Though they are to be congratulated for ostensibly making XLS files available, none of the assets are currently working. Broken links. 503 Service Temporarily Unavailable all round. I’ll give them the benefit of the doubt and assume that this is just a temporary glitch.

Counting datasets which include PDF files and XLS files as XLS assets, the asset mix is a straightforward, if slightly depressing, story. No fancy dataviz needed to reveal trends here: of the 121 datasets, 102 are PDF files, 11 are XLS files, and 8 are a mix of other things. That’s 84% PDFs if you must know.
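For anyone who wants to check the arithmetic, here is a minimal sketch of the tally. The three counts are the ones quoted above; the names `asset_counts` and `asset_shares` are just for illustration and are not taken from the MYGovDataSet scripts.

```python
from collections import Counter

# Counts as quoted in the post: 121 datasets in total.
asset_counts = Counter({"PDF": 102, "XLS": 11, "other": 8})


def asset_shares(counts):
    """Return each asset class's share of the total as a whole-number percentage."""
    total = sum(counts.values())
    return {kind: round(100 * n / total) for kind, n in counts.items()}


shares = asset_shares(asset_counts)
for kind, pct in shares.items():
    print(f"{kind}: {asset_counts[kind]} datasets ({pct}%)")
```

With these numbers the PDF share rounds to 84%, matching the figure above.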

This is obviously a huge blow against machine readability. I drilled down into the 8 unknowns:

  • The Ministry of Urban Wellbeing, Housing, and Local Government has three assets, all of which are HTML tables. (They advertise four: 114 and 115 have different descriptions but point to the same asset, while 116 and 117 point to other assets.) I think this is maybe slightly better than PDFs, but having come this far, it’s disappointing that they didn’t just make CSVs available instead. One wonders if the minister himself prefers this because it’s easier for him to glance at on his iPad.
  • The Ministry of Youth and Sports is a mixed bag. They have 5 datasets up. One of the assets is a PDF (60). Another is a PNG file (118), which is plain stupid and should be removed. Two dataset pages simply point to their PetaBelia webapp (119 and 121; I imagine these could be unified). The PetaBelia mapping webapp does look like it might be interesting, but the real standout is dataset ID 120, which points to a webform from which one can generate XLS files. CSVs would have been better, but in a sea of PDFs, this is great to see, especially because there’s quite some richness to the data.

The Prime Minister’s Department (and specifically the EPU) might want to work on availability, but that aside overall the ministries need to do better in terms of releasing data in the right format. The Ministry of Urban Wellbeing, Housing, and Local Government may be on to something with their HTML tables; this is not far off from just being raw CSV files, and at least they have some standardisation. The Ministry of Youth and Sports is an enigma: on the one hand they put up the worst asset (a PNG file, seriously?) but also the best asset. More on that below.
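To illustrate how short the hop from an HTML table to a CSV file is, here is a rough sketch using only the Python standard library. It assumes a single flat, well-formed table with explicit closing tags; the class name `TableRows`, the function `html_table_to_csv`, and the sample markup are all made up for the example and do not come from the ministry's pages or from MYGovDataSet.

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def html_table_to_csv(html):
    """Return the table rows found in `html` as CSV text."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


# Made-up sample markup, purely for illustration.
sample = (
    "<table>"
    "<tr><th>Region</th><th>Count</th></tr>"
    "<tr><td>A</td><td>10</td></tr>"
    "<tr><td>B</td><td>20</td></tr>"
    "</table>"
)
print(html_table_to_csv(sample))
```

Real pages with nested tables, merged cells, or omitted closing tags would need more care than this, which is one more reason publishing the CSV directly is the better option.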


There’s not been an explosion of interest in data.gov.my yet, but I don’t think there’s been much in the way of marketing either. The “Latest Datasets” widget is only useful if the site is frequently updated with new datasets and sees a lot of repeat visitors looking for what’s new; otherwise, all it does is skew the view count statistics as curious onlookers filter through the site.

The EPU’s broken links also raise a concern. Are these data custodians used to thinking about the availability of their infrastructure? Maybe not. In their day-to-day work they may be more accustomed to firing these data files at each other over email, which is a practice that obviously can’t scale for an Open Data platform. Maybe MAMPU should take ownership of hosting data assets where possible.

More worrying is the dominance of PDF assets. This is something that needs to be addressed, as PDF documents are notoriously difficult to work with. This problem is aggravated by the fact that in some cases the PDFs are reports which contain tables of figures, rather than being raw data files. The fact that complete reports (even annual reports!) are being offered as “data” assets on this portal does somewhat undermine the apparent commitment to Open Data, and it would be great to see more effort to put up actual data files. Having said that, the ministries that contributed to this are to be commended – not every ministry contributed something, and even for those who did, there are some glaring omissions. For instance, KDN provided drug abuse stats, but what people really want to see are crime stats.

A clear standout data asset is the Ministry of Youth and Sports’ Statistik Sikap Belia Malaysia dataset (120). This is by far the most interesting data asset I’ve noticed because it is itself a data portal that provides downloadable spreadsheets for different segments. Ironically, despite being the highest quality asset, its portal page is tainted by broken metadata in the form of a truncated description: “Statistik Sikap Belia Malaysia mengikut”.

A couple of other general quality observations:

  • In the course of this work, I’ve noticed the site is unavailable quite often. Like right now, 10.55am Malaysia time, 17 July 2014, double checked with http://www.downforeveryoneorjustme.com/data.gov.my.
  • At one point the Ministry of Tourism and Culture had some bad metadata in the form of “Last Updated: 0000-00-00 00:00:00”. That was breaking my parser because datetime can’t represent that value: there is no year, month, or day zero. This impacted 8987, and 95. However, it’s very encouraging that as I write this, that error has been fixed, and they didn’t just set the date to the current time either; they took the trouble to backdate it, presumably to around the actual upload time.
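For what it’s worth, a scraper can tolerate that kind of placeholder timestamp rather than fall over. The sketch below assumes the field has the “YYYY-MM-DD HH:MM:SS” shape quoted above and that the parser is Python; `parse_last_updated` is a hypothetical helper, not the function used in MYGovDataSet.

```python
from datetime import datetime
from typing import Optional


def parse_last_updated(raw: str) -> Optional[datetime]:
    """Parse a 'Last Updated' value, returning None for unparseable placeholders.

    The format string is an assumption based on the value quoted in the post.
    """
    try:
        return datetime.strptime(raw.strip(), "%Y-%m-%d %H:%M:%S")
    except ValueError:
        # Covers the all-zero placeholder as well as any other malformed value.
        return None


print(parse_last_updated("0000-00-00 00:00:00"))
print(parse_last_updated("2014-07-17 10:55:00"))
```

Returning `None` keeps the bad record visible in the results as “unknown” instead of silently inventing a date for it.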

Bugs happen, but maintenance is very good to see.

