The last time I looked at the portal, I pulled out some numbers on view counts for each dataset and on the different file types.

That was about three months ago, and a lot has happened in the past few days. At the recently concluded MSC ICM meeting, big data was highlighted as a national priority area, which re-emphasises the importance of the portal. So now seems like a good time to take another look at it.

One specific thing I’m hoping to figure out is which datasets appear to be of most interest. I think this would be useful information because it could help focus efforts to produce machine-readable datasets. Predictably, the portal is currently a PDF fest, but I reckon calls for the whole lot to be made machine readable are not terribly helpful (unless they lead to an executive decree to make all data available as CSV files, which is an unrealistic expectation). On the other hand, if we can highlight specific datasets which would be especially valuable, then it might be an easier sell. Or maybe it’ll just inspire some good Samaritans to convert the right datasets into CSVs.

Therefore, my focus will be on view counts. But it’s not as simple as just plucking out the top-N datasets by view count because, as previously observed, some dataset pages receive a view count boost due to placement on the front page of the portal, for instance because of the Latest Datasets tab. Luckily, this time we have some longitudinal data, so we can focus on how things have changed.

View count numbers: July vs Oct 2014

So what’s new? Not very much. There are no new datasets. But the view counts have been moving. The average view count per dataset page has almost doubled, from 42 three months ago to 80 today. This growth has been far from even; the standard deviation has really exploded.
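For anyone who wants to repeat this kind of comparison, here is a minimal sketch of the summary statistics, assuming each snapshot is a simple mapping from dataset id to view count. The numbers are toy values chosen only so that the means match the 42 and 80 quoted above; they are not the real portal figures, and the `summarise` helper is just an illustrative name.

```python
from statistics import mean, pstdev

def summarise(view_counts):
    """Return (mean, population standard deviation) of per-dataset view counts."""
    counts = list(view_counts.values())
    return mean(counts), pstdev(counts)

# Hypothetical toy snapshots (dataset id -> view count), NOT real portal data.
july = {1: 40, 2: 44, 3: 42}
october = {1: 50, 2: 60, 3: 130}

july_mean, july_sd = summarise(july)
oct_mean, oct_sd = summarise(october)

print(f"July:    mean={july_mean:.1f}, sd={july_sd:.1f}")
print(f"October: mean={oct_mean:.1f}, sd={oct_sd:.1f}")
```

A mean that doubles while the standard deviation grows much faster is the signature of a few pages pulling away from the rest, which is why looking at individual datasets is more informative than the average.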

There are over 121 datasets, so let’s drill into the top 20 by view count. Five datasets dropped out of the top 20 in three months. Things in the top 5 are pretty tight, due to the aforementioned “Latest Datasets” effect. That accounts for datasets 119-121; I’m not sure what’s driving traffic to dataset 1, but presumably there’s another vanity link somewhere.
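The top-N comparison can be sketched roughly as follows, again assuming each snapshot is a mapping from dataset id to view count. The ids and counts are made up for illustration, and `top_n` and `dropped_out` are hypothetical helper names.

```python
def top_n(view_counts, n):
    """Return the ids of the n most-viewed datasets, highest first."""
    ranked = sorted(view_counts, key=view_counts.get, reverse=True)
    return ranked[:n]

def dropped_out(before, after, n):
    """Return ids that were in the top n before but are no longer in it."""
    return set(top_n(before, n)) - set(top_n(after, n))

# Hypothetical toy snapshots, NOT real portal data.
before = {"a": 90, "b": 70, "c": 50, "d": 30}
after = {"a": 120, "b": 75, "c": 55, "d": 110}

print(top_n(after, 2))
print(dropped_out(before, after, 2))
```

Here dataset "d" jumps into the top 2 and pushes "b" out, which is the same kind of churn as the five datasets that left the top 20.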

The only standout in the top 5 is dataset 119, which has posted the largest three-month gain of everything in the top 20, significantly outperforming even its peers in the “Latest Datasets” family.

That aside, datasets 61, 59, and 117 had solid view count growth, outperforming everything except dataset 119. As far as I’m aware, there is no indication that their numbers are artificially bloated by a “Latest Datasets” tab or similar.
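One way to rank datasets by growth while setting aside pages known to get a front-page boost is sketched below. It assumes "gain" means the absolute change in view count between two snapshots (a relative change would be an equally reasonable choice), and the ids, counts, and the `gains` helper are all hypothetical.

```python
def gains(before, after, exclude=()):
    """Absolute view count gain per dataset, largest first, skipping excluded ids."""
    deltas = {
        ds: after[ds] - before[ds]
        for ds in after
        if ds in before and ds not in exclude
    }
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)

# Hypothetical toy snapshots, NOT real portal data.
before = {"a": 90, "b": 70, "c": 50, "d": 30}
after = {"a": 120, "b": 75, "c": 95, "d": 110}

# Ids assumed to be boosted by front-page placement (an illustrative choice).
promoted = {"d"}

ranked = gains(before, after, exclude=promoted)
print(ranked)
```

Keeping the exclusion list explicit makes the judgment call visible: anyone who disagrees about which pages are artificially boosted can change one set and rerun the ranking.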

With that, the nominees for “most interesting datasets on the portal” are:

  • Dataset 61: “MCMC Annual Report 2012”
  • Dataset 59: “Educational Quick Facts 2013”
  • Dataset 117: “Performance Report On Low Cost Housing Development”
  • Dataset 119: “Aplikasi Pemetaan Belia Malaysia”

Dataset 119 is problematic because, as previously discussed, it’s not really a dataset; rather, it’s a data portal. Dataset 61 is also problematic because an annual report, though interesting, is not really a data asset. So that leaves 59 and 117, both of which sound interesting and are probably valuable. Dataset 59 is a PDF, so that’s a clear candidate for converting into machine-readable form. Dataset 117 is an enigma… because it’s a broken link. Why do I even bother? Good thing 59 is pretty interesting.

