httrack for downloading websites

Scraping web pages for offline hosting can be handy for testing. I’m a long-time wget fan, but for pulling down entire web pages, CSS/JS bits and all, it just trips up too easily, so I needed something better. Some quick googling revealed the venerable httrack tool.

There are loads of positive comments about httrack all over the Interwebs, so after a quick brew install httrack I pointed it at

The initial experience was not great, but then I realized where the problem was: redirects to, and I guess the redirect confuses httrack. Pointing httrack directly to the redirected url produces a much better result.

I’d also read that sometimes robots.txt files will mask CSS and JS files. Luckily httrack provides an argument to ignore robots.txt.

Finally, aided by the interactive-mode wizard, the httrack command I ended up with looks as follows:

httrack  -O "/tmp/webscrapetests/httrack/cnn1/f3" --mirrorlinks -%v --robots=3 -r4

That is a rather heavyweight approach:

Bytes saved: 	145,02MiB	       Links scanned: 	74/651 (+556)
Time: 	12min37s	               Files written: 	567
Transfer rate: 	24,26KiB/s (24,95KiB/s)  Files updated: 	0
Active connections: 	4	       Errors: 	15

Current job: parsing HTML file (54%)
 request - 	0B / 	8,00KiB
 receive - 	9,48KiB / 	21,43KiB
 receive - 	2,11KiB / 	41,15KiB
 receive - 	17,65KiB / 	32,34KiB

But since it’s a one-off operation, that doesn’t seem like such a problem. Once it’s done, fire up python -m SimpleHTTPServer 8000 from the output directory (python3 -m http.server 8000 is the Python 3 equivalent), disable your wifi, then point your browser at localhost port 8000 – some things may be broken, like ads (win?), but most of the page should load just fine.
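
If you would rather script the local server than type the one-liner, something like the following Python 3 sketch should do. It assumes the mirror ended up under the -O path from the httrack command above, and make_mirror_server is just a name I made up for illustration; it is not part of httrack or the standard library.

```python
import functools
import http.server


def make_mirror_server(directory, port=8000):
    """Build an HTTP server that serves static files from `directory`.

    Binds to 127.0.0.1 only, so the mirror is not exposed to the network.
    Passing port=0 lets the OS pick any free port.
    """
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    return http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)


# Typical use (blocks until interrupted with Ctrl-C):
#   server = make_mirror_server("/tmp/webscrapetests/httrack/cnn1/f3")
#   server.serve_forever()
```

Binding to 127.0.0.1 rather than all interfaces is a deliberate choice here: the point is an offline test, so there is no reason for the mirror to be reachable from other machines.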

Posted in Uncategorized | Tagged , | Leave a comment

Do you really need a blockchain?

There’s a scene in Angels and Demons where the Camerlengo asks Robert Langdon if he believes in God. Professor Langdon replies “faith is a gift I have yet to receive”. I found this interesting because, despite not being part of the church, Langdon has skills that are crucial to its survival.

I’m no Robert Langdon, but perhaps another reason why that scene resonated with me is that it kind of describes my relationship with blockchain technologies. On a technical level I could not be more enthusiastic about blockchain technology. A while back I even did some paid consulting work involving blockchains. The issues are all fascinating, from the high level with the Byzantine Generals’ Problem, right down to lower-level stuff like the bounded control-flow of the scripting language, and even to seemingly simple things like memory buffer eviction policies designed for resilience against abuse. It’s all fun stuff.


Because DISRUPTION. Right?

However, beyond the pleasure of intellectual pursuit, for now I am firmly a blockchain skeptic. It just doesn’t seem like blockchain is as widely applicable as people say it is. Some areas like syndicated loans may be ripe for blockchains, but that’s a very niche field with high barriers to entry, and very few projects actually operate at that level. And yes, I realise there’s plenty of institutional support even on the retail end of finance, but I really don’t know if there’s much more to it than FOMO – after all, if you’re in banking, spending a bit on blockchain projects is cheap insurance.

In real conversations the skepticism often manifests as point objections, such as:

  • Full confirmation is too slow for most use cases.
  • What do you mean Bitcoin enables micro-transactions? Microtransactions have been powering the Internet for a long time. Sure, with Bitcoin you don’t need a prepaid model, but prepaid hasn’t been that much of a problem thus far, so who cares? And anyway, many microtransactions require low latencies, which again are just fundamentally at odds with the blockchain model.
  • Are you sure using a blockchain results in a cheaper solution, or is it just that more of the costs are kept off your books?

Each of the above will have pro-blockchain counter-arguments, such as SPV instead of full confirmation, but it just ends up going in circles.

These objection-handling conversations can get a bit frustrating, so maybe we should stop starting from the premise that something should have a blockchain component, and then trying to find flaws in the idea. What if we approached it from the other perspective: why should this thing have a blockchain?

The trouble is, in my experience, there is a tendency for people to argue that trust (or trustless-ness) is an intrinsic property of blockchain technology, and when they see a problem with some trust component they immediately think of incorporating blockchain as part of the solution. Is that reasoning really good enough? Or is trust merely a confusing straw man in this situation? We should be very clear on when and why a blockchain is needed, because it is expensive technology, and introduces major limitations on what users and operators can do.

Let’s start from first principles. What is a blockchain? And building on that, in what situations does one need a blockchain? For me, the penny dropped when I happened upon a gem in the a16z podcast. I feel Adam Ludwin really nails it about 30 minutes in, when he says (I’m paraphrasing for conciseness):

A blockchain is a database of assets that is shared with participants in the network, where a given asset is controlled by the participant who owns the asset, and not controlled by whoever controls the database.

The key thing is that whoever controls the database does not control the data. What does that mean? Let’s spell it out plainly.

If you take a copy of a database that is stored on some infrastructure that you control, e.g. the full blockchain that is downloaded by a Bitcoin client on your computer, you can go ahead and change the data as you like. It is your data. It lives on your computer. However, the data is structured in such a way that by making these arbitrary changes, you would have violated the integrity of the database, and consequently the database ceases to be valid, and is effectively no longer usable. It’s not a question of knowing how Bitcoin works – you can know everything there is about Bitcoin, but even armed with perfect understanding, you will never be able to make just the right edits to the data. The only way changes to your local copy of the database can be made without breaking the database is for the “owner” of the relevant entries within the database to make the required changes.

This is the key property of blockchain technology: a copy of a dataset can exist anywhere, but to maintain the integrity of the dataset, individual items of data within the dataset can only be manipulated by a specific actor.

The word “only” in the above must be interpreted in the strictest technical sense, not “entrepreneur-level strictness”; there is no assumed database administrator with a secret key who can make the right binary-level edits or whatever. That’s the cryptographic magic of blockchain. It is absolute.
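
To make that property a bit more concrete, here is a toy Python sketch of a shared database in which an asset can only change hands if its current owner signs the transfer. It is not how Bitcoin does it: Bitcoin uses ECDSA signatures, hash-chained blocks and proof-of-work, all of which are left out. I have swapped in Lamport one-time signatures purely because they need nothing beyond hashlib, and every name below (generate_keypair, transfer_message, is_valid and so on) is made up for illustration.

```python
import hashlib
import secrets


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def generate_keypair():
    """Lamport one-time keypair: 256 pairs of secrets, and their hashes."""
    private = [(secrets.token_bytes(32), secrets.token_bytes(32))
               for _ in range(256)]
    public = [(_h(a), _h(b)) for a, b in private]
    return private, public


def _bits(message: bytes):
    """The 256 bits of SHA-256(message), most significant bit first."""
    digest = _h(message)
    return [(digest[i // 8] >> (7 - i % 8)) & 1 for i in range(256)]


def sign(private, message: bytes):
    """Reveal one secret per message bit. A key must only ever sign once."""
    return [private[i][bit] for i, bit in enumerate(_bits(message))]


def verify(public, message: bytes, signature) -> bool:
    """Anyone holding the public key can check, but not produce, a signature."""
    return all(_h(signature[i]) == public[i][bit]
               for i, bit in enumerate(_bits(message)))


def fingerprint(public) -> bytes:
    return _h(b"".join(a + b for a, b in public))


def transfer_message(asset: str, new_owner_public) -> bytes:
    """The bytes the current owner must sign to hand `asset` over."""
    return asset.encode() + b"->" + fingerprint(new_owner_public)


def is_valid(genesis, entries) -> bool:
    """Replay every transfer; the database is valid only if each one was
    signed by whoever owned the asset at that point."""
    owners = dict(genesis)
    for asset, new_owner_public, signature in entries:
        if asset not in owners:
            return False
        message = transfer_message(asset, new_owner_public)
        if not verify(owners[asset], message, signature):
            return False
        owners[asset] = new_owner_public
    return True


# Alice owns coin-1 and signs it over to Bob: the database stays valid.
alice_priv, alice_pub = generate_keypair()
bob_priv, bob_pub = generate_keypair()
genesis = {"coin-1": alice_pub}
entries = [("coin-1", bob_pub,
            sign(alice_priv, transfer_message("coin-1", bob_pub)))]
print(is_valid(genesis, entries))

# Mallory holds a full copy of the database and edits the recipient,
# but without Alice's private key the edited copy no longer validates.
mallory_priv, mallory_pub = generate_keypair()
tampered = [("coin-1", mallory_pub, entries[0][2])]
print(is_valid(genesis, tampered))
```

Mallory is free to write whatever she likes into her own copy of the list, which is the "it is your data" part. What she cannot do is produce an entry that any other participant running is_valid would accept.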

Do you need that magic? Are you sure you can’t just use a properly secured website with a normal database backend, to which you can make edits should customers screw up a request and need you to do a rollback? I mean if you’re building a solution for people to trade unused cellular hours, having Vodafone as an all-powerful database owner doesn’t seem so bad. The same could be said for a loyalty points trading platform.

It’s not a technical choice, it’s a business case choice. For your idea to work, do you need a database where there is no database owner who can make arbitrary updates?

If you answered yes, then congratulations, you indeed have a problem that may require a blockchain. Otherwise, you should default to a traditional database solution instead of needlessly accepting all the expense and loss of control that comes with a blockchain.

Maybe one day I will receive the gift of faith. But until then, I test every idea against the key requirement described above. Regardless, I’ll try to find some spare time to learn about things like lightning networks and segregated witness, just for the intellectual fun of it.


libproc for process listing on OSX

On OSX, if you need to manage other processes, a sensible place to start is the NSRunningApplication class. You can instantiate one of these with a pid:

NSRunningApplication *a = 
    [NSRunningApplication runningApplicationWithProcessIdentifier:pid];

An NSRunningApplication object has, amongst other things, the following properties:

  • executableURL
  • bundleURL
  • bundleIdentifier

That sounds promising, but there’s a catch: NSRunningApplication can’t tell you about everything running on your system. For example, process 548 on my system is the Launch Daemon. But ask NSRunningApplication about it and you’ll get:

  • executableURL: (null)
  • bundleURL: (null)
  • bundleIdentifier: (null)

On the other hand, point it at Preview and NSRunningApplication is a champion:

  • executableURL: file:///Applications/
  • bundleURL: file:///Applications/
  • bundleIdentifier:

I’m not sure precisely what the criteria are for a process to be properly recognised by NSRunningApplication (maybe it needs to be registered with a bundle ID), but luckily there’s another way to enumerate processes on OSX that digs deeper into the BSD layer: libproc.

This is a poorly documented (unsupported?) C API that you can learn more about from /usr/include/libproc.h, but in a nutshell there are two calls that most people would care about:

  • proc_listpids() to get a list of pids
  • proc_pidpath() to get the executable path associated with the pid

They can be put together with something like this:

#include <libproc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    // proc_listpids() reports sizes in bytes, not in number of pids.
    int bufSize = proc_listpids(PROC_ALL_PIDS, 0, NULL, 0);
    if (bufSize <= 0) return 1;
    pid_t *pids = calloc(1, (size_t)bufSize);
    int pidCount = proc_listpids(PROC_ALL_PIDS, 0, pids, bufSize) / (int)sizeof(pid_t);
    char pathBuffer[PROC_PIDPATHINFO_MAXSIZE];
    for (int i = 0; i < pidCount; i++) {
        pathBuffer[0] = '\0';  // stays empty if proc_pidpath() fails
        proc_pidpath(pids[i], pathBuffer, sizeof(pathBuffer));
        printf("pid %d = %s\n", pids[i], pathBuffer);
    }
    free(pids);
}

That will give you the executable path of every process, even those that NSRunningApplication cannot grok.


The data-dev-ops triangle

For a while now I’ve sensed the emergence of a new species of software engineer, adding a third node to the DevOps dichotomy (though from a skills/capabilities perspective it’s actually more of a continuum).

The third vertex, joining the Dev and Ops elements, is Data. DataOps has a nice ring to it, though DevData (or DataDev?) unfortunately doesn’t roll off the tongue quite so nicely. But more importantly, I think there is substance to it.

(Disclaimer: this is stream of consciousness, so a bit unrefined, and I loathe that I haven’t included a single citation/link in this, but I decided to get it out there now anyway to start somewhere)


An initial perspective on domestic mass electronic surveillance in Malaysia

The Snowden disclosures brought into public consciousness the issue of domestic mass surveillance. This has triggered debate throughout the developed world, less so in the developing world.

Curious about current perceptions on this issue in Malaysia, I posted a question to the Big Data Malaysia discussion group:

People in this group may well be regarded as Big Data experts by their friends and family, and I’m curious… are you hearing any concern about potential mass electronic surveillance* in Malaysia?

(*I mean the sort of thing brought to light by the Snowden disclosures.)

The following is not my personal opinion, rather it is my personal summary of opinions provided on the above-mentioned discussion thread. There were 39 comments in response. I coded, categorized, and weighted (by likes) each comment to produce the following summary. It is unavoidably subjective, but hopefully it’s not too far off from a useful snapshot of the opinion of the members of Big Data Malaysia on the subject of mass electronic surveillance in Malaysia.


gets serious

Deals have been signed, commitments made, and Big Data Week is just around the corner – so perhaps it’s time for another review of, the official Open Data portal of the Malaysian Federal Government.

Unlike the previous review, where not much change was found, this time there’s been a lot of activity.


Reflecting on the importance of domain knowledge in data science

When I was a PhD student, we had a regular internal seminar series for postgrads to present anything relevant to their work, most typically their intermediate findings. At one such session, a mobile telecommunications researcher was presenting his findings on energy consumption. He had created a simulated annealing model, crunched some data and presented some graphs. One professor in the audience was apparently paying closer attention than most. He leaned in, squinted, then observed: “So basically what you found was that with two batteries the thing lasts twice as long?”
