httrack for downloading websites

Scraping web pages for offline hosting can be handy for testing. I’m a long-time wget fan, but for pulling down entire web pages, CSS/JS bits and all, it just trips up too easily, so I needed something better. Some quick googling revealed the venerable httrack tool.

There are loads of positive comments about httrack all over the Interwebs, so after a quick brew install httrack I pointed it at

The initial experience was not great, but then I realized where the problem was: redirects to, and I guess the redirect confuses httrack. Pointing httrack directly to the redirected url produces a much better result.
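One way to find the post-redirect address before handing it to httrack is to let Python follow the redirects and report where it ends up. This is only a minimal sketch: the function name resolve_final_url is made up for illustration, and it assumes the site redirects with ordinary HTTP 3xx responses rather than JavaScript or meta-refresh tricks.

```python
import urllib.request


def resolve_final_url(url, timeout=10):
    """Return the URL reached after following HTTP redirects from `url`.

    Hypothetical helper: urllib follows 301/302/303/307 responses on its
    own, so the response's URL is the final, post-redirect address.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.geturl()


# Example use (the address is a placeholder, not the site from this post):
# print(resolve_final_url("http://example.com/"))
```

The printed address is the one to give httrack instead of the original one.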

I’d also read that robots.txt rules sometimes keep crawlers away from CSS and JS files. Luckily httrack provides an option to ignore robots.txt.
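To check whether a site's robots.txt really does block its stylesheets and scripts, Python's standard urllib.robotparser can evaluate the rules offline. The sketch below is illustrative only: blocked_assets is a made-up helper, the rules and URLs in the usage comment are invented, and the "*" user agent is an assumption (httrack may identify itself differently).

```python
from urllib.robotparser import RobotFileParser


def blocked_assets(robots_lines, asset_urls, user_agent="*"):
    """Return the URLs in `asset_urls` that the robots.txt rules disallow.

    `robots_lines` is the robots.txt content as a list of lines.
    `user_agent` defaults to the wildcard agent, which is an assumption.
    """
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return [url for url in asset_urls if not parser.can_fetch(user_agent, url)]


# Example use with invented rules and URLs:
# rules = ["User-agent: *", "Disallow: /static/"]
# print(blocked_assets(rules, ["http://example.com/static/site.css"]))
```

If the list comes back non-empty, a crawler that honours robots.txt will skip those files.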

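Finally, aided by the interactive-mode wizard, the httrack command I ended up with looks as follows (note --robots=0, which tells httrack never to follow robots.txt; --robots=3 means always follow it, even strict rules):

httrack  -O "/tmp/webscrapetests/httrack/cnn1/f3" --mirrorlinks -%v --robots=0 -r4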
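For repeat runs against different sites, the same flags can be assembled in a small script. This is a rough sketch rather than anything httrack ships: build_httrack_command, the example URL, and the output path are all placeholders, and the robots value follows httrack's documented -sN scale (0 = never follow robots.txt, 3 = always).

```python
import shlex


def build_httrack_command(url, output_dir, depth=4, follow_robots=0):
    """Build an httrack argument list mirroring the flags used above.

    Hypothetical helper. follow_robots maps to --robots=N
    (0 = never follow robots.txt, 3 = always, even strict rules).
    """
    return [
        "httrack", url,
        "-O", output_dir,            # output directory for the mirror
        "--mirrorlinks",             # also mirror links found on first-level pages
        "-%v",                       # show downloaded filenames as they arrive
        f"--robots={follow_robots}",
        f"-r{depth}",                # maximum mirror depth
    ]


# Example use (placeholder URL and path):
# cmd = build_httrack_command("https://example.com/", "/tmp/webscrapetests/httrack/example")
# print(shlex.join(cmd))            # inspect the command line
# subprocess.run(cmd, check=True)   # requires `import subprocess` and httrack installed
```

Returning a list rather than a string keeps the arguments safe to pass to subprocess.run without shell quoting.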

That is a rather heavyweight approach:

Bytes saved:         145,02MiB                  Links scanned:  74/651 (+556)
Time:                12min37s                   Files written:  567
Transfer rate:       24,26KiB/s (24,95KiB/s)    Files updated:  0
Active connections:  4                          Errors:         15

Current job: parsing HTML file (54%)
 request -  0B / 8,00KiB
 receive -  9,48KiB / 21,43KiB
 receive -  2,11KiB / 41,15KiB
 receive -  17,65KiB / 32,34KiB

But since it’s a one-off operation, that doesn’t seem like such a problem. Once it’s done, fire up python3 -m http.server 8000 from the output directory (python -m SimpleHTTPServer 8000 on Python 2), disable your wifi, then point your browser at localhost port 8000. Some things may be broken, like ads (win?), but most of the page should load just fine.
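On Python 3 the old SimpleHTTPServer module lives in http.server, and the same handler can be started from a script without changing into the mirror directory first. The following is a minimal sketch under those assumptions: start_mirror_server is a made-up name, and the path in the usage comment is a placeholder.

```python
import functools
import http.server
import threading


def start_mirror_server(directory, port=8000):
    """Serve `directory` on 127.0.0.1:`port` from a background thread.

    Hypothetical helper; returns the server so the caller can call
    shutdown() on it. Needs Python 3.7+ for the `directory` argument.
    """
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


# Example use (placeholder path):
# server = start_mirror_server("/tmp/webscrapetests/httrack/example")
# input("Serving on http://localhost:8000, press Enter to stop")
# server.shutdown()
```

Binding to 127.0.0.1 keeps the mirror reachable only from the local machine, which suits offline testing.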

