httrack for downloading websites

Scraping web pages for offline hosting can be handy for testing. I’m a long-time wget fan, but for pulling down entire web pages, CSS/JS bits and all, it just trips up too easily, so I needed something better. Some quick googling revealed the venerable httrack tool.

There are loads of positive comments about httrack all over the Interwebs, so after a quick brew install httrack I pointed it at http://www.cnn.com.

The initial experience was not great, but then I realized where the problem was: http://www.cnn.com redirects to http://edition.cnn.com/, and my guess is that the redirect confuses httrack. Pointing httrack directly at the redirected URL produces a much better result.
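To find out where a site really lives before handing the address to httrack, something like the following rough sketch should do. It only uses Python 3's standard library; the function name resolve_final_url and the timeout value are my own choices for illustration, not anything httrack provides.

```python
import urllib.request


def resolve_final_url(url, timeout=10):
    """Follow HTTP redirects and return the URL that finally answered."""
    # urlopen() follows 3xx redirects by default; geturl() reports the
    # address of the response we actually ended up with.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.geturl()
```

Calling resolve_final_url("http://www.cnn.com") and passing the result to httrack avoids having to guess the redirect target by hand.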

I’d also read that sometimes robots.txt files will mask CSS and JS files. Luckily httrack provides an argument to ignore robots.txt.
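If you want to check whether a site's robots.txt really would hide its CSS and JS, Python's standard urllib.robotparser can answer that offline. This is only a sketch: the function name blocked_assets, the sample rules and the example.com URLs are made up for illustration, and real crawlers may interpret robots.txt slightly differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, not taken from any real site.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /static/js/
Disallow: /static/css/
"""


def blocked_assets(robots_txt, asset_urls, user_agent="*"):
    """Return the URLs from asset_urls that robots_txt disallows."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return [url for url in asset_urls if not parser.can_fetch(user_agent, url)]
```

Feeding it the saved robots.txt plus a handful of stylesheet and script URLs from the page source shows which ones a robots-respecting crawler would skip.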

Finally, aided by the interactive-mode wizard, the httrack command I ended up with looks as follows:

httrack http://edition.cnn.com -O "/tmp/webscrapetests/httrack/cnn1/f3" --mirrorlinks -%v --robots=0 -r4

That is a rather heavyweight approach:

Bytes saved:        145,02MiB                 Links scanned:  74/651 (+556)
Time:               12min37s                  Files written:  567
Transfer rate:      24,26KiB/s (24,95KiB/s)   Files updated:  0
Active connections: 4                         Errors:         15

Current job: parsing HTML file (54%)
 request - i2.cdn.turner.com/cnnnext/dam/assets/160323081635-money-consumer-reports-smartphone-large-tease.jpg  0B / 8,00KiB
 receive - i2.cdn.turner.com/cnnnext/dam/assets/160307120744-future-of-adventure-concept-bike-4-large-tease.jpg  9,48KiB / 21,43KiB
 receive - i2.cdn.turner.com/cnnnext/dam/assets/160403134416-money-toshiba-battery-recall-large-tease.jpg  2,11KiB / 41,15KiB
 receive - i2.cdn.turner.com/cnnnext/dam/assets/160413092137-money-amazon-kindle-oasis-large-tease.jpg  17,65KiB / 32,34KiB

But since it’s a one-off operation, that doesn’t seem like such a problem. Once it’s done, change into the output directory and fire up python -m SimpleHTTPServer 8000 (on Python 3 the equivalent is python3 -m http.server 8000), disable your wifi, then point your browser at localhost port 8000. Some things may be broken, like ads (win?), but most of the page should load just fine.
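If you would rather not cd into the mirror first, a few lines of Python 3 (3.7 or later) can serve the directory from anywhere. Treat this as a minimal sketch: make_mirror_server is a name I made up, and it binds to 127.0.0.1 only, on the assumption that the mirror is for local testing.

```python
import functools
import http.server
import os


def make_mirror_server(mirror_dir, port=8000):
    """Build an HTTP server that serves static files from mirror_dir."""
    # The directory= argument tells the handler which folder to serve,
    # so the current working directory does not matter.
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler,
        directory=os.fspath(mirror_dir),
    )
    # Listen on the loopback interface only; nothing is exposed to the network.
    return http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
```

For the mirror above, make_mirror_server("/tmp/webscrapetests/httrack/cnn1/f3").serve_forever() would serve it on port 8000 until interrupted with Ctrl-C.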
