Scraping web pages for offline hosting can be handy for testing. I’m a long-time
wget fan, but for pulling down entire web pages, CSS/JS bits and all, it trips up too easily, so I needed something better. Some quick googling revealed the venerable httrack.
There are loads of positive comments about httrack all over the Interwebs, so after a quick
brew install httrack I pointed it at http://www.cnn.com.
The initial experience was not great, but then I realized where the problem was: http://www.cnn.com redirects to http://edition.cnn.com/, and I guess the redirect confuses httrack.
Pointing httrack directly at the redirected URL produces a much better result.
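One way to sidestep the redirect is to look up the final URL first and hand that to httrack. Here is a minimal Python 3 sketch, assuming the standard library's urllib follows the redirect the same way a browser would. The helper name resolve_final_url and its opener parameter are my own illustration, not part of httrack.

```python
import urllib.request


def resolve_final_url(url, opener=urllib.request.urlopen):
    """Return the URL that `url` ends up at after HTTP redirects.

    `opener` is an illustrative hook so the function can be tried
    without network access; by default it is urllib's urlopen.
    """
    # urlopen follows 3xx redirects by default; geturl() reports
    # the URL of the resource that was finally retrieved.
    with opener(url) as response:
        return response.geturl()


# Example (needs network access, result depends on the site's current setup):
#   resolve_final_url("http://www.cnn.com")
```

The returned URL is what you would then pass to httrack instead of the original one.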
I’d also read that robots.txt files will sometimes mask CSS and JS files. Luckily,
httrack provides an argument to ignore robots.txt.
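To check whether a robots.txt would really hide stylesheets or scripts from a crawler, the rules can be tested with Python's urllib.robotparser. This is only a sketch: the rules and paths below are made up for illustration and are not CNN's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /css/
Disallow: /js/
"""


def blocked_paths(robots_txt, base_url, paths, user_agent="*"):
    """Return the subset of `paths` that the given robots.txt disallows."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if not parser.can_fetch(user_agent, base_url + p)]


candidates = ["/index.html", "/css/site.css", "/js/app.js"]
print(blocked_paths(SAMPLE_ROBOTS_TXT, "http://example.com", candidates))
```

Any CSS or JS path that shows up in the result is one a robots-respecting crawler would skip.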
Finally, aided by the interactive-mode wizard, the
httrack command I ended up with looks as follows:
httrack http://edition.cnn.com -O "/tmp/webscrapetests/httrack/cnn1/f3" --mirrorlinks -%v --robots=3 -r4
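For scripting the same scrape from Python, the invocation can be assembled as an argument list. This sketch reuses exactly the flags from the command above; the function names and the depth default are my own placeholders, and it assumes httrack is on the PATH when run_httrack is actually called.

```python
import shlex
import subprocess


def build_httrack_command(url, output_dir, depth=4):
    """Build the argv list for the httrack call shown above."""
    return [
        "httrack", url,
        "-O", output_dir,   # output directory for the mirror
        "--mirrorlinks",
        "-%v",
        "--robots=3",
        f"-r{depth}",       # mirror depth
    ]


def run_httrack(url, output_dir, depth=4):
    """Run httrack and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(build_httrack_command(url, output_dir, depth), check=True)


cmd = build_httrack_command(
    "http://edition.cnn.com", "/tmp/webscrapetests/httrack/cnn1/f3"
)
# Print the command in shell-quoted form instead of running it.
print(shlex.join(cmd))
```

Passing a list to subprocess.run avoids shell quoting issues with the output path.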
This crawl is rather heavyweight, as httrack's status display shows:
Bytes saved: 145,02MiB
Links scanned: 74/651 (+556)
Time: 12min37s
Files written: 567
Transfer rate: 24,26KiB/s (24,95KiB/s)
Files updated: 0
Active connections: 4
Errors: 15
Current job: parsing HTML file (54%)
request - i2.cdn.turner.com/cnnnext/dam/assets/160323081635-money-consumer-reports-smartphone-large-tease.jpg 0B / 8,00KiB
receive - i2.cdn.turner.com/cnnnext/dam/assets/160307120744-future-of-adventure-concept-bike-4-large-tease.jpg 9,48KiB / 21,43KiB
receive - i2.cdn.turner.com/cnnnext/dam/assets/160403134416-money-toshiba-battery-recall-large-tease.jpg 2,11KiB / 41,15KiB
receive - i2.cdn.turner.com/cnnnext/dam/assets/160413092137-money-amazon-kindle-oasis-large-tease.jpg 17,65KiB / 32,34KiB
But since it’s a one-off operation, that doesn’t seem like such a problem. Once it’s done, fire up
python -m SimpleHTTPServer 8000 from the output directory (on Python 3 the equivalent is python3 -m http.server 8000), disable your wifi, then point your browser at localhost port 8000. Some things may be broken, like ads (win?), but most of the page should load just fine.
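If you would rather not cd into the mirror first, the same kind of static server can be started from a few lines of Python 3.7+. This is a sketch using only the standard library; make_server is a name I made up, and the directory in the usage comment is simply the output path from the httrack command above.

```python
import functools
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def make_server(directory, port=8000):
    """Create an HTTP server that serves `directory` on localhost:`port`."""
    # The `directory` argument of SimpleHTTPRequestHandler exists since
    # Python 3.7; it replaces the need to chdir into the mirror.
    handler = functools.partial(SimpleHTTPRequestHandler, directory=directory)
    return ThreadingHTTPServer(("127.0.0.1", port), handler)


# Usage (blocks until interrupted with Ctrl-C):
#   make_server("/tmp/webscrapetests/httrack/cnn1/f3").serve_forever()
```

Binding to 127.0.0.1 keeps the mirror reachable only from your own machine.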