The Zombie Stack Exchanges That Just Won't Die
In particular, I'm thinking about Archive-It, HTTrack, Heritrix, and Wget. I would be especially interested in which tools make sense for small, medium, and large cultural heritage organizations (libraries, archives, museums, historical societies), considering their use of standards, preservation risks, ease of use, differences in how they approach significant properties, and sustainability.
Trevor Owens
Unfortunately, I had lots of difficulties using all of them.
There are two distinct goals here:
1. Capturing one particular website at the best possible quality.
2. Crawling many sites at scale, with harvesting automated through scripts.
For the first goal, it's best to have an array of these tools on hand, plus a couple of paid ones, and to try each of them to see which produces the best-quality capture of that particular website.
For the second goal, I'd go with wget because it's the easiest to script; a rough sketch follows below. HTTrack is also an option, but wget is simpler and probably easier to build a crawling cluster from.
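As a minimal sketch of what that scripting looks like, here is one capture wrapped in Python. The URL and WARC prefix are placeholder examples, and the flags shown are standard wget options you would still tune (scoping, waits, robots handling) for a real crawl; it assumes GNU Wget 1.14 or later for WARC support.

```python
import subprocess

def archive_site(url: str, warc_prefix: str) -> None:
    """Crawl a site with wget, keeping a browsable mirror plus a WARC file.

    Assumes GNU Wget 1.14+ (for --warc-file) is on the PATH.
    """
    subprocess.run(
        [
            "wget",
            "--recursive",            # follow links within the site
            "--level=inf",            # no depth limit on the crawl
            "--no-parent",            # don't wander above the seed URL
            "--page-requisites",      # fetch the CSS, images, and scripts pages need
            "--convert-links",        # rewrite links so the local copy browses offline
            "--adjust-extension",     # save HTML/CSS with matching file extensions
            "--wait=1",               # be polite: pause between requests
            f"--warc-file={warc_prefix}",  # also write a standards-based WARC capture
            url,
        ],
        check=True,
    )

if __name__ == "__main__":
    # Hypothetical example target; substitute the site you want to capture.
    archive_site("https://example.org/", "example-capture")
```

Writing a WARC alongside the plain mirror is worth the extra flag for preservation purposes, since WARC (ISO 28500) is the same container format Heritrix and Archive-It produce.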
In the medium-sized archives where I used to work, we used WebHTTrack for a couple of reasons:
1. It was free, and since web harvesting for preservation was not a regular part of our acquisition policy, it didn't make sense to pay for a service.
2. It was very simple to run in the browser, and the parameters/settings were well-documented online.
We did run into problems using it, though. We could only crawl the site we were capturing once every two days because the process took so long (I assume because of the size of the site and the limits of our hardware, which at that stage was just a test machine). By the time a crawl finished, the site's CSS had sometimes changed mid-capture, so we had to go back and manually update it for parts of the site, a very tedious process.
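For what it's worth, HTTrack can also be driven from the command line rather than the browser interface, which helps if you want to schedule recurring crawls rather than babysit them. This is only a sketch under assumptions: the site, output path, and scope filter below are hypothetical placeholders, using HTTrack's -O output option and its +pattern scope filters.

```python
import subprocess

def mirror_with_httrack(url: str, output_dir: str, scope_filter: str) -> None:
    """Mirror a site with the HTTrack command-line client.

    Assumes the `httrack` binary is installed and on the PATH; the
    scope filter uses HTTrack's +pattern syntax to keep the crawl
    from wandering off-site.
    """
    subprocess.run(
        [
            "httrack",
            url,                      # seed URL to mirror
            "-O", output_dir,         # where HTTrack writes the mirror and logs
            scope_filter,             # e.g. "+*.example.org/*" to stay on-site
        ],
        check=True,
    )

if __name__ == "__main__":
    # Hypothetical example; substitute your own site and storage path.
    mirror_with_httrack(
        "https://example.org/",
        "/data/mirrors/example",
        "+*.example.org/*",
    )
```

A wrapper like this can be dropped into a scheduler (cron, for instance), which sidesteps some of the manual babysitting described above, though it does nothing about long crawl times or mid-crawl site changes.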
It may well be worth looking at WAIL, which attempts to bundle up Heritrix and Wayback to make them easier to use. Making the 'big' tools like Heritrix more usable by more people is a great idea.