The Zombie Stack Exchanges That Just Won't Die
In particular, I'm thinking about Archive-It, HTTrack, Heritrix, and Wget. I would be especially interested in which tools make sense for small, medium, and large cultural heritage organizations (libraries, archives, museums, historical societies), considering their use of standards, preservation risks, ease of use, differences in how they approach significant properties, and sustainability.
Trevor Owens
Unfortunately, I had lots of difficulties using all of them.
There are two distinct goals here:
1. Capturing one particular website at the best possible quality.
2. Crawling many sites at scale, with harvesting automated through scripts.
For the first goal, it's best to have an array of these tools on hand, plus a couple of paid ones, and to try each of them to see which produces the best-quality capture of that particular website.
For the second goal, I'd go with wget because it's the easiest to script; a rough sketch follows below. HTTrack is also an option, but wget is simpler and probably easier to build a crawling cluster from.
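As a minimal sketch of what that scripting looks like, here is one capture wrapped in Python. The URL and WARC prefix are placeholder examples, and the flags shown are standard wget options you would still tune (scoping, waits, robots handling) for a real crawl; it assumes GNU Wget 1.14 or later for WARC support.

```python
import subprocess

def archive_site(url: str, warc_prefix: str) -> None:
    """Crawl a site with wget, keeping a browsable mirror plus a WARC file.

    Assumes GNU Wget 1.14+ (for --warc-file) is on the PATH.
    """
    subprocess.run(
        [
            "wget",
            "--recursive",            # follow links within the site
            "--level=inf",            # no depth limit on the crawl
            "--no-parent",            # don't wander above the seed URL
            "--page-requisites",      # fetch the CSS, images, and scripts pages need
            "--convert-links",        # rewrite links so the local copy browses offline
            "--adjust-extension",     # save HTML/CSS with matching file extensions
            "--wait=1",               # be polite: pause between requests
            f"--warc-file={warc_prefix}",  # also write a standards-based WARC capture
            url,
        ],
        check=True,
    )

if __name__ == "__main__":
    # Hypothetical example target; substitute the site you want to capture.
    archive_site("https://example.org/", "example-capture")
```

Writing a WARC alongside the plain mirror is worth the extra flag for preservation purposes, since WARC (ISO 28500) is the same container format Heritrix and Archive-It produce.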
In the medium-sized archives where I used to work, we used WebHTTrack for a couple of reasons:
1. It was free, and since web harvesting for preservation was not a regular part of our acquisition policy, it didn't make sense to pay for a service.
2. It was very simple to run in the browser, and the parameters/settings were well-documented online.
We did run into problems using it, though. We could only crawl the site we were capturing once every two days because the process took so long (I assume because of the size of the site and the limits of our hardware, which at that stage was just a test machine). By the time a crawl finished, the site's CSS had sometimes changed mid-capture, so we had to go back and manually update it for parts of the site, a very tedious process.
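For what it's worth, HTTrack can also be driven from the command line rather than the browser interface, which helps if you want to schedule recurring crawls rather than babysit them. This is only a sketch under assumptions: the site, output path, and scope filter below are hypothetical placeholders, using HTTrack's -O output option and its +pattern scope filters.

```python
import subprocess

def mirror_with_httrack(url: str, output_dir: str, scope_filter: str) -> None:
    """Mirror a site with the HTTrack command-line client.

    Assumes the `httrack` binary is installed and on the PATH; the
    scope filter uses HTTrack's +pattern syntax to keep the crawl
    from wandering off-site.
    """
    subprocess.run(
        [
            "httrack",
            url,                      # seed URL to mirror
            "-O", output_dir,         # where HTTrack writes the mirror and logs
            scope_filter,             # e.g. "+*.example.org/*" to stay on-site
        ],
        check=True,
    )

if __name__ == "__main__":
    # Hypothetical example; substitute your own site and storage path.
    mirror_with_httrack(
        "https://example.org/",
        "/data/mirrors/example",
        "+*.example.org/*",
    )
```

A wrapper like this can be dropped into a scheduler (cron, for instance), which sidesteps some of the manual babysitting described above, though it does nothing about long crawl times or mid-crawl site changes.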
It may well be worth looking at WAIL, which attempts to bundle up Heritrix and Wayback to make them easier to use. Making the 'big' tools like Heritrix more usable by more people is a great idea.