Continuous, incremental, scalable, higher-quality web crawls with Heritrix
Under Legal Deposit, our crawl capacity needs grew from a few hundred time-limited snapshot crawls to the continuous crawling of hundreds of sites every day, plus annual domain crawling. We have struggled to make this transition, as our Heritrix set-up was cumbersome to work with when running large numbers of separate crawl jobs, and the way it managed the crawl process and crawl state made it difficult to gain insight into what was going on and harder still to augment the process with automated quality checks. To address this, we have combined three main tactics: we have moved to containerised deployment, reduced the amount of crawl state exclusively managed by Heritrix, and switched to a continuous crawl model where hundreds of sites can be crawled independently in a single crawl. These changes have significantly improved the quality and robustness of our crawl processes, while requiring minimal changes to Heritrix itself. We will present some results from this improved crawl engine, and explore some of the lessons learned along the way.
Since we shifted to crawling under Legal Deposit in 2013, the size and complexity of our crawling has massively increased. Instead of crawling a...
Story of a Bad Deed
I love a digital preservation mystery, and this one started with a question from @joe on digipres.club:
A mystery file, starting with
0x0baddeed, eh? Fascinating. Those hex digits didn’t happen by accident. Using four-byte hex patterns to signal format is an extremely common design pattern, but no authority hands them out – each format designer mints them independently. There must be a story here…
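As a rough illustration of how such a signature is typically checked, here is a minimal Python sketch that compares the first four bytes of a file against 0x0baddeed. The function name and the big-endian byte order are assumptions made for the example; the question only quotes the value itself.

```python
import struct

# Assumption: the signature is the four bytes 0B AD DE ED at offset 0,
# read in big-endian order. Only the value 0x0baddeed comes from the post.
MAGIC = 0x0BADDEED


def starts_with_magic(data: bytes, magic: int = MAGIC) -> bool:
    """Return True if the first four bytes of data equal magic (big-endian)."""
    if len(data) < 4:
        return False
    (value,) = struct.unpack(">I", data[:4])
    return value == magic
```

A file written on a little-endian system might store the same value with its bytes reversed, so a real identification tool would usually test both orders before drawing conclusions.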
Sustaining the Software that Preserves Access to Web Archives
Today is the inaugural International Digital Preservation Day, and as a small contribution to that excellent global effort I thought I’d write about the current state of the open source tools that enable access to web archives.
Most web archive access happens thanks to the Internet Archive’s Wayback Machine. The underlying software that delivers that service has gone through at least three iterations (as far as I know). The first was written in Perl and was never made public, but is referred to in papers and bits of documentation. The second was written in Java, and was made open source. The third implementation appears to be written in Python and offers some exciting new features, but is not open source. As far as I can tell, the Internet Archive is currently using both the Java and Python versions of Wayback (for the Archive-It service and the global Wayback Machine respectively), but the direction of travel is away from the Java version.
This matters because like most of the web archives in the world, we built our own access system upon the open source version of the Wayback software....
Driving Crawls With Web Annotations
Originally published on the UK Web Archive blog on the 10th of November 2017.
The heart of the idea was simple. Rather than our traditional linear harvesting process, we would think in terms of annotating the live web, and imagine how we might use those annotations to drive the web-archiving process. From this perspective, each Target in the Web Curator Tool is really very similar to a bookmark on a social bookmarking service (like Pinboard, Diigo or Delicious), except that as well as describing the web site, the annotations also drive the archiving of that site.
In this unified model, some annotations may simply highlight a specific site or URL at some point in time, using descriptive metadata to help ensure important resources are made available to our users. Others might more explicitly drive the crawling process, by describing how often the site should be re-crawled, whether robots.txt should be obeyed, and so on. Crucially, where a particular website cannot be ruled as in-scope for UK legal deposit automatically, the annotations can be used to record any additional evidence that permits us to crawl the...
Tools for Legal Deposit
Before I revisit the ideas explored in the first post in the blog series, I need to go back to the start of this story…
Between 2003 and 2013 – before the Non-Print Legal Deposit regulations came into force – the UK Web Archive could only archive websites by explicit permission. During this time, the Web Curator Tool (WCT) was used to manage almost the entire life-cycle of the material in the archive. Initial processing of nominations was done via a separate Selection & Permission Tool (SPT), and the final playback was via a separate instance of Wayback, but WCT drove the rest of the process.
Of course, selective archiving is valuable in its own right, but this was also seen as a way of building up the experience and expertise required to implement full domain crawling under Legal Deposit. However, WCT was not deemed to be a good match for a domain crawl. The old version of Heritrix embedded inside WCT was not considered very scalable, was not expected to be supported for much longer, and was difficult to re-use or replace because of the way it was baked inside WCT.
Digging Documents Out Of The Archived Web
As an increasing number of government and other publications move towards online-only publication, we are forced to move on from our traditional Legal Deposit processes based on cataloging printed media. As we are already tasked with archiving UK web publications, the question is not so much ‘how do we collect these documents?’ but rather ‘how do we find the documents we’ve already collected?’. This presentation will explore the issues we’ve uncovered as we’ve sought to integrate our web archives with our traditional document cataloging processes, especially around official publications and e-journals. Our current Document Harvester will be described, and its advantages and limitations explored. Our current methods for exploiting machine-generated metadata will be discussed, and an outline of our future plans for this type of work will be presented.