Story of a Bad Deed
I love a digital preservation mystery, and this one started with a question from @joe on digipres.club:
A mystery file, starting with
0x0baddeed, eh? Fascinating. Those hex digits didn’t happen by accident. Using four-byte hex patterns like this to signal format is an extremely common design pattern, but no authority hands them out – each format designer mints them independently. There must be a story here…
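To illustrate the pattern, here is a minimal sketch of how a format sniffer might test for a magic number like this one. The function name and the big-endian byte order are my assumptions for illustration, not details from the original mystery:

```python
import struct

# The four "magic" bytes, interpreted as a big-endian unsigned integer.
MAGIC = 0x0BADDEED

def has_magic_number(path):
    """Return True if the file's first four bytes are 0x0baddeed."""
    with open(path, "rb") as f:
        header = f.read(4)
    if len(header) < 4:
        return False
    (value,) = struct.unpack(">I", header)  # ">I" = big-endian 32-bit unsigned
    return value == MAGIC
```

Real identification tools (such as DROID or `file`) keep large registries of signatures like this, precisely because there is no central authority assigning them.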
Sustaining the Software that Preserves Access to Web Archives
Today is the inaugural International Digital Preservation Day, and as a small contribution to that excellent global effort I thought I’d write about the current state of the open source tools that enable access to web archives.
Most web archive access happens thanks to the Internet Archive’s Wayback Machine. The underlying software that delivers that service has gone through at least three iterations (as far as I know). The first was written in Perl and was never made public, but is referred to in papers and bits of documentation. The second was written in Java, and was made open source. The third implementation appears to be written in Python and offers some exciting new features, but is not open source. As far as I can tell, the Internet Archive is currently using both the Java and Python versions of Wayback (for the Archive-It service and the global Wayback Machine respectively), but the direction of travel is away from the Java version.
This matters because like most of the web archives in the world, we built our own access system upon the open source version of the Wayback software. Now that...
Driving Crawls With Web Annotations
Originally published on the UK Web Archive blog on the 10th of November 2017.
The heart of the idea was simple. Rather than our traditional linear harvesting process, we would think in terms of annotating the live web, and imagine how we might use those annotations to drive the web-archiving process. From this perspective, each Target in the Web Curator Tool is really very similar to a bookmark on a social bookmarking service (like Pinboard, Diigo or Delicious), except that as well as describing the web site, the annotations also drive the archiving of that site.
In this unified model, some annotations may simply highlight a specific site or URL at some point in time, using descriptive metadata to help ensure important resources are made available to our users. Others might more explicitly drive the crawling process, by describing how often the site should be re-crawled, whether robots.txt should be obeyed, and so on. Crucially, where a particular website cannot be ruled as in-scope for UK legal deposit automatically, the annotations can be used to record any additional evidence that permits us to crawl the site. Any permissions we...
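A rough sketch of what one such unified annotation record might look like, with both descriptive and crawl-driving fields. The field names here are illustrative assumptions on my part, not the actual Web Curator Tool or W3C annotation schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TargetAnnotation:
    url: str
    title: str = ""
    description: str = ""                  # descriptive metadata, as on a bookmark
    crawl_frequency: Optional[str] = None  # e.g. "daily", "monthly"; None = one-off
    obey_robots_txt: bool = True
    in_scope_for_legal_deposit: bool = False
    permission_evidence: Optional[str] = None  # evidence that permits crawling

    def should_crawl(self) -> bool:
        """Crawl only if in scope for legal deposit, or permission is recorded."""
        return self.in_scope_for_legal_deposit or self.permission_evidence is not None
```

The point of the model is that one record serves both purposes: the same annotation that tells users why a site matters also tells the crawler when and whether it may fetch it.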
Tools for Legal Deposit
Before I revisit the ideas explored in the first post in the blog series I need to go back to the start of this story…
Between 2003 and 2013 – before the Non-Print Legal Deposit regulations came into force – the UK Web Archive could only archive websites by explicit permission. During this time, the Web Curator Tool (WCT) was used to manage almost the entire life-cycle of the material in the archive. Initial processing of nominations was done via a separate Selection & Permission Tool (SPT), and the final playback was via a separate instance of Wayback, but WCT drove the rest of the process.
Of course, selective archiving is valuable in its own right, but this was also seen as a way of building up the experience and expertise required to implement full domain crawling under Legal Deposit. However, WCT was not deemed to be a good match for a domain crawl. The old version of Heritrix embedded inside WCT was not considered very scalable, was not expected to be supported for much longer, and was difficult to re-use or replace because of the way it was baked inside WCT....
Digging Documents Out Of The Archived Web
As an increasing number of government and other publications move towards online-only publication, we are forced to rethink our traditional Legal Deposit processes, which are based on cataloguing printed media. As we are already tasked with archiving UK web publications, the question is not so much ‘how do we collect these documents?’ as ‘how do we find the documents we’ve already collected?’. This presentation will explore the issues we’ve uncovered as we’ve sought to integrate our web archives with our traditional document cataloguing processes, especially around official publications and e-journals. Our current Document Harvester will be described, and its advantages and limitations explored. Our current methods for exploiting machine-generated metadata will be discussed, and an outline of our future plans for this type of work will be presented.
Can a Web Archive Lie?
This is the script for the introduction I gave as part of a ‘Digital Conversations at the BL’ panel event: Web Archives: truth, lies and politics in the 21st century on Wednesday 14th of June, as part of Web Archiving Week 2017.
My role is to help build a web archive for the United Kingdom that can be used to separate fact from fiction. But to do that, you need to be able to trust that the archived content we present to you is what it purports to be.
Which raises the question: Can a web archive lie?
Well, it can certainly be confusing. For example, one seemingly simple question that we are sometimes asked is: How big is the UK web? Unfortunately, this is actually quite a difficult question. First, unlike print, many web pages are generated by algorithms, which means the web is technically infinite. Even putting that aside, to answer this question precisely we’d need to capture every version of everything on the web. We just can’t do that. Even if we could download every version of everything we know about, there’s also the problem of all the sites that we failed to even...