Digital Preservation: Lessons Learned?
I find working in digital preservation fascinating.
It’s not where I expected to end up. I started off interested in computing and science, and happened to find out about what was then a fairly young MPhys degree in Computational Physics offered by the University of York. I then did a Ph.D. in Computational Physics at Edinburgh University, working in statistical physics. After that, I spent my time oscillating between being a postdoctoral researcher who used large-scale computational methods, and being a computational specialist who helped other scientists make use of those kinds of techniques.
I’d decided to move away from research and get a ‘normal’ industry programmer job, so when we moved to Leeds I applied for a few different positions. One of them turned out to be for the PLANETS Project, based at the British Library. I liked the place and the people, and the work sounded interesting, allowing me to bring my previous experience (not just in computation, but also the information theory that underlies statistical physics) to a new field. And industry was spared my woolly ways.
I spent a happy few years working on the PLANETS Project and helping kick off the...
Frontiers in Format Identification
I came to work on digital preservation through the PLANETS project, and later the SCAPE project (for the first year) before moving over to web archiving. These were inspiring projects which achieved a great deal, but they also left us with lessons to learn.
Building Tools to Archive the Modern Web
Four years ago, during the 2012 IIPC General Assembly, we came together to discuss the recent and upcoming challenges to web archiving in the Future of the Web Workshop (see also the related coverage on David Rosenthal’s blog). That workshop made it clear that our tools were failing to meet many of these challenges:
- Database driven features
- Complex/variable URI formats
- Dynamically generated URIs
- Rich, streamed media
- Incremental display mechanisms
- Multi-sourced, embedded content
- Dynamic login, user-sensitive embeds
- User agent adaptation
- Exclusions (robots.txt, user-agent, …)
- Exclusion by design (i.e. site architecture intended to inhibit crawling and indexing)
- Server-side scripts, RPCs
- HTML5 web sockets
- Mobile sites
- DRM protected content, now part of the HTML standard
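To make one of the entries above concrete, the “Exclusions (robots.txt, user-agent, …)” challenge can be sketched with Python’s standard `urllib.robotparser` module, which is how a polite crawler decides whether a URL may be fetched. The rules and URLs below are purely hypothetical examples, not taken from any real site or from our crawler’s actual configuration.

```python
# Minimal sketch: how a polite crawler honours robots.txt exclusions.
# The rules below are invented for illustration only.
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: *",
    "Disallow: /private/",
    "",
    "User-agent: SpecialBot",
    "Disallow: /",
]

rp = RobotFileParser()
rp.parse(rules)  # in practice, rp.set_url(...) + rp.read() fetches the live file

# A generic crawler may fetch public pages, but not the disallowed path:
print(rp.can_fetch("*", "http://example.com/index.html"))   # True
print(rp.can_fetch("*", "http://example.com/private/x"))    # False

# A user-agent that is excluded outright is blocked everywhere:
print(rp.can_fetch("SpecialBot", "http://example.com/index.html"))  # False
```

The harder cases on the list are exactly the ones this mechanism cannot express: sites that inhibit crawling by design, dynamically generated URIs, and content that only exists after client-side script execution.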
I wish I could stand here and tell you how much great progress we’ve made in the last four years, ticking entries off this list, but I can’t. Although we’ve made some progress, our crawl development resources have been consumed by more basic issues. We knew moving to domain crawling under Legal Deposit would bring big changes in scale, but I’d underestimated how much the dynamics of the crawl workflow would need to change.
News websites are a great example.
Updating our historical search service
Originally published on the UK Web Archive blog on the 15th of February 2016.
Earlier this year, as part of the Big UK Domain Data for the Arts and Humanities project, we released our first ‘historical search engine’ service. We’ve publicised it at IDCC15, the 2015 IIPC GA and at the first RESAW conference, and it’s been very well received. Not only has it led to some excellent case studies that we can use to improve our services, but other web archives have shown interest in re-using the underlying open source code. In particular, some of our Canadian colleagues have successfully launched webarchives.ca, which lets users search ten years’ worth of archived websites from Canadian political parties and political interest groups (see here for more details).
But we remained frustrated, for two reasons. Firstly, when we built that first service, we could not cope with the full scale of the 1996-2013 dataset, and we only managed to index the two billion resources up to 2010. Secondly, we had not yet learned how to cope with more than one or two users at a time, so we were loath to...
The provenance of web archives
Originally published on the UK Web Archive blog on the 20th of November 2015.
Over the last few years, it’s been wonderful to see more and more researchers taking an interest in web archives. Perhaps we’re even teetering into the mainstream when a publication like Forbes carries an article digging into the gory details of how we should document our crawls in How Much Of The Internet Does The Wayback Machine Really Archive?
Playing at Web Archiving
A few months ago, a colleague suggested that we should come up with ways of helping people learn about the main stages of web archiving, and to help them understand some of the more common technical terminology.
I got a bit carried away…