Frontiers in Format Identification
I came to work on digital preservation through the PLANETS project, and later the SCAPE project (for the first year), before moving over to web archiving. These were inspiring projects that achieved a great deal, but they also left us with lessons to learn.
Building Tools to Archive the Modern Web
Four years ago, during the 2012 IIPC General Assembly, we came together at the Future of the Web Workshop to discuss the recent and upcoming challenges facing web archiving (see also the related coverage on David Rosenthal’s blog). That workshop made it clear that our tools are failing to meet many of these challenges:
- Database driven features
- Complex/variable URI formats
- Dynamically generated URIs
- Rich, streamed media
- Incremental display mechanisms
- Multi-sourced, embedded content
- Dynamic login, user-sensitive embeds
- User agent adaptation
- Exclusions (robots.txt, user-agent, …)
- Exclusion by design (i.e. site architecture intended to inhibit crawling and indexing)
- Server-side scripts, RPCs
- HTML5 web sockets
- Mobile sites
- DRM-protected content, now part of the HTML standard
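Of the challenges above, exclusion handling is the most mechanical. As a minimal sketch of how a polite crawler honours robots.txt rules before fetching a URL (using Python's standard-library `urllib.robotparser`; the robots.txt body and the `ExampleBot` user-agent are hypothetical, purely for illustration):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; a real crawler would fetch this
# from https://<host>/robots.txt before crawling the site.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# The crawler consults the parsed rules before each fetch.
print(rp.can_fetch("ExampleBot", "https://example.com/private/page"))  # False
print(rp.can_fetch("ExampleBot", "https://example.com/public/page"))   # True
```

Real crawlers such as Heritrix layer further policy on top of this (per-host caching of robots.txt, user-agent-specific rules, and operator overrides for Legal Deposit collection), but the core allow/disallow check looks much like this.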
I wish I could stand here and tell you how much great progress we’ve made in the last four years, ticking entries off this list, but I can’t. Although we’ve made some progress, our crawl development resources have been consumed by more basic issues. We knew moving to domain crawling under Legal Deposit would bring big changes in scale, but I’d underestimated how much the dynamics of the crawl workflow would need to change.
News websites are a great example.
Updating our historical search service
Originally published on the UK Web Archive blog on the 15th of February 2016.
Earlier this year, as part of the Big UK Domain Data for the Arts and Humanities project, we released our first ‘historical search engine’ service. We’ve publicised it at IDCC15, the 2015 IIPC GA and at the first RESAW conference, and it’s been very well received. Not only has it led to some excellent case studies that we can use to improve our services, but other web archives have shown interest in re-using the underlying open source code. In particular, some of our Canadian colleagues have successfully launched webarchives.ca, which lets users search ten years’ worth of archived websites from Canadian political parties and political interest groups (see here for more details).
But we remained frustrated, for two reasons. Firstly, when we built that first service, we could not cope with the full scale of the 1996-2013 dataset, and we only managed to index the two billion resources up to 2010. Secondly, we had not yet learned how to cope with more than one or two users at a time, so we were loath to...
The provenance of web archives
Originally published on the UK Web Archive blog on the 20th of November 2015.
Over the last few years, it’s been wonderful to see more and more researchers taking an interest in web archives. Perhaps we’re even tipping into the mainstream when a publication like Forbes carries an article digging into the gory details of how we should document our crawls: “How Much Of The Internet Does The Wayback Machine Really Archive?”
Playing at Web Archiving
A few months ago, a colleague suggested that we should come up with ways of helping people learn about the main stages of web archiving, and of helping them understand some of the more common technical terminology.
I got a bit carried away…
Let Them Emulate!
On the first day of the IIPC GA 2015, the morning keynote was Digital Vellum: Interacting with Digital Objects Over Centuries, presented by Vint Cerf and Mahadev Satyanarayanan. This included some more details and demonstrations of the proposed preservation solution I blogged about before, so I thought it was worth returning to the subject now that I know a little more about it.