User-Driven Digital Preservation

First published on the UK Web Archive blog.

When we archive the web, we want to do our best to ensure that future generations will be able to access the content we have preserved. This isn’t only a matter of keeping the digital files safe and ensuring they don’t get damaged. We also need to worry about the software that is required in order to access those resources.

A Digital Dark Age?

For many years, the spectre of obsolescence and the ensuing digital dark age has driven much of the research into digital preservation. However, even back in 2006, Chris Rusbridge was arguing that this concern was overblown, and since at least 2007, David Rosenthal has been arguing that this kind of obsolescence is no longer credible.

What Are The Risks?

The current consensus seems to have largely (but not universally) shifted away from perceiving obsolescence as the main risk we face. Many of us expect the vast majority of our content to remain accessible for at least one or two decades, and believe any attempt to predict the future of information technology beyond the next twenty years should be taken with an extremely large pinch of salt. In the meantime, we are likely to face much more basic issues concerning the economics of storage, and concerning the need to adopt scalable collection management techniques to ensure the content we have remains safe and discoverable, and is accompanied by the contextual information it depends upon.

This is not to say obsolescence is no risk at all, but rather that the scale and urgency of the problem are uncertain. Therefore, in order to know how best to take care of our digital history, we need to find ways of gathering more evidence about this issue.

Understanding Our Collections

One aspect of this is to analyse the content we have, and try to understand how it changes over time. Examples of this kind of work include our paper on Formats Over Time (more here), and more recent work on embedding this kind of preservation analysis in our full-text indexing processes so we can explore these issues more readily.

But studying the content can only tell you half the story - for the other half, we need to work out what we mean by obsolescence.

Understanding Obsolescence

If there is an open source implementation of a given format, then we might say that format cannot be obsolete. But if 99.99% of the visitors to the UK Web Archive are not aware of that software (and even if they were, would not be able to compile and build it in order to access the content), is that really an accurate statement? If the members of our designated community can’t open it, then it is surely obsolete, whether or not someone somewhere has all the time, skills and resources needed to make it work.

Obsolescence is a user experience problem that ends with frustration. So how can we better understand the needs and capabilities of our users, to enable them to help drive the digital preservation process?

How Interject Can Help

To this end, and working with the SCAPE Project, we have built a service that demonstrates how we might find the content that users are having difficulties with and, where possible, provide alternative ways of accessing that content. This prototype service, called Interject, explores how a mixture of user feedback and preservation actions can be smoothly integrated into the search infrastructure of the UK Web Archive, by acting as an ‘access helper’ for end users.

ZX Spectrum Software

For example, if you go to our historical search prototype and look for a specific file called ‘lostcave.z80’, you’ll see that the Internet Archive has a number of copies of this old ZX Spectrum game but, unless you have an emulator to hand, you won’t be able to use them. However, if you click ‘Use our access helper’, the Interject service will inspect the resource, summarise what we understand about it, and where possible offer transformed versions of that resource. In the case of ‘lostcave.z80’, this includes a full browser-based emulation so that you can actually play the game yourself. (Note that this example was partially inspired by the excellent work on browser-based emulated access being carried out by the Internet Archive.)

The Interject service can offer a range of transformation options for a given format. For example, instead of running the emulator in your browser, the service can spin up an emulator in the background, take a screenshot, and then deliver that image back to you, like this:

Lost Cave (dynamically generated screenshot)

These simple screenshots are not quite as impressive as the multi-frame GIFs created by Jason Scott’s Screen Shotgun (with more results available here), but they do illustrate the potential of a simple web API that transforms content on demand.
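
To make the idea of offering several transformation options per format a little more concrete, here is a minimal Python sketch of a lookup table of actions. It is not Interject's actual code: the `ACTIONS` registry, the `register`, `options_for` and `transform` helpers, the action labels and the `application/x-spectrum-z80` type string are all assumptions made for illustration, and the screenshot function is only a placeholder.

```python
from typing import Callable, Dict, List, Tuple

Action = Callable[[bytes], bytes]

# Hypothetical registry: source format -> list of (label, function) pairs.
ACTIONS: Dict[str, List[Tuple[str, Action]]] = {}


def register(source_type: str, label: str):
    """Decorator that files a function as an access option for a format."""
    def decorator(func: Action) -> Action:
        ACTIONS.setdefault(source_type, []).append((label, func))
        return func
    return decorator


# The type string below is made up for this sketch.
@register("application/x-spectrum-z80", "screenshot")
def placeholder_screenshot(data: bytes) -> bytes:
    # A real service would start an emulator and capture a frame here;
    # this stand-in just returns a marker so the sketch stays runnable.
    return b"SCREENSHOT-PLACEHOLDER:" + str(len(data)).encode("ascii")


def options_for(source_type: str) -> List[str]:
    """List the labels of the actions on offer for a format (may be empty)."""
    return [label for label, _ in ACTIONS.get(source_type, [])]


def transform(source_type: str, label: str, data: bytes) -> bytes:
    """Run the named action, or raise KeyError if it is not on offer."""
    for name, func in ACTIONS.get(source_type, []):
        if name == label:
            return func(data)
    raise KeyError(f"no action {label!r} for {source_type!r}")
```

A format with no registered actions simply yields an empty list of options, which is the point at which a service like this could still record that a user asked for help.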

Early Image Formats

As the available development time was relatively short, we were only able to add support for a few ‘difficult’ formats. For example, the X BitMap image format was the first image format on the web. However, despite this early and important role, this format and the related X PixMap format (for colour images) are not widely supported today, and so require format conversion in order to enable access. Fortunately, there are a number of open source projects that support these formats, and Interject makes them easy to use. See for example image.xbm, xterm-linux.xpm and this embedded equation image shown below as a more modern PNG:

VRML

We also added support for VRML1 and VRML97, two early web-based formats for 3D environments that required a browser plugin to explore. Those plugins are not available for modern browsers, and the formats have been superseded by the X3D format. Unfortunately, these formats are not backward compatible with each other, and tool support for VRML1 is quite limited. However, we were able to find suitable tools for all three formats and, using Interject, we are able to take a VRML1 file and combine two format conversions (VRML1-to-VRML97 and VRML97-to-X3D) before passing the result to a browser-based X3D renderer, like this.
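
The chaining of conversions described above can be thought of as finding a path through a graph of available converters. The Python sketch below shows one way to do that with a breadth-first search. It illustrates the general technique rather than how Interject implements it, and the `CONVERSIONS` table, the format labels and the `find_chain` function are assumptions made for this example.

```python
from collections import deque

# Hypothetical table of single-step conversions: source -> possible targets.
CONVERSIONS = {
    "vrml1": ["vrml97"],
    "vrml97": ["x3d"],
}


def find_chain(source, target, conversions=CONVERSIONS):
    """Return the shortest list of formats leading from source to target,
    or None if no sequence of known conversions connects them."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in conversions.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


print(find_chain("vrml1", "x3d"))  # ['vrml1', 'vrml97', 'x3d']
```

Because the search is breadth-first, it returns the chain with the fewest steps, which matters because each extra conversion is another chance to lose information.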

The Future Of Interject

Each format we decide to support adds an additional development and maintenance burden, and so it is not clear how sustainable this approach will be in the long term if we work alone. This is one of the reasons why Interject is open source, and we would be very happy to receive ideas or contributions from other individuals and organisations.

Letting Users Lead The Way

But even without any transformation services, the core of this idea is to find ways to listen to our users, so we have some chance of finding out what content is ‘obsolete’ to them. By listening when they ask for help, and by allowing our visitors to decide between the available options, the real needs of our designated communities can be expressed directly to us and so taken into account as part of the preservation planning process.
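
As a sketch of how this kind of feedback might be tallied, the Python below counts help requests per format and records which alternative users pick. The `FeedbackLog` class, its method names and the format labels in the usage note are all hypothetical, and a real service would also need to persist this data and consider user privacy.

```python
from collections import Counter, defaultdict


class FeedbackLog:
    """In-memory tally of help requests and chosen access options."""

    def __init__(self):
        self.help_requests = Counter()       # format -> number of requests
        self.choices = defaultdict(Counter)  # format -> option -> count

    def record_help_request(self, fmt):
        """Note that a user asked for help with a resource of this format."""
        self.help_requests[fmt] += 1

    def record_choice(self, fmt, option):
        """Note which of the offered options the user actually picked."""
        self.choices[fmt][option] += 1

    def most_troublesome(self, n=3):
        """Formats with the most help requests, as (format, count) pairs."""
        return self.help_requests.most_common(n)

    def preferred_option(self, fmt):
        """The most frequently chosen option for a format, or None."""
        counts = self.choices.get(fmt)
        return counts.most_common(1)[0][0] if counts else None
```

For instance, after three help requests for `"xbm"` and one for `"vrml1"`, `most_troublesome(1)` would return `[("xbm", 3)]`, giving a simple signal about where preservation effort might be directed first.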

— OPF

We recently posted an article on the UK Web Archive blog that may be of interest here: User-Driven Digital Preservation

It summarises our work with the SCAPE Project on a little prototype application that explores how we might integrate user feedback and preservation actions into our usual discovery and access processes. The idea is that we need to gather better information about which resources are difficult for users to use, and which formats they would prefer, so that we can use this data to drive our preservation work.

The prototype also provides a convenient way to run Apache Tika and DROID on any URL, and exposes the contents of its internal 'format registry' as a set of web pages that you can browse through (e.g. here's what it knows about text/plain). It only supports a few preservation actions right now, but it does illustrate what might be possible if we can find a way to build a more comprehensive and sustainable system.
