OPF Blog: Format Obsolescence and Sustainable Access

A new OPF blog entry: Format Obsolescence and Sustainable Access. Reproduced below…

As David Rosenthal pointed out, as long as there is a piece of commercial software or an open source project capable of accessing a format, it cannot be considered truly obsolete. I agree, but I fear that this ‘absolute’ format obsolescence is a poor proxy for the real problem, which is to ensure that our content is not just kept safe, but also remains accessible to our readers both now and in the (near) future. I am perfectly able to compile an open source software application, but I’m not everybody. Indeed, the British Library is committed to providing continuous access for a wide range of people who are almost entirely not me.

David Rosenthal touched upon this issue in an older post, and noted that relying on an open source stack may make things inconvenient for the reader. I think that this is something of an understatement. This is not a matter of mere convenience because the majority of our readers are not able to compile software themselves. This means we need a ‘user friendly’ access channel, and just knowing about a commercial or open source renderer that says it can cope with some given data format does not meet this need. We need to understand how well the available tools actually support the material in our collections, and we need to understand the costs associated with providing these tools as services to our users. I do not believe that these requirements are peculiar to the British Library, as all digital content holders must understand the level of access and support they are required to provide. What is the point of keeping the bits safe if your user community cannot use the content effectively?

Of course, the same argument that David Rosenthal raised about format obsolescence also applies to the cost of providing sustainable access. The majority of the British Library’s content items are in formats like PDF, TIFF and JP2, and these formats cannot be considered ‘at risk’ on any kind of time-scale over which one might reasonably attempt to predict. Therefore, for this material, we take a more ‘relaxed’ approach, because provisioning sustainable access is not difficult.

Unfortunately, a significant chunk of our collection is in formats that are not widely used, particularly when we don’t have any way to influence what we are given (e.g. legal deposit material). Because of this, we have a lot of ‘exceptions to the norm’, made up of a lot of different formats and styles of format usage, and while each alone is not that significant as a percentage of the whole, between them they add up to an awful lot of content. This is further complicated by the fact that the ‘value’ of the content and the level of access we are committed to providing are not generally correlated with the popularity of the format that the content happens to be stored in. Just understanding the landscape of formats, performance and value represents a major challenge.
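One way to get a first view of this landscape is to tally the formats in a collection and see how much content sits outside the few dominant ones. The following is a minimal sketch, not a description of any British Library tooling: the inventory, the format labels and the 5% threshold are all invented for illustration.

```python
from collections import Counter

def long_tail(formats, threshold=0.05):
    """Split out the 'rare' formats by their share of items.

    `formats` holds one format label per content item. `threshold` is a
    hypothetical cut-off: a format below this share counts as rare.
    Returns the rare formats with their counts, and their combined share.
    """
    counts = Counter(formats)
    total = sum(counts.values())
    rare = {f: n for f, n in counts.items() if n / total < threshold}
    tail_share = sum(rare.values()) / total
    return rare, tail_share

# Invented inventory: three dominant formats plus ten small ones.
inventory = ["pdf"] * 40 + ["tiff"] * 25 + ["jp2"] * 15
inventory += [f"rare-{i}" for i in range(10) for _ in range(2)]

rare, tail_share = long_tail(inventory)
print(len(rare), "rare formats hold", f"{tail_share:.0%}", "of items")
```

A tally like this says nothing about value or rendering quality, which is the harder part of the problem, but it does show how individually minor formats can add up.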

The UK Sound Map project provides a good example of the problems we face. The original audio ‘master’ submitted to us arrives in one of a wide range of formats, depending upon the make, model and configuration of the source device (usually a mobile phone). Many of these formats are ‘exceptional’, and cannot be relied upon for access now (never mind the future!). How should we cope with this complexity? Well, we already use Broadcast Wave for archiving sound, and we think that this format is a reasonable super-set of the formats that the data arrives in (although perhaps not the embedded metadata). In this case, normalisation to .wav should be feasible to implement, and would massively reduce the cost of preserving access by reducing those variable data carriers down to one. Naturally, we wish to keep the original file so that we can go back to it if necessary, but as the collection is of modest size, this does not represent a major problem in terms of data volume. Thus, a normalisation strategy allows us to preserve the data and make it available in a cost-effective and sustainable manner. This is similar to the approach used by Archivematica.
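As a rough illustration of what such a normalisation step might look like, the sketch below builds (but does not run) a command that writes a WAV copy alongside an untouched original. The choice of ffmpeg, the file names and the directory layout are assumptions made for this example, not a description of the actual workflow. It produces plain PCM WAV and does not address Broadcast Wave metadata or the embedded metadata of the original.

```python
from pathlib import Path

def normalisation_command(source, archive_dir):
    """Build an ffmpeg command that writes a WAV copy of `source`.

    The original file is only read, never modified or deleted; the WAV
    copy is placed in `archive_dir` (a hypothetical location).
    """
    source = Path(source)
    target = Path(archive_dir) / (source.stem + ".wav")
    # -n            never overwrite an existing output file
    # -c:a pcm_s16le  encode the audio as 16-bit little-endian PCM
    return ["ffmpeg", "-n", "-i", str(source),
            "-c:a", "pcm_s16le", str(target)]

# Hypothetical upload from a mobile phone.
cmd = normalisation_command("uploads/recording-001.3gp", "normalised")
print(" ".join(cmd))
```

Returning the command rather than executing it keeps the sketch free of side effects; a real workflow would also need to record the link between each original and its normalised copy.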

In short, we are not just preserving content, but also preserving access. This means that the long-term cost of preserving our collection scales not only with the size of the files, but also with the number of formats we are required to support. We have some very varied collections, and so we need to understand how many formats we can reasonably support and how this meets our readers’ needs. For formats that are not widely used, the costs may be so high that the format becomes ‘obsolete’ at the institutional level, and the content is effectively lost as far as the majority of our readers are concerned. Under these circumstances, it may be economically justifiable to perform pre-emptive preservation actions in order to reduce the size of the content or the number and rarity of the formats that we must support.
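The trade-off can be made concrete with a toy cost model. This is only a sketch: the linear form and every figure in it are assumptions chosen for illustration, not British Library costs.

```python
def access_cost(total_tb, n_formats, cost_per_tb, cost_per_format):
    """Toy model: storage cost plus a fixed cost per supported format.

    The linear form is an assumption; it only captures the idea that
    cost grows with both data volume and the number of formats.
    """
    return total_tb * cost_per_tb + n_formats * cost_per_format

# Hypothetical collection: 2 TB of originals spread across 30 formats.
before = access_cost(2.0, 30, cost_per_tb=100.0, cost_per_format=500.0)

# After normalisation: originals are kept and normalised copies added
# (5 TB in total, an invented figure), but access relies on one format.
after = access_cost(5.0, 1, cost_per_tb=100.0, cost_per_format=500.0)

print(before, after)
```

With these invented figures the extra storage is small next to the saving from supporting fewer formats, but the balance depends entirely on the real per-terabyte and per-format costs.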