OPF Blog: Biodiversity and the registry ecosystem

A new OPF blog entry: Biodiversity and the registry ecosystem. Reproduced below...

As Paul has already noted, there are a number of new efforts to crowdsource format information. Superficially, this might look like a duplication of effort, but I don’t think that needs to be the case. In fact I think they could fit together rather neatly, and combine to produce exactly the kind of diversity of registries that we need. Together, they could preserve the representation information we need by operating at three distinct levels, each more sophisticated than the last…

Level 1 – Make it safe

This is the most basic level at which we can preserve representation information. For example, if we find out about a new format, we might seek out the specification or reference implementation in order to understand it better. The first thing we should do is take a copy of those resources and archive them. As long as the source materials are safe, we know we can come back to them later and work out what they mean.

I know a number of institutions do this locally, ensuring that important format documentation, like formal specifications, is archived alongside the material. However, due to the number of formats involved, even this lowest level of preservation is in danger of being rather fragmented, and as it’s all done in private, we can’t tell how complete the overall record is. To address this, we set up the Crowd-sourced Representation Information for Supporting Preservation (CRISP) project, making it easy to pool references to important documentation and get those documents archived as soon as possible.

Level 2 – Make it useful

The next level is to combine and curate this information, in order to address the specific problems we have. In the case of Wikipedia:Computer file formats, this means informing a general audience about common formats and the software that can use them (this is also true for the various file extension look-up sites). For the Just Solve The Format Problem project and the Multimedia Wiki, the audience is more technical, and the primary goal is to ensure we can make use of old files. For the LoC Sustainability of Digital Formats website, the goal is broader, and the information is aimed at digital preservation ‘professionals’.

Critically, this is where we work out what’s worth keeping, and why, and make it easier to find and use. Of course, this creates new information resources that we’ll want to keep safe.

Note that all of these resources are intended primarily for human consumption. Even the IANA MIME Type registry, the format registry that powers the web, boils down to a set of web pages. This is not simply because writing and reading prose is easier than modelling; prose is also the fastest way to get information into a human brain.

Level 3 – Make it powerful

Now we understand what information we need and why, we can start to model that data and encode it in machine-readable forms. This should be considered an optimisation, enabling us to automate preservation processes, and then execute them at scale. The classic preservation example would be PRONOM, where fine-grained information on formats is used to power the DROID identification tool, sparing us from having to manually inspect every file in order to work out what we’ve got. The most sophisticated preservation example is perhaps the XCL toolset, which tried to capture both the syntax and semantics of digital objects in three dedicated languages.

The various format registry projects, such as UDFR, The CASPAR/DCC Representation Information Repository, The Software Ontology, The National Software Reference Library and KEEP TOTEM – the Trustworthy Online Technical Environment Metadata, all belong in this group. As do a number of initiatives from the broader community, such as the file and Apache Tika format identification tools, DBpedia and Freebase.

Of course, these fine-grained models require a large amount of effort, not just to define the models, but more importantly, to fill them out with the data for all the formats. Therefore, we should not be surprised if the volume of information held in these systems lags behind that documented in prose, just as the volume of information in DBpedia necessarily lags behind that in the entirety of Wikipedia. It should also be noted that it is difficult to justify the effort involved in modelling all the available data unless one can envisage an automated process that might exploit it.

Crossing the streams

Practically, these different levels intermix and feed into each other. For example, the LoC crawls the contents of their Sustainability of Digital Formats pages, turning them into a web archive, and thus making something useful safe. Similarly, if the Just Solve The Problem project works, it will be added to the list of resources on the CRISP list (superseding those references that are duplicated in both).

But the difference between the levels remains important. Each is more fine-grained than the last, with less prose and more machine-readable data, and filling out this information requires more and more effort along the way. This is why we have so many nearly-empty registries. We have kept building new systems, each leaping straight to Level 3, but have not addressed the issue of who will use them, and how they fit into the larger information ecosystem. Realistically, we must expect to support all of this representation information at all the different levels, wherever it happens to be stored.

But the distributed and diverse reality of the format registry ecosystem brings with it the dangers of incompleteness and inconsistency, both in terms of coverage and correctness. It also raises the concern that any effort we invest in any particular source of information might be wasted, which acts as a major barrier to contribution. To address these issues, I believe the digital preservation community needs to step up and add one more level. One that touches upon all the others.

Level 4 – Make it right

Full confidence in the quality of the information in these resources requires human effort, but it is possible to focus this effort where it is most needed by exposing the gaps and conflicts between different data sources. The trick is to aggregate the information from all the available sources, so that it can be assessed together. For example, we could pull together all of the information on the PDF format held in all the different registries, normalising the representations as much as possible, thus allowing the overall coverage of this format to be ascertained. Conflicts could be reviewed and then resolved by submitting updated information back to the original source(s), using the results of the aggregation as evidence to support that resolution.
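To make the aggregation idea a little more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the registry names, field names and values are all invented, and no real registry exposes its data in exactly this shape. The point is simply that once records are normalised and grouped by format, conflicts (more than one value for a field) and gaps (a source with nothing to say about a field) fall out almost for free.

```python
from collections import defaultdict

# Hypothetical records: the sources, fields and values below are invented
# for illustration and are not taken from any real registry.
RECORDS = [
    {"source": "registry_a", "format": "PDF", "mime": "application/pdf", "extension": ".PDF"},
    {"source": "registry_b", "format": "pdf", "mime": "Application/PDF", "extension": "pdf"},
    {"source": "registry_c", "format": "PDF", "mime": "application/x-pdf"},
]


def normalise(field, value):
    """Reduce trivial differences in representation (case, leading dots)."""
    value = value.strip().lower()
    if field == "extension":
        value = value.lstrip(".")
    return value


def aggregate(records):
    """Group values as format -> field -> normalised value -> set of sources."""
    merged = defaultdict(lambda: defaultdict(lambda: defaultdict(set)))
    for record in records:
        fmt = normalise("format", record["format"])
        for field, value in record.items():
            if field in ("source", "format"):
                continue
            merged[fmt][field][normalise(field, value)].add(record["source"])
    return merged


def report(merged, all_sources):
    """List conflicting fields and the sources that are silent on a field."""
    conflicts, gaps = [], []
    for fmt, fields in merged.items():
        for field, values in fields.items():
            if len(values) > 1:
                conflicts.append((fmt, field, sorted(values)))
            covered = set().union(*values.values())
            missing = sorted(set(all_sources) - covered)
            if missing:
                gaps.append((fmt, field, missing))
    return conflicts, gaps


sources = sorted({record["source"] for record in RECORDS})
conflicts, gaps = report(aggregate(RECORDS), sources)
print("conflicts:", conflicts)
print("gaps:", gaps)
```

In this toy data the two spellings of the same MIME type collapse into one after normalisation, leaving a single genuine disagreement for a human to review, which is exactly where we want the human effort to go.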

The SCAPE Project has been attempting to validate this approach for a very specific case – format identification data. By running a range of identification tools across large sets of test documents (see e.g. this blog post), the information in the different tools and registries can be compared against one another. Over time, increasing consensus, consistency and coverage can then be used to measure progress. These tactics also encourage information to be shared across different sources, increasing the chances that each contribution will be valued and sustained.
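As a rough sketch of how consensus and coverage might be measured, consider the following Python. This is not the SCAPE Project's actual method or code; the tool names, file names and results are hypothetical, and a real comparison would first need to map each tool's output onto a common vocabulary.

```python
from collections import Counter

# Hypothetical identification results: the tools, files and answers are
# invented for illustration. None means the tool could not identify the file.
RESULTS = {
    "tool_a": {"doc1": "application/pdf", "doc2": "image/tiff", "doc3": "text/plain"},
    "tool_b": {"doc1": "application/pdf", "doc2": "image/tiff", "doc3": None},
    "tool_c": {"doc1": "application/pdf", "doc2": "image/png", "doc3": None},
}


def compare(results):
    """Summarise, per file, how many tools answered and whether they agree."""
    files = sorted({name for answers in results.values() for name in answers})
    summary = {}
    for name in files:
        answers = [results[tool].get(name) for tool in results]
        identified = [answer for answer in answers if answer is not None]
        counts = Counter(identified)
        summary[name] = {
            # Fraction of tools that gave any answer at all.
            "coverage": len(identified) / len(answers),
            # True only if every tool answered and all answers match.
            "consensus": len(identified) == len(answers) and len(counts) == 1,
            "answers": dict(counts),
        }
    return summary


def consensus_rate(summary):
    """Fraction of files on which every tool agrees."""
    agreed = sum(1 for entry in summary.values() if entry["consensus"])
    return agreed / len(summary)


summary = compare(RESULTS)
for name, entry in summary.items():
    print(name, entry)
print("consensus rate:", consensus_rate(summary))
```

Tracking a figure like the consensus rate across successive releases of the tools and registries is one simple way the progress described above could be made visible.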

So, how will you spend November?

The practical upshot of all this is that we should stop worrying about which effort to contribute to and instead contribute to whichever suits our individual needs and abilities. If you’ve just got a link that you know is important, send it to CRISP. If you want to contribute detailed, fine-grained information that can be used for automated planning, go fill out UDFR. If you want to help us identify more formats more reliably, contribute format signatures to DROID or Apache Tika, and maybe make some test files available. If you want to help techies understand old formats, join me on Jason’s Bandwagon. If you want to help less technical audiences understand the general issues, contribute to the Wikipedia Digital Preservation Project. If you are uncertain how to contribute, or want to help aggregate all this data to validate it, talk to us.

It all counts. It won’t be wasted. It will all help.

Let’s make it work. Let’s build an army.