OPF Blog: Breaking Down The Format Registry

A new OPF blog entry: Breaking Down The Format Registry. Reproduced below the fold.

At the hackathon it was clear that the identification discussion started by Fido represented an archetypal example of why this community wants to work together. No matter what the institution, whatever the context or workflow, we all need reliable tools for identifying files and formats. Of course, reliable identification requires reliable identifiers, and so the discussion about the tools is necessarily intertwined with the idea of a format registry.

The problem is that the format registry concept comes with quite a lot of baggage. The idea has grown to include all sorts of different information we think we might want to share, and all sorts of information and processes we imagine might be useful. For example, the PRONOM data model is not just about minting format identifiers, but has grown to include information on format specification documents and other representation information, internal and external signatures, software and tools, and the properties of formats. This trend can be seen in the UDFR plans and the Planets Core Registry developments as well as the PRONOM data structures.

Similarly, in terms of processes, the various designs and/or implementations have grown to include more than just publishing format information on the web (as per PRONOM). We want to make the data editable over the web (e.g. Planets Core Registry), allow synchronisation and/or merging of data across format registries from various sources (design plans for GDFR/UDFR, PCR), and publish and consume the format data as Linked Data (e.g. p2 and the PRONOM reboot).

These are quite ambitious aims, and we only have limited resources to achieve them. So how should we proceed? Firstly, of course, we need to look closely at our requirements and work out what we need. But, when that is clearer, how do we go about implementing the solution?

I don't have an answer, but as we decide what to do, I want to make it clear that although we may associate all of this complex functionality under a single 'Format Registry' banner, this does not, and should not, imply that one software system must cover all of these requirements. Indeed, I believe a much stronger implementation can be built by combining established tools and pinning down how they should interact with each other.

For example, my current thinking on the data model is that the format registry design should only be concerned with minting identifiers for formats, and collecting the minimal representation information required to support this. I think that if we can get this right, then we can use those identifiers to describe the other entities, like software tools and digital object characteristics, but without forcing them all into the same system. The XCL tools, Planets Ontology, JHOVE and JHOVE2 projects have been trying to capture object characteristics for some time, and I would argue that it is still not clear how best to proceed. Similarly, KEEP is putting a lot of effort into working out how to collect information on tools and software environments. We cannot accurately prejudge what the mature approach from these efforts will be, and we should not underestimate or duplicate the effort involved. However, if we can all share the same format identifier scheme(s) (as we do now!), then we could build separate 'registries' for the information coming out of all that work without binding the different systems together into one big framework and trying to keep it all up to date.
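To make the idea of separate registries sharing one identifier scheme more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the record fields, the tool name `example-validator`, the property names, and the function `describe` are all hypothetical, and the identifiers are merely written in the PRONOM PUID style. It is not how PRONOM or any existing registry is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FormatRecord:
    """Core registry entry: an identifier plus minimal descriptive data."""
    identifier: str
    name: str


# Core format registry: its only job is minting identifiers.
# Identifiers and names are illustrative only.
FORMATS = {
    "fmt/18": FormatRecord("fmt/18", "Example PDF variant"),
    "x-fmt/111": FormatRecord("x-fmt/111", "Example plain text"),
}

# A separate, independently maintained 'tool registry' (hypothetical data).
# It refers to formats only through the shared identifiers.
TOOLS = {
    "example-validator": ["fmt/18"],
    "example-viewer": ["fmt/18", "x-fmt/111"],
}

# Another separate 'characteristics registry' (hypothetical data).
PROPERTIES = {
    "fmt/18": ["pageCount", "isEncrypted"],
}


def describe(identifier: str) -> dict:
    """Join the independent registries on the shared format identifier."""
    record = FORMATS[identifier]  # raises KeyError for unknown identifiers
    tools = sorted(t for t, ids in TOOLS.items() if identifier in ids)
    return {
        "identifier": record.identifier,
        "name": record.name,
        "tools": tools,
        "properties": PROPERTIES.get(identifier, []),
    }
```

The point of the sketch is that `TOOLS` and `PROPERTIES` could live in entirely different systems, maintained by different projects, and still be combined on demand as long as they agree on the keys.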

Similarly, if the core format concepts and identification mechanism can be pinned down, it will make it a lot easier to pull in information from other sources. PRONOM is hardly the only thing out there that you could call a 'format registry'. We have the immensely successful MIME Media Type registry, as well as the Library of Congress format registry, file/libmagic, Wikipedia/DBpedia and Freebase. It would be great to be able to take advantage of some of the work that other communities are doing in this area.

Returning to the registry processes and functionality, I think that by adopting a simple XML format for the format data (probably based on the PRONOM Report schema), we can manage the data using off-the-shelf tools. For example, if we consider the format data as content (as Adam was arguing), then I think we can use an off-the-shelf CMS as a web interface for collaborative editing and reviewing of the format data. As long as this data can be pulled and pushed into the CMS as XML, we can manage the 'archival copy' of the format registry data as XML in a version control repository. Using a distributed VCS like git would make it relatively simple to manage, merge and synchronise different sources of data. Again, very little coding would be required to make this happen - just enough to tie the versioning system and the CMS together via the standardised XML format. Then, having established an editorial process based on a CMS and VCS, the 'production' version of the data could be published for wider consumption as XML and Linked Data, e.g. via the p2 registry.
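As a rough sketch of the kind of glue code this might involve, the Python below writes a format record to XML and reads it back. The element and attribute names (`format`, `id`, `name`, `extension`) are invented for illustration and are not taken from the PRONOM Report schema; the function names are likewise hypothetical. The one design choice worth noting is that fields are written in sorted order, one per line, so that the same record always serialises to the same text and a line-based VCS such as git can produce small, readable diffs.

```python
import xml.etree.ElementTree as ET


def record_to_xml(record: dict) -> str:
    """Serialise a flat record to XML with a stable, line-per-field layout.

    Assumes every key other than "id" is a valid XML element name and
    every value is a string. The element names are hypothetical.
    """
    root = ET.Element("format", {"id": record["id"]})
    fields = sorted((k, v) for k, v in record.items() if k != "id")
    if fields:
        root.text = "\n  "
    for index, (key, value) in enumerate(fields):
        child = ET.SubElement(root, key)
        child.text = value
        # Indent every field; the last one closes back to column zero.
        child.tail = "\n" if index == len(fields) - 1 else "\n  "
    return ET.tostring(root, encoding="unicode") + "\n"


def xml_to_record(text: str) -> dict:
    """Parse XML produced by record_to_xml back into a flat record."""
    root = ET.fromstring(text)
    record = {"id": root.get("id")}
    for child in root:
        record[child.tag] = child.text or ""
    return record
```

A CMS export step could call `record_to_xml` for each record and commit the resulting files, while an import step would call `xml_to_record` on the files pulled from the repository. Because key order does not affect the output, two editors who enter the same data in a different order would not generate a spurious change.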

I've been experimenting with using a web CMS to manage the format data and the basic editorial workflow, and I believe it presents a practical way forward. I'll be posting a follow-up blog on how that might work over the next few days.

Of course, the real work is in working out what information we want to share, what we want to do with it, and how we're going to validate it. I await Bill's work on these issues with keen interest. Once we do have more solid requirements, I hope we look to see how much we can achieve by combining tools that already exist, rather than re-inventing a whole load of wheels.