
OPF Blog: What do we mean by format?

723 words
format-registries · Digital Preservation
Andy Jackson

A new OPF blog entry: What do we mean by format? Reproduced below...

Bill’s earlier post and this one from Chris Rusbridge have spurred me to try to describe what I discovered about PRONOM format records during my editable registry experiment. Building that site required a close inspection of the PRONOM Format Record data model, during which I realised that we commonly conflate two quite different ways of defining formats. I suspect we should start to tease them apart.

The two definitions are:

  • Format, as it is specified, e.g. files that conform to the PDF 1.4 specification.
  • Format, as it is used, e.g. PDF files created by Adobe Illustrator CS2.

Flicking through the existing PRONOM records, it is clear that the majority of the most complete records are defined by reference to a specification. Many of the emptiest records correspond to known software, but with poor documentation. In between, the records are mostly thin wrappers around simple names, known internal signatures and file extensions. These different flavours of record share no consistently overlapping data fields that could serve as ‘primary keys’, i.e. fields that uniquely define a format. In other words, we don’t know what a format is.
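To make the missing-primary-key problem concrete, here is a minimal sketch of a registry record as a data structure. The field names are illustrative, not PRONOM’s actual schema; the point is that every descriptive field is optional, so no subset of them reliably identifies a format.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class FormatRecord:
    # The minted identifier is the only guaranteed field.
    puid: str
    # Everything else is optional, so no combination of these
    # fields can act as a 'primary key' for the format itself.
    name: Optional[str] = None                # e.g. "PDF 1.4"
    specification: Optional[str] = None       # spec-defined formats
    creating_software: Optional[str] = None   # software-defined formats
    internal_signatures: List[str] = field(default_factory=list)
    extensions: List[str] = field(default_factory=list)

# Three perfectly legal records, in three different 'flavours',
# with no descriptive fields shared across all of them:
by_spec = FormatRecord("fmt/1", name="PDF 1.4",
                       specification="PDF Reference, third edition")
by_software = FormatRecord("fmt/2",
                           creating_software="Adobe Illustrator CS2")
thin_wrapper = FormatRecord("fmt/3", name="Example Data",
                            extensions=["dat"])
```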

If we are not sure which fields define a format, then I fear that the PRONOM team’s primary focus on creating signatures, rather than documenting formats, is going to sting us in the long term. Without clarity about what it is we are identifying, we risk, for example, accidentally conflating different formats, or drawing artificial distinctions between differently named versions of the same format. We are minting identifiers for ambiguous concepts, and so we must expect those identifiers to be retracted or replaced at some point in the future. What does it mean to mint a permanent identifier for a record when every single aspect of that record is permitted to change?

One alternative to the PRONOM model is the GDFR approach, which defines a format as “a serialized encoding of an abstract information model”, and provides a sophisticated four-level model of what that means (a toy code sketch of these levels follows the list):

  • Abstract information model
    • …mapped via Format encoding model to the…
  • Coded information set (semantic)
    • …mapped via the Format encoding form to the…
  • Structural information set (syntactic)
    • …mapped via the Format encoding scheme (parser/encoder) to the…
  • Serialized byte stream
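To get a feel for the four levels, here is a toy walk down them for a trivial key/value “format”. The level and mapping names are GDFR’s; the toy format and the function bodies are my own illustration, not anything the GDFR itself defines.

```python
def encoding_model(model: dict) -> dict:
    """Format encoding model: abstract information model ->
    coded information set (semantic). Here: fix a canonical
    coding for the abstract keys and values."""
    return {str(k).lower(): str(v) for k, v in model.items()}

def encoding_form(coded: dict) -> list:
    """Format encoding form: coded information set ->
    structural information set (syntactic). Here: one
    'key=value' line per entry, in a fixed order."""
    return [f"{k}={v}" for k, v in sorted(coded.items())]

def encoding_scheme(structure: list) -> bytes:
    """Format encoding scheme (the parser/encoder): structural
    information set -> serialized byte stream."""
    return "\n".join(structure).encode("utf-8")

# Abstract model -> semantic -> syntactic -> bytes:
stream = encoding_scheme(encoding_form(encoding_model({"Title": "Example"})))
print(stream)  # b'title=Example'
```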

The problem is that not all format specifications have these four levels. The levels were inspired by the Unicode character encoding model, but (as that document itself indicates) other specifications use different numbers of levels. RDF has three; HTML5 has three that define the markup semantics, but uses more to link the mark-up to the behaviours and other features of the interpretation/renderer. Furthermore, formats defined only by software have only the lowest rungs of this scheme (data and parser/encoder). Such formats have no abstract information model, just an in-memory representation and an interpretation/performance. Even this mapping conflates the formal specification of the parser/encoder with its implementation – if we are being perfectly strict, the only thing the two perspectives have in common is the bytestream itself.

Conflating these different ways of defining format makes it difficult to describe the cases where conflict arises. We have probably all come across files that are handled perfectly well by the software but break the specification, or indeed formats that have no formal specification at all. We need to be able to describe these difficult cases. Perhaps we should be minting distinct identifiers for format specifications and format implementations instead? This could be done by deferring to the specification document itself, rather than trying to model its contents, and would still allow us to distinguish between a bitstream that conforms to a given standard and a bitstream that is parseable using a particular implementation of that standard.
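A rough sketch of what that split might look like, with hypothetical names throughout: one identifier type that simply defers to the specification document, one that points at a piece of software, and two independent questions we can then ask of a bitstream.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FormatSpecification:
    """Identified by deferring to the spec document itself,
    rather than modelling its contents."""
    spec_id: str       # e.g. "spec/pdf-1.4" (hypothetical)
    document_uri: str  # where the specification lives

@dataclass(frozen=True)
class FormatImplementation:
    """Identified by the software that reads/writes the bytes."""
    impl_id: str       # e.g. "impl/adobe-illustrator-cs2" (hypothetical)
    software: str

def conforms_to(bitstream: bytes, spec: FormatSpecification) -> bool:
    """Does the bitstream conform to the standard? A validator
    would go here; stubbed for this sketch."""
    raise NotImplementedError

def parseable_by(bitstream: bytes, impl: FormatImplementation) -> bool:
    """Can this implementation parse the bitstream? Would invoke
    the actual software; stubbed for this sketch."""
    raise NotImplementedError

# The two predicates are independent, which is exactly what lets us
# describe a file that an application handles happily even though it
# breaks the specification (or a format with no specification at all).
```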

I think PRONOM are aware of the limitations of their model, but are going to go ahead and get the data out first anyway. Simultaneously, it looks like UDFR are proceeding with their own ontology, presumably based on the GDFR model. In general, I think just pushing the data out there first (a.k.a. raw data now) is a reasonable approach, because we can always review and consolidate later, and doing it this way around helps ensure that the consolidation is based on real data. But I can’t shake the feeling that we are taking the long way round.