Lessons Learned

on Digital Preservation

Story of a bad deed

The CMX 600 of Digital Content Management

In 1971 the CMX 600 began a revolution by being the very first non-linear video editing system.


Lessons Learned

What a good file format looks like (from a preservation perspective):

Web Packaging Exists!

Language is critical; the IT/DP boundary is tense.
Format obsolescence is not as urgent and terrible as advertised.
Bit preservation is not solved.
We don’t need registries for truth.
We don’t need registries for what to do, we need each other. It takes a village…

Prescriptive vs. Descriptive Linguistics. If we only accept valid PDFs, we are saying: we know best; we understand PDF and we know what’s worth preserving. In this analogy: we know what’s best; we understand the Queen’s English, and any documents will need to be translated before we archive them HERE. Speculation: do ‘easy’/‘preservable’ formats correlate with use by different social/economic subgroups?

Every format has two faces/parsing is hard

When you press save, what happens? Lots of decisions. It’s not just the document: what about the undo buffer? What about the window position? What about the language? These can be held at installation, user, or file level; it depends.

Format standardisation is precisely a designated community coming together to determine which properties of a Digital (Performance) Object they want to preserve over time.

Compression is Fraught: I wonder if some of those opposed to compression also avoid using databases as master stores for critical metadata?

Sheridan: code versus the models. It is man-made, so use the language of what it is. No (further) abstraction is necessary, and in fact it gets in the way. The NLA model is only good because it gets closer.

Herbert’s CNI talk. Scholarly communication fragments. Atmosphere at IDCC? How physics worked at different places. How biology worked. Thoughts on adoption. Thoughts on costs.

OAIS is for talking about digital preservation, not for doing it. I think OAIS is better suited to talking about doing digital preservation than to helping get preservation done. It deliberately floats above…

Flexibility versus the principle of least power

SIPs as DIPs, etc. https://twitter.com/euanc/status/922520776384962560 But it’s not a DIP if it’s not leaving your org.

MPLP and layers?

But constant work is required to generate the illusion of a stable image. Fifty times a second, the ULA reads the screen memory and sends the result out to the television screen, and this process is interleaved with the data loading as the whole thing hums along at 3.5 MHz (about a thousand times slower than a modern machine).

OODT is the closest thing to a reference implementation of OAIS.

A lightweight pre-PREMIS/METS would be very useful. Extend BagIt with orthogonal data in line-wise files.
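As a sketch of that idea (the `events.log` file name and its tab-separated fields are hypothetical, not part of the BagIt spec): a standard bag, plus one extra tag file holding orthogonal, line-wise metadata that can be appended to and grepped without an XML toolchain.

```shell
# A minimal BagIt-style bag, assembled by hand (sketch only; paths illustrative).
mkdir -p /tmp/bag/data
echo "payload" > /tmp/bag/data/file.txt

# Standard BagIt pieces: the declaration and a payload manifest.
printf 'BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n' > /tmp/bag/bagit.txt
(cd /tmp/bag && sha512sum data/file.txt > manifest-sha512.txt)

# Hypothetical orthogonal extension: one self-contained record per line,
# so new events can be appended without re-parsing the whole file.
printf 'data/file.txt\tingest\t2017-10-26T12:00:00Z\n' >> /tmp/bag/events.log

# The payload still verifies as an ordinary bag.
(cd /tmp/bag && sha512sum -c manifest-sha512.txt)
```

Because each line is self-contained, ordinary tools (`grep`, `sort`, `tail -f`) work directly on the metadata, which is the point of keeping it out of a monolithic XML document.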

Open source and digital preservation

Poor cohesion of reading lists

More automation possibilities, e.g. UI-scripting Acrobat Reader to test, i.e. simulating user interactions in…

Validation, nope.
Validation, fast and thorough.

How to help practitioners help?! http://anjackson.net/2016/06/08/frontiers-in-format-identification/#comment-2723081343

Re-deriving significant properties.
Note that OAIS Ingest is an odd notion.
Normalisation as prejudice.

It’s also been interesting to compare the web archiving community with the broader digital preservation community. There are many familiar faces, due to the strong overlap between the fields, but there’s also a stronger sense of a unified vision, a preference for practical results, and a more constructive collaboration between researchers and content-holding organisations. On the flip side, there is something of a silo effect, where web archive material is often poorly integrated into the bigger picture, both in the abstract (e.g. the bigger picture of digital preservation) and the concrete (e.g. we’re only just learning how to integrate web archives with our other holdings).

# dd if=/dev/zero bs=1M count=1000 | openssl dgst -sha512
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 3.94019 s, 266 MB/s
(stdin)= a7d483bb9af2ca4b064420d1911d9116b6b609ca312fd7ed919fc1b8be7d1eb57c46f2a6f13380b6dc38f024d17442b4c7b8ecb8c121dc88227d588fc2e04297
# hdparm -tT /dev/sda1
 Timing cached reads:   17696 MB in  2.00 seconds = 8861.16 MB/sec
 Timing buffered disk reads: 332 MB in  3.01 seconds = 110.42 MB/sec
[root@crawler06 python-shepherd]# dd if=/dev/zero bs=1M count=1000 > /dev/null
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 0.0997883 s, 10.5 GB/s
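The benchmark above shows why fixity checking has a real cost: SHA-512 runs at roughly 266 MB/s here, far below the 10.5 GB/s the machine can move data without hashing. A minimal sketch of the routine itself (paths illustrative): record a checksum once at ingest, then recompute and compare later.

```shell
# Record fixity information at ingest time (illustrative paths).
echo "master copy" > /tmp/object.bin
sha512sum /tmp/object.bin > /tmp/object.bin.sha512

# Later, on a schedule: recompute and compare.
# sha512sum -c exits non-zero (and prints FAILED) if the file no longer matches.
sha512sum -c /tmp/object.bin.sha512
```

Scaling this to a large collection is exactly where the throughput numbers above start to matter, since every scheduled audit pass has to re-read and re-hash everything.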


Impossible Standards

c.f. DP benchmarking paper

Is it possible at all?

Even if it is possible, is it feasible?


Fighting entropy since 1993

© Dr Andrew N. Jackson — CC-BY