Language is critical, as the IT/DP boundary is tense
Format obsolescence is not as urgent and terrible as advertised
Bit preservation is not solved
We don’t need registries for truth
We don’t need registries for what to do, we need each other. It takes a village…
Prescriptive v Descriptive Linguistics
If we only accept valid PDFs, we are saying: we know best. We understand PDF and we know what’s worth preserving. In this analogy: we know what’s best. We understand the Queen’s English, and any documents will need to be translated before we archive them HERE.
Speculation on easy/‘preservable’ formats and correlated use with different social/economic subgroups.
Every format has two faces/parsing is hard
When you press save, what happens? Lots of decisions. It’s not just the document. What about the undo buffer? What about the window position? What about the language? These can be at installation, user, or file level; it depends.
Format standardisation is precisely a designated community coming together to determine which properties of a Digital (Performance) Object they want to preserve over time.
But constant work is required to generate the illusion of a stable image. Fifty times a second, the ULA reads the screen memory and sends the result out to the television screen, and this process is interlaced with the data loading as the whole thing hums along at 3.5MHz (about a thousand times slower than a modern machine).
OODT is closest thing to ref impl of OAIS
A light-weight pre-premis/mets would be very useful.
Extend bagit with orthogonal data in linewise files
Open source and digital preservation
Poor cohesion of reading lists
More automation possibilities, e.g. UI-script Acrobat Reader to test, i.e. simulate user interactions in
Validation, fast and thorough
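A rough sketch of what ‘fast’ and ‘thorough’ could mean in practice, using the fast-fail versus linting distinction noted further down. The two checks are toy examples (a PDF begins with a `%PDF-` header and carries a `%%EOF` marker near the end); the names `validate`, `CHECKS` and `ValidationError` are hypothetical and not taken from JHOVE or any other real validator.

```python
class ValidationError(Exception):
    """Raised in fail-fast mode at the first problem found."""


def check_header(data):
    # Toy check: a PDF should begin with the "%PDF-" marker.
    if not data.startswith(b"%PDF-"):
        return "missing %PDF- header"
    return None


def check_eof(data):
    # Toy check: look for the "%%EOF" marker in the last 1024 bytes.
    if b"%%EOF" not in data[-1024:]:
        return "no %%EOF marker near end of file"
    return None


CHECKS = [check_header, check_eof]  # hypothetical, deliberately tiny rule set


def validate(data, fail_fast=True):
    """Run every check; stop at the first problem or collect them all."""
    problems = []
    for check in CHECKS:
        problem = check(data)
        if problem is None:
            continue
        if fail_fast:
            raise ValidationError(problem)  # fast: reject as early as possible
        problems.append(problem)  # thorough: keep going, report everything
    return problems
```

Fail-fast suits a gatekeeping step that only needs a yes/no; lint mode suits the case where someone actually wants to understand or repair the file.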
How to help practitioners help?!
Re-deriving significant properties
Note that OAIS
Ingest as odd notion
Normalisation as prejudice
It’s also been interesting to compare the web archiving community with the broader digital preservation community. There are many familiar faces due to the strong overlap between the fields, but there’s also a stronger sense of a unified vision, a preference for practical results, and a more constructive collaboration between researchers and content-holding organisations. On the flip-side, there is something of a silo effect, where web archive material is often poorly integrated into the bigger picture, both in the abstract (e.g. the bigger picture of digital preservation) and the concrete (e.g. we’re only just learning how to integrate web archives with our other holdings).
- Architects and Engineers
- Environment, neutral/abstract versus specific
- Lingua Franca - No-one’s language versus language of work. Esperanto or English.
- Structured data versus extensible + free-form.
- Waterfall versus Agile/Ingest and Iterate.
PDF encryption is weird
- how fast can you hash?
dd if=/dev/zero bs=1M count=1000 | openssl dgst -sha512
# dd if=/dev/zero bs=1M count=1000 | openssl dgst -sha512
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 3.94019 s, 266 MB/s
# hdparm -tT /dev/sda1
Timing cached reads: 17696 MB in 2.00 seconds = 8861.16 MB/sec
Timing buffered disk reads: 332 MB in 3.01 seconds = 110.42 MB/sec
[root@crawler06 python-shepherd]# dd if=/dev/zero bs=1M count=1000 > /dev/null
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 0.0997883 s, 10.5 GB/s
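The same question can be asked without `dd` and `openssl`. A minimal sketch, assuming Python’s standard `hashlib`; the function name `hash_throughput` and its parameters are made up for illustration. It hashes zero-filled blocks (like reading `/dev/zero`) and reports decimal MB/s, the unit `dd` prints. Numbers will differ by machine; in the log above, SHA-512 at 266 MB/s was faster than the buffered disk read at 110.42 MB/s.

```python
import hashlib
import time


def hash_throughput(algorithm="sha512", total_mb=100, block_mb=1):
    """Hash zero-filled blocks and return (MB/s, hex digest)."""
    block = bytes(block_mb * 1024 * 1024)  # zero bytes, like /dev/zero
    blocks = total_mb // block_mb
    digest = hashlib.new(algorithm)
    start = time.perf_counter()
    for _ in range(blocks):
        digest.update(block)
    elapsed = time.perf_counter() - start
    total_bytes = blocks * len(block)
    # Decimal megabytes per second, matching the unit dd reports.
    return total_bytes / elapsed / 1e6, digest.hexdigest()


if __name__ == "__main__":
    rate, hexdigest = hash_throughput()
    print(f"sha512: {rate:.0f} MB/s ({hexdigest[:16]}...)")
```

This measures the hash alone, with no disk or pipe in the way, so it is an upper bound on what a fixity check could manage on real files.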
- Intros to using PDFDebugger and iText RUPS
- Backwards and forwards compatibility and validation.
- http://www.pdfa.org/wp-content/uploads/2014/12/PDF_A_JHOVE_Friese_28112014_en1.pdf http://www.pdfa.org/2014/12/ensuring-long-term-access-pdf-validation-with-jhove/
- Big metadata is a bad smell
- Identification changes over time, PUIDs are not forever
- Format identification is a guessing game
- It starts with Save as…
- ‘Load’ is not the opposite of ‘save’
- Don’t say Digital Object
- We Preserve Processes
- ‘Significant Properties’ Won’t Save Us
- Gap between ‘Significant Properties’ and significant properties.
- Things that work don’t actually describe the performance or the process.
- What are the significant characteristics of the Mona Lisa?
- Formats Define Significant Properties
- Pretending that software is not central is only possible because of all the work that went into making interoperable formats.
- Identification Links Bitstreams to Software
- Formats are behaviours, not properties.
- Validation is unnecessary
- Not always possible, Halting Problem etc.
- Note fast-fail versus linting modes.
- We need mecha suits
- Film scratch removal and other digital re-mastering tools.
- Emulation is a type of migration
- Format obsolescence is not the biggest risk we face
- The biggest preservation risk is economics.
- Storage isn’t solved.
- We don’t know what we’ve got
- SCAPE meeting need for better characterisation
- The biggest preservation risk: unsustainable investment
- Obsolescence is obsolete
- Vendors drive it.
- Open source avoids it, largely.
- Compare with hardware obsolescence.
- We can learn from Adobe Lightroom
- The first AIP should be the SIP (plus just enough provenance to know how/why you have it)
- The AIP Is Never Finished
- OAIS Is Not Enough (models a preserving organisation not preservation itself)
- OAIS Is Still Not Enough (pre-ingest/inner ring - models a repository or an organisation?)
- Pre-Ingest Does Not Exist
- Issue with OAIS: SIP, AIP and DIP are names based on context, and OAIS is a bit fuzzy about the context
- Flexible, the Archive might be part of an organisation, or the whole organisation.
- Provenance does not guarantee trust
- Digital Preservation Is Not A Science
- Like the use of the word Tool, the use of the word Theory is confusing.
- ‘Theory’, but this is more like literature.
- Also crucial importance of value judgements
- Scientific approach is appropriate in many cases
- Emergent behaviour is complex
- But it might look more like medical science
- Format identification needs parent-child, grammar
- .lck (zero bytes) https://twitter.com/britpunk80/status/588278017580990465
- We don’t need corpora to be annotated to learn something useful
- We need a format grammar
- Text encoding example, codecs etc
- MIME is great, missing versions but can be extended.
- We need to understand the barriers to collaboration
- format registries
- Corpora first
- No shame in bit-preservation only
- But keep some renderings, plus software gap and dependencies
- Platform dependencies == none (standard elements only)
- JPEG is okay
- Would you rather have 600dpi JPEG or a 300dpi TIFF?
c.f. DP benchmarking paper
Is it possible? i.e.
- can it cope with lack of consensus over meaning?
- can it cope with context-dependent nature of identification?
- can it cope with non-trivial cases? HTML?
- can it cope with Turing-complete cases?
Even if it is possible, is it feasible?
- We’ve been saying we need format corpora and registries etc for a decade.
- We still have PRONOM, but that’s it (as shared infrastructure)
- Work may be done in private, but global FTEs committed to working on this are very few.
- Even things like CRISP completely failed. What can we learn from that?
- And the amount of work is absolutely massive.
- Who created those benchmarks in other fields? Who paid them to do it? Why?
- Are they different because the creators also control the meaning whereas we are documenting the meanings of others?