Digging Documents Out Of The Archived Web

Posted: 2017-06-30


As an increasing number of government and other publications move towards online-only publication, we are forced to move away from our traditional Legal Deposit processes, which are based on cataloguing printed media. As we are already tasked with archiving UK web publications, the question is not so much ‘how do we collect these documents?’ but rather ‘how do we find the documents we’ve already collected?’. This presentation will explore the issues we’ve uncovered as we’ve sought to integrate our web archives with our traditional document cataloguing processes, especially around official publications and e-journals. Our current Document Harvester will be described, and its advantages and limitations explored. Our current methods for exploiting machine-generated metadata will be discussed, and an outline of our future plans for this type of work will be presented.

These are the slides for the presentation I gave as part of Web Archiving Week 2017, on Thursday 15th of June.

Journey of a (print) collection item

Original Digital Processing Workflow

Document Harvester Workflow

What is a Publication?

Example gov.uk publication

MARC & Cataloguing Standards

Metadata Extraction

Example gov.uk API Data

Resolving References

Layers of Transformation

Oh No! I Made Another Chain

Future Experimentation





© Dr Andrew N. Jackson — CC-BY