The Web Archive and the Catalogue


The British Library has a long tradition of preserving the heritage of the United Kingdom, and processes for handling and cataloguing print-based media are deeply ingrained in the organisation’s structure and thinking. However, as an increasing number of government and other publications move towards online-only publication, we are forced to revisit these processes and explore what needs to be changed in order to avoid the web archive becoming a massive, isolated silo, poorly integrated with other collection material. We have started this journey by looking at how we collect official documents, like government publications and e-journals. As we are already tasked with archiving UK web publications, the question is not so much ‘how do we collect these documents?’ but rather ‘how do we find the documents we’ve already collected?’. Our current methods for combining curatorial expertise with machine-generated metadata will be discussed, leading to an outline of the lessons we have learned. Finally, we will explore how the ability to compare the library’s print catalogue data with the web archive enables us to study the steps institutions and organisations have taken as they have moved online.


Posted: 2017-06-28

Data Mining  Web Archives  Digital Preservation  webarchive-discovery

Revitalising the UK Web Archive

Originally published on the UK Web Archive blog on the 8th of June 2017.

It’s been over a year since we made our historical search system available, and it’s proven itself to be stable and useful. Since then, we’ve been largely focussed on changes to our crawl system, but we’ve also been planning how to take what we learned in the Big UK Domain Data for the Arts and Humanities project and use it to re-develop the UK Web Archive.

Our current website has not changed much since 2013, and doesn’t describe who we are and what we do now that the UK Legal Deposit regulations are in place. It only describes the sites we have crawled by permission, and does not reflect the tens of thousands of sites and URLs that we have curated and categorised under Legal Deposit, nor the billions of web pages in the full collection. To try to address these issues, we’re currently developing a new website that will open up and refresh our archives.

One of the biggest challenges is the search index. The 3.5 billion resources we’ve indexed for SHINE represent less than a third of...


Posted: 2017-06-09

Data Mining  Web Archives  BUDDAH  webarchive-discovery

More than just a copy

Following my previous post, a tweet from Raffaele Messuti led me to this quote:

“Computers, by their nature, copy. Typing this line, the computer has copied the text multiple times in a variety of memory registers. I touch a button to type a letter, this releases a voltage that is then translated into digital value, which is then copied into a memory buffer and sent to another part of the computer, copied again into RAM and sent to the graphics card where it is copied again, and so on. The entire operation of a computer is built around copying data: copying is one of the most essential characteristics of computer science. One of the ontological facts of digital storage is that there is no difference between a computer program, a video, mp3-song, or an e-book. They are all composed of voltage represented by ones and zeros. Therefore they are all subject to the same electronic fact: they exist to be copied and can only ever exist as copies.” From Radical Tactics of the Offline Library via an annotation by @atomotic.

Copying is indeed fundamental to how computers function, and we need to understand...


Posted: 2017-04-30

Digital Preservation  Keeping Codes  Lessons Learned


Fighting entropy since 1993

© Dr Andrew N. Jackson — CC-BY