Tools for Legal Deposit

Before I revisit the ideas explored in the first post in the blog series, I need to go back to the start of this story…

Between 2003 and 2013 – before the Non-Print Legal Deposit regulations came into force – the UK Web Archive could only archive websites by explicit permission. During this time, the Web Curator Tool (WCT) was used to manage almost the entire life-cycle of the material in the archive. Initial processing of nominations was done via a separate Selection & Permission Tool (SPT), and the final playback was via a separate instance of Wayback, but WCT drove the rest of the process.

Of course, selective archiving is valuable in its own right, but this was also seen as a way of building up the experience and expertise required to implement full domain crawling under Legal Deposit. However, WCT was not deemed to be a good match for a domain crawl. The old version of Heritrix embedded inside WCT was not considered very scalable, was not expected to be supported for much longer, and was difficult to re-use or replace because of the way it was baked inside WCT.


Posted: 2017-10-19

Web Archives  Digital Preservation

Digging Documents Out Of The Archived Web


As an increasing number of government and other publications move towards online-only publication, we are forced to revisit our traditional Legal Deposit processes based on cataloging printed media. As we are already tasked with archiving UK web publications, the question is not so much ‘how do we collect these documents?’ but rather ‘how do we find the documents we’ve already collected?’. This presentation will explore the issues we’ve uncovered as we’ve sought to integrate our web archives with our traditional document cataloging processes, especially around official publications and e-journals. Our current Document Harvester will be described, and its advantages and limitations explored. Our current methods for exploiting machine-generated metadata will be discussed, and an outline of our future plans for this type of work will be presented.


Posted: 2017-06-30

Data Mining  webarchive-discovery

Can a Web Archive Lie?

This is the script for the introduction I gave as part of a ‘Digital Conversations at the BL’ panel event: Web Archives: truth, lies and politics in the 21st century on Wednesday 14th of June, as part of Web Archiving Week 2017.

My role is to help build a web archive for the United Kingdom that can be used to separate fact from fiction. But to do that, you need to be able to trust that the archived content we present to you is what it purports to be.

Which raises the question: Can a web archive lie?

Well it can certainly be confusing. For example, one seemingly simple question that we are sometimes asked is: How big is the UK web? Unfortunately, this is actually quite a difficult question. First, unlike print, many web pages are generated by algorithms, which means the web is technically infinite. Even putting that aside, to answer this question precisely we’d need to capture every version of everything on the web. We just can’t do that. Even if we could download every version of everything we know about, there’s also the problem of all the sites that we failed to even...


Posted: 2017-06-29

Data Mining

The Web Archive and the Catalogue


The British Library has a long tradition of preserving the heritage of the United Kingdom, and processes for handling and cataloguing print-based media are deeply ingrained in the organisation’s structure and thinking. However, as an increasing number of government and other publications move towards online-only publication, we are forced to revisit these processes and explore what needs to be changed in order to avoid the web archive becoming a massive, isolated silo, poorly integrated with other collection material. We have started this journey by looking at how we collect official documents, like government publications and e-journals. As we are already tasked with archiving UK web publications, the question is not so much ‘how do we collect these documents?’ but rather ‘how do we find the documents we’ve already collected?’. Our current methods for combining curatorial expertise with machine-generated metadata will be discussed, leading to an outline of the lessons we have learned. Finally, we will explore how the ability to compare the library’s print catalogue data with the web archive enables us to study the steps institutions and organisations have taken as they have moved online.


Posted: 2017-06-28

Data Mining  Web Archives  Digital Preservation  webarchive-discovery

Revitalising the UK Web Archive

Originally published on the UK Web Archive blog on the 8th of June 2017.

It’s been over a year since we made our historical search system available, and it’s proven itself to be stable and useful. Since then, we’ve been largely focussed on changes to our crawl system, but we’ve also been planning how to take what we learned in the Big UK Domain Data for the Arts and Humanities project and use it to re-develop the UK Web Archive.

Our current website has not changed much since 2013, and doesn’t describe who we are and what we do now that the UK Legal Deposit regulations are in place. It only describes the sites we have crawled by permission, and does not reflect the tens of thousands of sites and URLs that we have curated and categorised under Legal Deposit, nor the billions of web pages in the full collection. To try to address these issues, we’re currently developing a new website that will open up and refresh our archives.

One of the biggest challenges is the search index. The 3.5 billion resources we’ve indexed for SHINE represent less than a third of...


Posted: 2017-06-09

Data Mining  Web Archives  BUDDAH  webarchive-discovery

More than just a copy

Following my previous post, a tweet from Raffaele Messuti led me to this quote:

“Computers, by their nature, copy. Typing this line, the computer has copied the text multiple times in a variety of memory registers. I touch a button to type a letter, this releases a voltage that is then translated into digital value, which is then copied into a memory buffer and sent to another part of the computer, copied again into RAM and sent to the graphics card where it is copied again, and so on. The entire operation of a computer is built around copying data: copying is one of the most essential characteristics of computer science. One of the ontological facts of digital storage is that there is no difference between a computer program, a video, mp3-song, or an e-book. They are all composed of voltage represented by ones and zeros. Therefore they are all subject to the same electronic fact: they exist to be copied and can only ever exist as copies.” From Radical Tactics of the Offline Library via an annotation by @atomotic.

Copying is indeed fundamental to how computers function, and we need to understand...


Posted: 2017-04-30

Digital Preservation  Keeping Codes  Lessons Learned


Fighting entropy since 1993

© Dr Andrew N. Jackson — CC-BY