Revitalising the UK Web Archive

Originally published on the UK Web Archive blog on the 8th of June 2017.

It’s been over a year since we made our historical search system available, and it’s proven itself to be stable and useful. Since then, we’ve been largely focussed on changes to our crawl system, but we’ve also been planning how to take what we learned in the Big UK Domain Data for the Arts and Humanities project and use it to re-develop the UK Web Archive.

Our current website has not changed much since 2013, and doesn’t describe who we are and what we do now that the UK Legal Deposit regulations are in place. It only describes the sites we have crawled by permission, and does not reflect the tens of thousands of sites and URLs that we have curated and categorised under Legal Deposit, nor the billions of web pages in the full collection. To try to address these issues, we’re currently developing a new website that will open up and refresh our archives.

One of the biggest challenges is the search index. The 3.5 billion resources we’ve indexed for SHINE represent less than a third of...

Posted: 2017-06-09

Data Mining  Web Archives  BUDDAH

More than just a copy

Following my previous post, a tweet from Raffaele Messuti led me to this quote:

“Computers, by their nature, copy. Typing this line, the computer has copied the text multiple times in a variety of memory registers. I touch a button to type a letter, this releases a voltage that is then translated into digital value, which is then copied into a memory buffer and sent to another part of the computer, copied again into RAM and sent to the graphics card where it is copied again, and so on. The entire operation of a computer is built around copying data: copying is one of the most essential characteristics of computer science. One of the ontological facts of digital storage is that there is no difference between a computer program, a video, mp3-song, or an e-book. They are all composed of voltage represented by ones and zeros. Therefore they are all subject to the same electronic fact: they exist to be copied and can only ever exist as copies.” From Radical Tactics of the Offline Library via an annotation by @atomotic.

Copying is indeed fundamental to how computers function, and we need to understand...

Posted: 2017-04-30

Digital Preservation  Keeping Codes  Lessons Learned

Digital Preservation: Lessons Learned?

I find working in digital preservation fascinating.

It’s not where I expected to end up. I started off interested in computing and science, and happened to find out about what was then a fairly young MPhys degree in Computational Physics offered by the University of York. I then did a Ph.D. in Computational Physics at Edinburgh University, working in statistical physics. After that, I spent my time oscillating between being a post-graduate researcher who used large-scale computational methods, and being a computational specialist who helped other scientists make use of those kinds of techniques.

I’d decided to move away from research and get a ‘normal’ industry programmer job, so when we moved to Leeds I applied for a few different positions. One of them turned out to be for the PLANETS Project, based at the British Library. I liked the place and the people, and the work sounded interesting, allowing me to expand my previous experience (not just in computation, but also the information theory that underlies statistical physics) to a new field. And Industry was spared my woolly ways.

I spent a happy few years working on the PLANETS Project and helping kick off the...

Posted: 2017-04-04

Fighting entropy since 1993

© Dr Andrew N. Jackson — CC-BY