
Communities of Digital Preservation

IIPC Blog Post on Preserving the Web with Open Source
Building-Web-Archives Digital Preservation Web Archives
Andy Jackson
Fighting entropy since 1993

By Andy Jackson, Web Archiving Technical Lead at the British Library (until January 2024)

I joined the UK Web Archive early in 2012, during the build-up to our very first UK domain crawl. As I started to understand what the team did, it became very clear that the collaboration with the wider IIPC web archiving community had been crucial to the team’s success, and would be a vital part of our future work.

The knowledge sharing and socialising at the IIPC conferences provide the fundamental rhythm, but the web archiving community has arranged all sorts of beats over that bass drum. Not just through special events, both online and in person (e.g. technical training and a hackathon held at the British Library), but also through the way we build our shared tools. My research career had often involved using open source software, but in web archiving I began to understand how those same approaches had been used to share the load of developing standard practices, embodied by specialist tools. I also began to see how this could empower people and organisations to run their own web archiving operations.

Buy or Build?

While the public awareness of web archiving has certainly risen over the years, it remains something of a niche concern. It has been over twenty years since a small group of cultural heritage organisations kicked things off, writing and sharing their own tools to archive the web. In the intervening years the heritage community has grown a great deal, but most of today’s archival web crawlers are still built on those first foundations. There seems to be a reasonable market for ‘medium-scale’ web archiving, with a few different vendors offering various services at that scale. But at the extremes, with personal web archiving at one end and Legal Deposit domain crawls at the other, there are all sorts of constraints that make it difficult to take advantage of those commercial offerings.

Sometimes, you have to build your own tools. But, if you must build your own, you can try to find others with similar needs and look for common ground to share. Open source licences and development practices have clearly been pivotal to helping this happen in web archiving, leading to the widespread use of Heritrix for web crawling and of the original Java Wayback playback engine. This was a success story I wanted to join in with, and a community I wanted to help grow.

Barriers to Collaboration

Seeing this historical success, I took it for granted that of course our institutions would understand and support this. That anyone using these tools would be able and keen to collaborate. Why keep fixing the same bugs alone when we could fix each one once by working together?

That was very naive of me. There are lots of reasons why the open source model of collaboration can be difficult to adopt. The relationships between organisational needs and Information Technology service delivery are incredibly varied and complex. It can be very difficult to get the space and permission to experiment. It can be extremely difficult to build up or pull in the skills we need.

Even where people would like to collaborate more, there are often perfectly understandable personal or professional constraints that mean they can’t just pitch in. I am very fortunate that my direct managers and colleagues at the British Library supported my strategy of working in the open. I am also fortunate that I risk very little by doing so. It took me a while to realise what a privilege that is.

The desire to overcome these barriers was part of the reason why I helped start up the regular Online Hours calls to support the teams and individuals who rely on our shared tools, and provide a safe and friendly forum for anyone who is interested in talking about them.

Investing in Open Source

I’ve also tried to support and encourage direct investment in shared tooling, both through IIPC and the British Library. I’ve been particularly pleased by the project to extend the GLAM Workbench to explore web archives, the project to help IIPC members make use of the Browsertrix Cloud crawl system, and the project to help everyone move from OpenWayback to pywb. It’s also been great to see the increased adoption of the webarchive-discovery WARC indexing toolkit, largely driven by the excellent SolrWayback search interface project.

In January, I left the British Library to work at the Digital Preservation Coalition. I suspect I’ll reconnect with web archiving at some point in the future, in one form or another, but for now, I’m looking forward to taking what I’ve learned and applying it anew. Because at some point I realised that open source isn’t just about making do with not-much money. It’s about digital preservation too.

Critical Dependencies

One of the core concepts in digital preservation is the idea of Representation Information, which provides a way to formally recognise the additional information we need to make our collections accessible. Crucially, this includes software. After all, the thing that makes digital objects digital is the fact that we need software to use them.

This is where proprietary systems can become a significant risk to digital preservation. Perhaps the most important part of digital preservation is identifying single points of failure within the chain of dependencies that access requires. If playback depends on a single service provider, it’s at risk. Long-term preservation demands interoperability, which is why the WARC standard exists in the first place.

The WARC standard is our foundation stone, but that alone is not enough to make those frozen fragments sing. We can’t grasp what landed in our ‘response’ records without being able to understand the mechanisms that put them there. And we can’t analyse and explore our petrified webs without the software tools that bring them to life. There is no ‘ISO standard for playback’ (and I doubt such a thing is even possible), so we must instead preserve the software that makes playback work. This is why having at least one open source playback system is a crucial concern for the members of the IIPC.

But this is not just true for web archiving. This same story plays out across the whole of digital preservation. The wider shift to open source, and the work that the global community has put into open source implementations of widespread formats, has become the backbone of every digital preservation programme. We’re not out here re-implementing libtiff, or writing PDF readers based on the ISO spec. We’re all re-using open source implementations that are being maintained by the wider community. We’re all in the business of preserving software, at least to some degree.

Communities of Practice

The success of the community-maintained Web Archiving Awesome List, the way organisations have transitioned to pywb (like this) and the growing support for Browsertrix Cloud show that the web archiving community understands this: one way to sustainable, shared practices is through shared tools as well as common purpose. These tactics not only help established archives do their work, but also make it easier for ‘younger’ archives to join in and so grow the community around those tools.

My new role is all about helping digital preservation practitioners discover and build on the good practice of others. I will take what I’ve learned from web archiving with me, and come back to this community as an exemplar of what we can achieve when we work together.