
Can a Web Archive Lie?

Your lyin' archives
Andy Jackson

This is the script for the introduction I gave as part of a ‘Digital Conversations at the BL’ panel event: Web Archives: truth, lies and politics in the 21st century on Wednesday 14th of June, as part of Web Archiving Week 2017.

My role is to help build a web archive for the United Kingdom that can be used to separate fact from fiction. But to do that, you need to be able to trust that the archived content we present to you is what it purports to be.

Which raises the question: Can a web archive lie?

Well it can certainly be confusing. For example, one seemingly simple question that we are sometimes asked is: How big is the UK web? Unfortunately, this is actually quite a difficult question. First, unlike print, many web pages are generated by algorithms, which means the web is technically infinite. Even putting that aside, to answer this question precisely we’d need to capture every version of everything on the web. We just can’t do that. Even if we could download every version of everything we know about, there’s also the problem of all the sites that we failed to even try to capture.

A web archive can also be misleading – most obviously through omission. Sometimes, this might be because of unintended biases introduced by the process by which we select sites for inclusion and higher-quality crawling. Having an open nominations process for the archive can help, but the diversity of those involved with web archives is pretty low. We also know that we lose a lot of content due to the complexity of current web sites and the limitations of our current crawling technologies.

A web archive can also mislead in other ways. When presenting web archives, we use the date we captured the resource as our time axis. This matters because users usually expect documents to be arranged by their date of publication. But we generally don’t know the publication date, and because of the way web crawling works, the timeline can get mixed up: the crawler tends to discover documents based on their popularity, in terms of how many other sites link to them. With fast-moving events like news and current affairs, this can become very misleading and is something I expect we’ll have to address more directly in the future.

One way to do this is to start to bring in more structured data from multiple sources, like Twitter or other APIs. These systems usually do provide authoritative publication dates and timestamps. The trick is going to be working out how to blend these different data sources together to improve the way we present our timeline.
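To make that a little more concrete, here is a rough sketch in Python of what that blending might look like. The URLs, timestamps and data shapes are entirely made up for illustration; the point is simply that an authoritative publication date from a structured source, where we have one, could take precedence over the crawl’s capture date when ordering the timeline.

```python
from datetime import datetime, timezone

# Capture records as the crawler sees them: (url, capture timestamp).
# These values are invented for illustration only.
captures = [
    ("http://example.com/news/item-1", datetime(2017, 6, 10, 9, 30, tzinfo=timezone.utc)),
    ("http://example.com/news/item-2", datetime(2017, 6, 12, 14, 5, tzinfo=timezone.utc)),
]

# Publication dates gathered from a structured source such as a social
# media API, keyed by URL (again, illustrative values only).
published = {
    "http://example.com/news/item-2": datetime(2017, 6, 9, 8, 0, tzinfo=timezone.utc),
}

def timeline_date(url, capture_time):
    """Prefer the authoritative publication date if we have one;
    otherwise fall back to the crawler's capture date."""
    return published.get(url, capture_time)

# Order the presented timeline by the best date we know,
# rather than by the order in which the crawler happened to find things.
for url, captured in sorted(captures, key=lambda c: timeline_date(c[0], c[1])):
    print(timeline_date(url, captured).date(), url)
```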

But can a web archive outright lie? For example, can it say something was on the web at a particular time when in truth it was not even written yet?

Well, yes, it certainly could. Digital material is very malleable - if someone modifies it, that often leaves no traces upon the item itself. As digital resources become increasingly important as historical records, we must expect more and more attempts to hack our archives. Obviously, we take steps to prevent ourselves from being hacked, but how on earth do we prove to you that we’ve done a good job? If not getting hacked is our only answer, it just becomes a massive single point of failure. One breach, and the authenticity of every digital item we hold is brought into question.

When building large, distributed computer systems, you have to engineer for failure. When we build large, long-lived information systems, we have to take the same approach. We have to work out how to ensure the historical record is trustworthy, even if our institution is hacked.

A hacker may be able to hack one organisation, but simultaneously and consistently hacking multiple independent organisations is much, much harder. As our historical record becomes born-digital, the libraries and archives of the world are going to have to find ways to support each other, and build a chain of care that is so wide and so entangled, it simply can’t be hacked without a trace.
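As a rough illustration of the idea, here is a small Python sketch. The institution names and page contents are invented, but it shows how independent copies of the same archived record could be compared by their cryptographic digests, so that a copy silently altered at one organisation stands out against the consensus of the others.

```python
import hashlib
from collections import Counter

# Hypothetical copies of the same archived record, each held by an
# independent organisation. The names and payloads are made up.
copies = {
    "Institution A": b"<html>original archived page</html>",
    "Institution B": b"<html>original archived page</html>",
    "Institution C": b"<html>tampered archived page</html>",
}

# Compute a SHA-256 digest for each copy.
digests = {org: hashlib.sha256(data).hexdigest() for org, data in copies.items()}

# Treat the digest shared by the majority of independent copies as the
# consensus; any copy that disagrees is flagged for investigation.
consensus, _ = Counter(digests.values()).most_common(1)[0]

for org, digest in digests.items():
    status = "OK" if digest == consensus else "MISMATCH"
    print(f"{org}: {status} ({digest[:12]}...)")
```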