Collecting Data To Improve Tools

Posted: 2014-11-20

First published on the UK Web Archive blog.

Like many other institutions, we are heavily dependent on a number of open source tools. We couldn’t function without them, and so we like to find ways to give back to those communities. We don’t have a lot of spare time or development capacity to contribute, but recently we have found another way to provide useful feedback.

Large-scale extraction

At the heart of our discovery stack lies Apache Tika, the piece of software we use to try to parse the myriad of data formats in our collection in order to extract the textual representation (along with any useful metadata) that goes into our search indexes. Consequently, we have now executed Apache Tika on many billions of distinct resources, dating from 1995 to the present day. Due to the age and variablity of the content, this often tests Tika to it’s limits. As well as failing to identify many formats, it sometimes simply fails, throwing out an unexpected error, or by getting locked in a infinite loop.

Logging losses

Each of those failures represents a loss – a resource that may never be discovered because we can’t understand it. This may be because it’s malformed, perhaps even damaged during download. It may also be an sign of obsolescence, in that it may indicate the presence of data formats that are poorly understood, and are therefore likely to present a challenge to our discovery and access systems. So, instead of ignoring these errors, we decided to remember them. Specifically, each is logged as a facet of our full-text index, alongside the identity of the resource that caused the problem.

Sharing the results

We’ve been collecting this data for a while, in order to help us tell a broken bitstream from a forgotten format. However, in a recent discussion with the Apache Tika developers, they have indicated that they would also find this data useful as a way of improving the coverage and robustness of their software.

This turns out to be a win-win situation. We store the data we were intending to store anyway, but also share it with the tool developers, who get to improve their software in ways we will be able to take direct advantage of as we run later versions of the tool over our archives in the future.

And it feels good to give a little something back.

Next in series: Towards a Macroscope f... » « Previous in series: What Have We Saved?

Web Archives  webarchive-discovery  Data Mining

Mining Web Archives

Using data-mining techniques to explore, understand and utilise large-scale web archives.

Blog Series

  Building Web Archives 17

  Digital Preservation Lessons Learned 7

  Digital Dark Age 7

  Format Identification 3

  Format Registries 6

  Mining Web Archives 17

Recent Posts


Websites (13) Travels (47) General (1) Development (7) Top Tips (4) Science (7) Rants (3) Top Links (2) Reviews (2) Visualisation (3) Digital Preservation (45) Procrastination (2) Data Mining (16) Open Access (1) Web Archives (35) Representation Information (2) Format Registry (4) SCAPE (3) webarchive-discovery (7) War Stories (1) Preservation Actions (2) BUDDAH (5) Publications (3) Digital Humanities (1) Collaboration (1) Keeping Codes (6) Lessons Learned (6) Reports (5)

Posted: 2014-11-20 | anj


Fighting entropy since 1993

© Dr Andrew N. Jackson — CC-BY