Web Archive Discovery Toolkit

The open source toolkit that powered the AADDA and BUDDAH projects and the UK Web Archive search service is called webarchive-discovery. It can be used to analyse WARC files and to build services that help people search and explore web archives.

ukwa/webarchive-discovery: WARC and ARC indexing and discovery tools (Java).

Analysing WARC files

The toolkit combines an extended version of Apache Tika and WARC and ARC reading software with a number of other data and text analysis systems. Run from a command-line interface or as a Hadoop task, it can parse large volumes of web archives and submit the resulting data to a suitably configured Apache Solr, Elasticsearch or OpenSearch index.
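To give a feel for what "parsing a WARC file" involves, here is a minimal Python sketch that reads records from an uncompressed WARC stream and maps each one to a flat dictionary of the kind a search index could ingest. It is not how webarchive-discovery works internally (the toolkit is written in Java and does far more, including Tika-based content analysis), and the output field names (`url`, `crawl_date`, `record_type`, `content_length`) are illustrative assumptions rather than the toolkit's actual schema. Real-world files are usually gzip-compressed (`.warc.gz`), which this sketch does not handle.

```python
import io


def read_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # skip the blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Content-Length tells us exactly how many payload bytes follow.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


def to_index_document(headers, payload):
    """Map one record to a flat dict. Field names here are hypothetical."""
    return {
        "url": headers.get("WARC-Target-URI"),
        "crawl_date": headers.get("WARC-Date"),
        "record_type": headers.get("WARC-Type"),
        "content_length": len(payload),
    }


# A tiny hand-written record, used only for illustration.
SAMPLE = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.org/\r\n"
    b"WARC-Date: 2014-01-01T00:00:00Z\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
    b"\r\n\r\n"
)

docs = [to_index_document(h, p) for h, p in read_warc_records(io.BytesIO(SAMPLE))]
```

Each dictionary in `docs` stands in for one document that an indexer would submit to the search backend. In the real toolkit this step also extracts text and metadata from the payload itself.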

For more information, please refer to the webarchive-discovery wiki.

Supporting SolrWayback

Because the system creates Apache Solr or Elasticsearch/OpenSearch search indexes, those indexes can be used with any user-interface tool that supports them. This includes general-purpose search interfaces like Blacklight, as shown by the Warclight project.
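Because the result is an ordinary search index, any client that speaks the backend's HTTP API can query it. As a rough sketch for the Solr case, the following Python builds a request URL for Solr's standard `/select` handler. The host, the core name `discovery`, and the field names `url`, `title` and `crawl_date` are assumptions made for illustration; check the schema of your own index for the fields that actually exist.

```python
from urllib.parse import urlencode


def build_select_url(base_url, core, text, rows=10,
                     fields=("url", "title", "crawl_date")):
    """Build a Solr /select URL. Core and field names are hypothetical."""
    params = {
        "q": text,                # the user's query string
        "rows": rows,             # maximum number of hits to return
        "fl": ",".join(fields),   # fields to include in each hit
        "wt": "json",             # ask Solr for a JSON response
    }
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"


select_url = build_select_url("http://localhost:8983/solr", "discovery",
                              "web archive", rows=5)
```

Fetching `select_url` with any HTTP client would return the matching documents as JSON, which is essentially what a front end such as Blacklight does on the user's behalf.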

However, to make the most of the information in the index, a more specialised user interface is needed. Our colleagues at The Royal Danish Library built just such a tool, called SolrWayback.

netarchivesuite/solrwayback: A search interface and wayback machine for the UKWA Solr based warc-indexer framework (Java).

These two tools have proven to be a successful combination, and are now deployed together at various web archives across the world.

See the SolrWayback README for more information, including links to demonstration sites.