The open source codebase that powered the AADDA and BUDDAH projects and all the other UK Web Archive search services is called webarchive-discovery.

It combines an extended version of Apache Tika and WARC and ARC reading software with a number of other data and text analysis systems. Run from a command-line interface or as a Hadoop task, it can parse large volumes of web archives and submit the extracted data to a suitably configured Apache Solr index. For more information, please refer to the webarchive-discovery wiki.
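To make the final step of that pipeline concrete, here is a minimal sketch of submitting one parsed record to Solr's JSON update handler. It is not taken from webarchive-discovery itself: the core name `discovery`, the field names, the sample record, and the helper `build_solr_update` are all assumptions chosen for illustration, and a real index would use whatever schema it was configured with.

```python
import json
import urllib.request


def build_solr_update(solr_base_url, core, docs):
    """Build (but do not send) a POST request for Solr's JSON update handler.

    Hypothetical helper for illustration; not part of webarchive-discovery.
    """
    url = f"{solr_base_url.rstrip('/')}/{core}/update?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed record, standing in for what a WARC reader plus Tika might extract.
# The field names are illustrative, not the real index schema.
record = {
    "id": "20130101000000/http://example.org/",
    "url": "http://example.org/",
    "content_type": "text/html",
    "content": "Example page text",
}

request = build_solr_update("http://localhost:8983/solr", "discovery", [record])

# With a Solr instance running at that address, this would submit the record:
# urllib.request.urlopen(request)
```

Building the request separately from sending it keeps the sketch runnable without a Solr server, and makes it easy to inspect the URL and payload first.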




© Dr Andrew N. Jackson — CC-BY