Blog

Building Tools to Archive the Modern Web

Four years ago, during the 2012 IIPC General Assembly, we came together to discuss the recent and upcoming challenges to web archiving in the Future of the Web Workshop (see also this related coverage on David Rosenthal’s blog). That workshop made it clear that our tools are failing to meet many of these challenges:

  • Database driven features
  • Complex/variable URI formats
  • Dynamically generated URIs
  • Rich, streamed media
  • Incremental display mechanisms
  • Form-filling
  • Multi-sourced, embedded content
  • Dynamic login, user-sensitive embeds
  • User agent adaptation
  • Exclusions (robots.txt, user-agent, …)
  • Exclusion by design (i.e. site architecture intended to inhibit crawling and indexing)
  • Server-side scripts, RPCs
  • HTML5 web sockets
  • Mobile sites
  • DRM protected content, now part of the HTML standard
  • Paywalls

I wish I could stand here and tell you how much great progress we’ve made in the last four years, ticking entries off this list, but I can’t. Although we’ve made some progress, our crawl development resources have been consumed by more basic issues. We knew moving to domain crawling under Legal Deposit would bring big changes in scale, but I’d underestimated how much the dynamics of the crawl workflow would need to change.

News websites are a great example.


Posted: 2016-04-11

Web Archives

Updating our historical search service

Originally published on the UK Web Archive blog on the 15th of February 2016.

Earlier this year, as part of the Big UK Domain Data for the Arts and Humanities project, we released our first ‘historical search engine’ service. We’ve publicised it at IDCC15, the 2015 IIPC GA and at the first RESAW conference, and it’s been very well received. Not only has it led to some excellent case studies that we can use to improve our services, but other web archives have shown interest in re-using the underlying open source code. In particular, some of our Canadian colleagues have successfully launched webarchives.ca, which lets users search ten years’ worth of archived websites from Canadian political parties and political interest groups (see here for more details).

But we remained frustrated, for two reasons. Firstly, when we built that first service, we could not cope with the full scale of the 1996-2013 dataset, and we only managed to index the two billion resources up to 2010. Secondly, we had not yet learned how to cope with more than one or two users at a time, so we were loath to...


Posted: 2016-02-16

Data Mining  Web Archives  BUDDAH

The provenance of web archives

Originally published on the UK Web Archive blog on the 20th of November 2015.

Over the last few years, it’s been wonderful to see more and more researchers taking an interest in web archives. Perhaps we’re even teetering into the mainstream when a publication like Forbes carries an article digging into the gory details of how we should document our crawls in ‘How Much Of The Internet Does The Wayback Machine Really Archive?’


Posted: 2015-11-20

Web Archives  BUDDAH  Data Mining  Digital Preservation

 

Fighting entropy since 1993

© Dr Andrew N. Jackson — CC-BY
