Web Archiving In The JavaScript Age

Posted: 2014-08-11

First published on the UK Web Archive blog.


Among the responses to our earlier post, How much of the UK’s HTML is valid?, Gary McGath’s HTML and fuzzy validity deserves to be highlighted, as it explores an issue very close to our hearts: how to cope when the modern web is dominated by JavaScript.

In particular, he discusses one of the central challenges of the Age Of JavaScript: making sure you have copies of all the resources that are dynamically loaded as the page is rendered. We tend to call this ‘dependency analysis’, and we consider it a much more pressing preservation risk than bit rot or obsolescence. If you never even know you need something, you will never go and get it, and so will never have the chance to preserve it.
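To make that concrete, here is a minimal sketch of what dependency analysis involves: render a page in a headless browser and record every URL it requests along the way, so those resources can be queued for crawling too. Playwright is used purely for illustration, and the function name is our own; this is not the tooling behind our production crawls.

```python
# A minimal sketch of dependency analysis: render a page in a headless
# browser and record every URL it fetches while rendering. Illustrative
# only -- not the rendering service used in our crawls.
from playwright.sync_api import sync_playwright


def find_dependencies(url: str) -> set[str]:
    """Return the set of URLs the page loads while being rendered."""
    requested: set[str] = set()
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # Record every request the page makes (scripts, CSS, XHR, images...)
        page.on("request", lambda req: requested.add(req.url))
        page.goto(url, wait_until="networkidle")
        browser.close()
    # The page itself is not one of its own dependencies
    requested.discard(url)
    return requested


if __name__ == "__main__":
    for dep in sorted(find_dependencies("https://example.org/")):
        print(dep)
```

Everything the static HTML parser misses, but the browser fetches, is exactly the material a traditional crawler would fail to collect.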

The Ubiquity of <script>

To give you an idea of the problem, the following graph shows how the usage of the <script> tag has varied over time:

The percentage of archived pages that use the <script> tag, over time.

In 1995, almost no pages used the <script> tag, but fifteen years later over 95% of web pages included JavaScript. This has been a sea change in the nature of the world wide web, and web archives have had to react to it or face irrelevance.
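For readers curious how a figure like this can be derived, the rough sketch below walks the response records of a WARC file and counts how many HTML pages contain a <script> tag. It uses the warcio library and is an illustrative approximation, not the exact pipeline behind the graph above.

```python
# A rough sketch of measuring <script> usage across a WARC file.
# Illustrative only: the real statistic is computed across the whole
# archived collection, per crawl year.
import re

from warcio.archiveiterator import ArchiveIterator

SCRIPT_TAG = re.compile(rb"<script\b", re.IGNORECASE)


def script_usage(warc_path: str) -> float:
    """Fraction of HTML responses in the WARC that contain a <script> tag."""
    html_pages = 0
    pages_with_script = 0
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            # Only consider HTTP responses that are HTML pages
            if record.rec_type != "response" or record.http_headers is None:
                continue
            content_type = record.http_headers.get_header("Content-Type") or ""
            if "text/html" not in content_type:
                continue
            html_pages += 1
            if SCRIPT_TAG.search(record.content_stream().read()):
                pages_with_script += 1
    return pages_with_script / html_pages if html_pages else 0.0


print(script_usage("example-crawl.warc.gz"))
```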

Rendering While Crawling

For example, the Internet Archive’s Archive-It service has developed the Umbra tool, which uses a browser-testing engine based on Google Chrome to process URLs sent from the Heritrix crawler, extract the additional URLs that the content depends upon, and send them back to Heritrix to be crawled.

We use a similar system during our crawls, including domain crawls. However, rendering web pages takes time and resources, so we don’t render every one of the billions of URLs in each domain crawl. Instead, we render all host home pages, along with the ‘catalogued’ URLs that our curators have indicated are of particular interest. The architecture is similar to that used by Umbra, built around our own page rendering service.
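As a simplified illustration of that selection policy, the snippet below decides whether a URL is worth the cost of a full render: curator-catalogued URLs and host home pages get rendered, everything else does not. The function, set name and example URLs are invented for illustration and are not our crawl’s actual code.

```python
# A simplified sketch of the selective-rendering policy: with billions of
# URLs per domain crawl, only host home pages and curator-catalogued URLs
# are sent to the rendering service.
from urllib.parse import urlsplit


def should_render(url: str, catalogued_urls: set[str]) -> bool:
    """Decide whether a crawled URL should be sent for a full render."""
    if url in catalogued_urls:
        return True  # curators flagged this URL as of particular interest
    parts = urlsplit(url)
    # Treat the root path of each host as its home page
    return parts.path in ("", "/") and not parts.query


# Example: only the home page and the catalogued page are rendered.
catalogued = {"https://www.example.ac.uk/research/project-x"}
for u in ("https://www.example.ac.uk/",
          "https://www.example.ac.uk/news/item-42",
          "https://www.example.ac.uk/research/project-x"):
    print(u, should_render(u, catalogued))
```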

We’ve been doing this since the first domain crawl in 2013, so this seems to be one area where web archives are ahead of Google and its attempts to understand web pages better.

Saving Screenshots

Furthermore, since we have to render the pages anyway, we have taken the opportunity to capture screenshots of the original web pages during the crawl and add them to the archival store (we’ll cover the details in a later blog post). This puts us in a much better position to evaluate any future preservation actions that might be needed to reconstruct the rendering process, and we expect these historical screenshots to be of great interest to the researchers of the future.
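As a rough sketch of what capturing a screenshot at render time involves, the snippet below loads a page in a headless browser and saves a full-page image of how it looked. Again, Playwright, the viewport size and the output filename are illustrative stand-ins; our crawls use their own rendering service and write the images into the archival store.

```python
# A minimal sketch of capturing a screenshot at render time, so the
# page's original appearance is preserved alongside the crawled resources.
from playwright.sync_api import sync_playwright


def save_screenshot(url: str, out_path: str) -> None:
    """Render the page and save a full-page PNG of how it looked."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 1024})
        page.goto(url, wait_until="networkidle")
        page.screenshot(path=out_path, full_page=True)
        browser.close()


save_screenshot("https://example.org/", "example-org-2014-08-11.png")
```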


