Making websites archivable with JavaScript

Table of Contents

A recent comment from the #DHNB2023 conference caught my eye…

If you just follow web archiving friendly standards, you’ll have a really boring website.

Which made me laugh. But it also made me think about how we archive websites now, and I realized this isn’t necessarily true anymore.

Dynamic Dependencies #

The established view is pretty much that JavaScript is bad. More precisely, as the UK National Archives compliance guidelines puts it:

Dynamically-generated page content using client-side scripting cannot be captured.
How to make your website archive compliant: Dynamically-generated content and scripts

This is certainly true in general, but it is also a reflection of the common ways websites are created. Many websites use a content management system like Wordpress or Drupal, where the same system is used for editing and for access, where pages are stored in a database, and where the web pages are usually generated on demand. These systems are often used to provide rich interfaces for searching and browsing, and those are heavily dependent on the back-end database to generate the vast number of possible views.

If your site uses databases to support its functions, these can only be captured in a limited fashion.
How to make your website archive compliant: Database and lookup functions

The addition of client-side JavaScript usually makes this kind of thing worse. Design patterns like the JAM Stack move a lot of the page generation complexity into the client browser as JavaScript, which talks to a back-end API to generate the page content. This gives us two problems, the complex JavaScript and the infinite possibilities of API calls.

But modern patterns in how JavaScript is bundled and deployed often mean we can capture the front-end code. And even when that’s not the case, we have got better at archiving JavaScript, thanks largely to the work of Webrecorder and tools like ArchiveWeb.Page and Browsertrix. In general, it’s not the client JavaScript that is the problem, the problem is that it’s hard to work out what further resources the page is going to need when you use it.

But if JavaScript-heavy pages are explicit about their dependencies, and only read data from a server rather than updating it, they can be archived perfectly well.

Even emulators like JSSpeccy or The Emularity. Even text search systems like ElasticLunr. Or rich faceted and interactive databases like Datasette-Lite. Or online format identification tools like Siegfried JS. Or even, up to a point, fully interactive computational environments like JupyterLite, or Linux!. As long as the dependencies can be determined, web archiving can work.

Use Case: Registry/Directory Sites #

There’s a lot of dead projects and long-lost data-driven sites. Wikis, tool directories, format registries, generated by short-term project funding, which then struggle to stay alive after the project ends. Web archiving can help, but usually rich search and browse interfaces are lost, because the back-end database is needed to make that happen.

Minimal Computing & Static Sites #

The idea of Minimal Computing can help here, particularly static website tools like the WAX digital exhibitions system. WAX takes simple inputs (CSV files and images) and generates static HTML pages, along with a ElasticLunr client-side search system. The search feature require a JSON file, https://minicomp.github.io/wax/search/index.json, which contains the index data. As things are, a web crawler would probably not manage to spot that automatically, but that’s the only problem from a web archiving point of view.

This model, using HTML + JavaScript as a sustainable platform, is a very appealing one. Both HTML and JavaScript are likely to remain a backward-compatible platform for many years, and even if the JavaScript breaks, we only lose the search functionality.

(This point was also explored as a small part of Ed Summer’s excellent post on The Web as a Preservation Medium.)

Richer Sites on Static Assets #

Which all leaves me wondering if it would be useful to build on the WAX approach and develop a toolkit for registry or directory-style web sites? What features could we add?

Database Included #

By bringing in tools like the aforementioned Datasette-lite, or similar systems like sql.js, it would be possible to embed a full copy of the underlying database as a static asset (e.g. an sqlite file). This could be used to power more complex interfaces, like faceted browsing, but could also be downloaded and used directly.

Data Published #

Each item from the underlying database could also be published in machine-readable format as well as HTML pages. e.g. a JSON URL for every page, acting as an API making the raw data easier to access and re-use. (WAX may already do this, but I haven’t managed to confirm this yet.)

Dependencies Declared #

With the WAX example, the only thing preventing easy web archiving was that the index.json dependency wasn’t explicitly declared in the HTML. It would be great to design and promote a standard practice for declaring such dependencies, e.g. as pre-fetch link directives. With those in place, even simple web crawlers would be able to find everything we need to archive the site.

But then, perhaps that’s beside the point. If more people made more sites using techniques that last longer and cost less to sustain, we might hope for a web that needs less archiving.