If you just follow web archiving friendly standards, you’ll have a really boring website.
Which made me laugh. But it also made me think about how we archive websites now, and I realized this isn’t necessarily true anymore.
Dynamically-generated page content using client-side scripting cannot be captured.
How to make your website archive compliant: Dynamically-generated content and scripts
This is certainly true in general, but it is also a reflection of the common ways websites are created. Many websites use a content management system like WordPress or Drupal, where the same system is used for editing and for access, where pages are stored in a database, and where the web pages are usually generated on demand. These systems are often used to provide rich interfaces for searching and browsing, and those are heavily dependent on the back-end database to generate the vast number of possible views.
If your site uses databases to support its functions, these can only be captured in a limited fashion.
How to make your website archive compliant: Database and lookup functions
Even emulators like JSSpeccy or The Emularity. Even text search systems like ElasticLunr. Or rich faceted and interactive databases like Datasette-Lite. Or online format identification tools like Siegfried JS. Or even, up to a point, fully interactive computational environments like JupyterLite, or Linux! As long as the dependencies can be determined, web archiving can work.
Use Case: Registry/Directory Sites
There are a lot of dead projects and long-lost data-driven sites: wikis, tool directories and format registries, created with short-term project funding, which then struggle to stay alive after the project ends. Web archiving can help, but the rich search and browse interfaces are usually lost, because they need the back-end database to work.
Minimal Computing & Static Sites
The idea of Minimal Computing can help here, particularly static website tools like the WAX digital exhibitions system. WAX takes simple inputs (CSV files and images) and generates static HTML pages, along with an ElasticLunr client-side search system. The search feature requires a JSON file, https://minicomp.github.io/wax/search/index.json, which contains the index data. As things are, a web crawler would probably not manage to spot that automatically, but that’s the only problem from a web archiving point of view.
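To make the general pattern concrete, here is a minimal sketch of turning a CSV file into static pages plus a JSON search index. It is not how WAX works internally, and the index is a simple token-to-ids map rather than the ElasticLunr format. The column names (`id`, `title`, `description`), the output paths and the sample rows are all assumptions made up for illustration.

```python
import csv
import html
import io
import json
import re
from collections import defaultdict


def build_site(csv_text):
    """Return {relative_path: file_content} for a tiny static site.

    Assumes hypothetical CSV columns: id, title, description.
    """
    files = {}
    index = defaultdict(list)  # token -> item ids (illustrative format only)
    for row in csv.DictReader(io.StringIO(csv_text)):
        item_id = row["id"]
        title = html.escape(row["title"])
        body = html.escape(row["description"])
        # One static HTML page per CSV row.
        files[f"items/{item_id}.html"] = (
            f"<!doctype html><title>{title}</title><h1>{title}</h1><p>{body}</p>"
        )
        # Add each distinct word on the page to the search index.
        text = f"{row['title']} {row['description']}".lower()
        for token in sorted(set(re.findall(r"[a-z0-9]+", text))):
            index[token].append(item_id)
    # The whole index is itself just another static file.
    files["search/index.json"] = json.dumps(index, sort_keys=True)
    return files


# Made-up sample data.
SAMPLE = """id,title,description
a1,Example Tool,A registry entry about tools
b2,Example Format,A registry entry about formats
"""

site = build_site(SAMPLE)
search_index = json.loads(site["search/index.json"])
```

Everything the site needs ends up as plain files, so the only thing an archive has to do is fetch them all, including `search/index.json`.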
Richer Sites on Static Assets
Which all leaves me wondering if it would be useful to build on the WAX approach and develop a toolkit for registry or directory-style websites? What features could we add?
By bringing in tools like the aforementioned Datasette-lite, or similar systems like sql.js, it would be possible to embed a full copy of the underlying database as a static asset (e.g. an sqlite file). This could be used to power more complex interfaces, like faceted browsing, but could also be downloaded and used directly.
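As a rough sketch of the build side of this idea, the snippet below writes a small SQLite file using Python's standard `sqlite3` module and runs the kind of grouped count that a faceted browsing interface might need. The table name, columns and records are invented for illustration. In a real site the file would be published as a static asset and queried in the browser by something like sql.js, which is not shown here.

```python
import os
import sqlite3
import tempfile


def build_database(path, records):
    """Write records into a single SQLite file that can be served as-is."""
    conn = sqlite3.connect(path)
    try:
        with conn:  # commits on success
            conn.execute("DROP TABLE IF EXISTS items")
            conn.execute(
                "CREATE TABLE items (id TEXT PRIMARY KEY, title TEXT, category TEXT)"
            )
            conn.executemany("INSERT INTO items VALUES (?, ?, ?)", records)
    finally:
        conn.close()


def facet_counts(path, column):
    """Count items per value of one column, e.g. to label facet links."""
    if column not in {"category"}:  # whitelist, since column names can't be bound
        raise ValueError(f"unsupported facet column: {column}")
    conn = sqlite3.connect(path)
    try:
        rows = conn.execute(
            f"SELECT {column}, COUNT(*) FROM items GROUP BY {column} ORDER BY {column}"
        ).fetchall()
    finally:
        conn.close()
    return dict(rows)


# Made-up records: (id, title, category).
RECORDS = [
    ("a1", "Example Tool", "tool"),
    ("a2", "Another Tool", "tool"),
    ("b1", "Example Format", "format"),
]

with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "registry.sqlite")
    build_database(db_path, RECORDS)
    counts = facet_counts(db_path, "category")
```

The same file serves both purposes mentioned above: it can back the interactive interface, and anyone can download it and query it directly.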
Each item from the underlying database could also be published in a machine-readable format as well as in HTML pages. For example, a JSON URL for every page could act as an API, making the raw data easier to access and re-use. (WAX may already do this, but I haven’t managed to confirm this yet.)
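One way this could look, as a sketch only: each item is written out twice, once as HTML and once as JSON, with the HTML page pointing at its JSON twin through a `rel="alternate"` link. The file layout, field names and sample item are assumptions for illustration, not a description of what WAX produces.

```python
import html
import json
import tempfile
from pathlib import Path


def publish_item(out_dir, item):
    """Write <id>.html and <id>.json side by side for one database item."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    item_id = html.escape(str(item["id"]))
    title = html.escape(str(item["title"]))

    # The machine-readable copy: the raw record, unchanged.
    json_path = out_dir / f"{item['id']}.json"
    json_path.write_text(json.dumps(item, indent=2, sort_keys=True), encoding="utf-8")

    # The human-readable copy, which declares where the JSON version lives.
    html_path = out_dir / f"{item['id']}.html"
    html_path.write_text(
        "<!doctype html>\n"
        f'<link rel="alternate" type="application/json" href="{item_id}.json">\n'
        f"<title>{title}</title>\n"
        f"<h1>{title}</h1>\n",
        encoding="utf-8",
    )
    return html_path, json_path


# Made-up sample item.
ITEM = {"id": "a1", "title": "Example Tool", "category": "tool"}

with tempfile.TemporaryDirectory() as tmp:
    page, data = publish_item(Path(tmp) / "items", ITEM)
    page_text = page.read_text(encoding="utf-8")
    round_tripped = json.loads(data.read_text(encoding="utf-8"))
```

Because the JSON files are static too, this "API" costs nothing extra to host and is captured by the same crawl as the pages.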
With the WAX example, the only thing preventing easy web archiving was that the index.json dependency wasn’t explicitly declared in the HTML. It would be great to design and promote a standard practice for declaring such dependencies, e.g. as pre-fetch link directives. With those in place, even simple web crawlers would be able to find everything we need to archive the site.
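To illustrate how little a crawler would need, here is a sketch that reads declared dependencies out of a page using only Python's standard library. The page markup is hypothetical: it shows what a search page might look like if it declared its index with `<link rel="prefetch">`, not what WAX currently emits. The class name and the choice to look only at `prefetch` are likewise assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class DeclaredDependencyParser(HTMLParser):
    """Collect absolute URLs from <link rel="prefetch" href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.dependencies = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        # rel is a space-separated list of tokens, compared case-insensitively.
        rels = set((attr.get("rel") or "").lower().split())
        href = attr.get("href")
        if href and "prefetch" in rels:
            self.dependencies.append(urljoin(self.base_url, href))


# Hypothetical markup for a search page that declares its index file.
PAGE = """<!doctype html>
<html><head>
<link rel="stylesheet" href="site.css">
<link rel="prefetch" href="index.json">
</head><body></body></html>
"""

parser = DeclaredDependencyParser("https://minicomp.github.io/wax/search/")
parser.feed(PAGE)
deps = parser.dependencies
```

A crawler could add every URL in `deps` to its queue alongside the usual links, images and stylesheets, with no need to execute any JavaScript.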
But then, perhaps that’s beside the point. If more people made more sites using techniques that last longer and cost less to sustain, we might hope for a web that needs less archiving.