Having pulled together an initial version of the iPRES proceedings dataset, the next question is: How can we make this data easier to access and use?
And based on our principles, how can we do this in a way that lets people explore what the possible use cases might be, without requiring a lot of up-front development or maintenance? Can we build systems in ways that minimise system costs and delay obsolescence?
For me, part of digital preservation is being on the lookout for tools and technologies that are in the process of becoming widely used standards. The kind of things that become so ubiquitous that they start getting baked into the infrastructure of our digital lives. Like formats that become so widespread that network effects can stave off obsolescence.
Of course, even ubiquitous formats die sometimes. But it is relatively easy to spot new formats and technologies that are likely to have a shelf-life much longer than, say, five years. If only by looking at the ones that have already been around for a decade or two.
## iPRES as a Database
The simplest possible thing I could think of was publishing the iPRES dataset as a literal database, and no database format is more widely used than SQLite (born c. 2000). By supporting complex data structures and rich queries in a single file, SQLite has become one of the most widely deployed pieces of software in the world. Not only is it supported by almost every programming language out there, but there is also a huge range of tools that can work with it.
So, by repackaging the data as SQL tables in an SQLite database, you can load it into something like Apache Superset, Metabase, or Datasette and start exploring. Or use standard Python tools. Or database viewers like this one.
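For example, a minimal sketch of querying the published file with Python's built-in sqlite3 module might look like the following (the file name, table and column names here are assumptions for illustration, not necessarily the actual schema):

```python
import sqlite3

# Open the published SQLite file (file name assumed for illustration).
con = sqlite3.connect("ipres-proceedings.db")
con.row_factory = sqlite3.Row

# A hypothetical query: count contributions per year, assuming a
# 'publications' table with a 'year' column.
rows = con.execute(
    "SELECT year, COUNT(*) AS n FROM publications GROUP BY year ORDER BY year"
).fetchall()

for row in rows:
    print(row["year"], row["n"])

con.close()
```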
I particularly like Datasette, not least because there is a supported version that can run entirely in your browser, called Datasette Lite. The author of Datasette also provides lots of tools and examples for this kind of work, which made it even easier to turn my standardised JSON data into a useful database.
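To give a flavour of what that conversion involves, here is a rough sketch using the sqlite-utils library from the same author; the file names, table name and primary key are assumptions rather than the exact ones used for the real dataset:

```python
import json
import sqlite_utils

# Load the standardised JSON records (file name assumed for illustration).
with open("publications.json", encoding="utf-8") as f:
    records = json.load(f)

# Write them into a SQLite table; sqlite-utils infers the column types.
db = sqlite_utils.Database("ipres-proceedings.db")
db["publications"].insert_all(records, pk="id", replace=True)

# The resulting file can then be opened in Datasette, Datasette Lite, etc.
print(db.schema)
```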
## iPRES as a Website
The SQLite version is great for distributing and analysing the dataset, but I also wanted to explore other use cases. In part, I wanted something that was a bit simpler to search, as the Datasette user interface is quite complex and takes a while to start up. But more importantly, I wanted to explore the idea of giving the iPRES Proceedings some kind of consistent web presence.
As well as making it easy to browse and search the collection, this could expose the collected metadata in forms that other web-oriented tools can use. For example, by following these rules for embedded metadata, it’s possible to find a page for an article and then import it into citation management tools like Zotero. This is something that does not appear to be supported by the platforms that are hosting the proceedings, but is relatively easy to add as a discovery layer on top of that infrastructure.
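To illustrate the kind of embedded metadata involved, the sketch below generates Highwire Press-style citation_* meta tags, which is one of the forms of embedded metadata that Zotero can pick up from a page; the record structure and field values are made up for the example:

```python
from html import escape

# A hypothetical proceedings record; the field names are illustrative.
record = {
    "title": "An Example iPRES Contribution",
    "authors": ["First Author", "Second Author"],
    "year": "2019",
    "pdf_url": "https://example.org/ipres2019/example.pdf",
}

# Build Highwire Press style meta tags, which citation managers like
# Zotero can read from a page's <head>.
tags = [f'<meta name="citation_title" content="{escape(record["title"])}">']
tags += [
    f'<meta name="citation_author" content="{escape(a)}">'
    for a in record["authors"]
]
tags.append(f'<meta name="citation_publication_date" content="{record["year"]}">')
tags.append(f'<meta name="citation_pdf_url" content="{record["pdf_url"]}">')

print("\n".join(tags))
```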
Better still, that same embedded metadata can also be used by search engines. Giving every iPRES contribution a structured web page should make it possible to find all of iPRES via Google, and perhaps fill in the gaps in literature search tools like Internet Archive Scholar, Google Scholar, OpenAlex and BASE.
Again, to do this with as little overhead and maintenance load as possible, I looked to standard formats and widespread tools. Static websites have become very widely used in recent years (see e.g. the Jamstack), including the practice of storing structured and unstructured information together in the form of Markdown files (c. 2004) with structured ‘front matter’ to hold the data (c. 2008), usually stored on a version control platform like GitHub (c. 2007).
While there are some minor variations between the handful of different flavours of Markdown, these are not too difficult to cope with. Certainly not enough to offset the huge range of tools that support the format.
And with this established format as our medium, we can avoid running a complicated stack to handle editing and publishing the data. We can use any combination of tools or content management systems that support updating Markdown files in GitHub (or just edit the files directly). We can also use any of the large number of tools that support generating websites from Markdown files. Either side of this process can change over time, without affecting the other.[^1]
```mermaid
flowchart LR
    data[[Publications\nData]]
    data -- Markdown\nGenerator --> store
    edit[[Additions\n& Corrections]]
    edit -- Markdown\nEditor --> store
    store[[Markdown\nin GitHub]]
    store -- Static Site\nGenerator --> www
    www[[HTML & JS\nWebsite]]
```
At this point, we are focusing on generating the proceedings files from source data, so that doesn’t need any kind of user interface. It’s just code.
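As a rough illustration (not the actual code behind the site), a generator along these lines could read the standardised JSON records and write one Markdown file per publication, with the structured data held in YAML front matter; the field names, layout name and output directory are assumptions:

```python
import json
from pathlib import Path

import yaml  # PyYAML

# Load the standardised JSON records (file name assumed for illustration).
records = json.loads(Path("publications.json").read_text(encoding="utf-8"))

out_dir = Path("_publications")
out_dir.mkdir(exist_ok=True)

for record in records:
    # Structured data goes in the YAML front matter; any free text
    # (e.g. an abstract) becomes the body of the Markdown file.
    front_matter = {
        "layout": "publication",  # assumed Jekyll layout name
        "title": record.get("title"),
        "authors": record.get("authors", []),
        "year": record.get("year"),
        "keywords": record.get("keywords", []),
    }
    body = record.get("abstract", "")

    markdown = f"---\n{yaml.safe_dump(front_matter, sort_keys=False)}---\n\n{body}\n"
    (out_dir / f"{record['id']}.md").write_text(markdown, encoding="utf-8")
```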
On the website-generation side, I decided to start with perhaps the oldest and most widely known and used static site generator, Jekyll. I’ve used it before, it’s very flexible and capable, and it’s already fairly well known in the library and archives sector (see, for example, these Code4Lib Journal papers).
It also has quite a wide range of off-the-shelf themes available. This is important at this stage in this project, because I want people to be able to play with the ideas and explore the use cases without having to do a huge amount of up-front work to get there. It’s fine to have an interface that’s a bit clunky or basic, as long as it’s good enough to play with and learn from.
I spent a little time looking at some of the available documentation themes, because documentation sites are a common pattern that can be adapted to fit the shape of our proceedings data. This eventually led me to the Just The Docs theme, which looks good, appears to have decent accessibility features, supports two-level hierarchies, provides some useful layout tools and standard components, and has a built-in browser-based search system that is easy to customise.
## The DigiPres Publications Prototype
Bringing the web version and the database together, we finally get to the first release of the DigiPres Publications Index, available at: https://www.digipres.org/publications/
I’d be very keen to hear any and all feedback about these ideas, via email ([email protected]) or any of the contact icons on my homepage.
The database version seems to have been particularly successful: a number of people have got in touch to say that it helped them find useful publications from the past. And quite a few people have had things to say about the (f)utility of the keywords associated with publications, which have been used in very different ways over the years.
The iPRES Steering Group and people involved with both iPRES 2023 and iPRES 2024 have also been in touch to support the work and ask how they might be able to help. This is very much appreciated, but we think a little more exploration and feedback is needed first. In the next post, we’ll look at some of the options for how things might go in the future, and what would be needed to support them.
[^1]: See *How do you cut a monolith in half* for a detailed discussion of how establishing clear protocols can help build more manageable systems.