Revisiting Web Rings
Recently, for no particular reason I’m sure, there seems to have been a renewed interest in more distributed and community-oriented ways of finding good stuff on the web. In particular, the ancient idea of web rings seems to keep coming back (as illustrated by the webring page on the IndieWeb wiki).
I particularly like the look of this Webring Kit, but a recent Twitch stream by @whitep4nth3r about The Claw Webring made me realise I’d forgotten exactly how web rings used to look and function back in the 90s.
Can I use web archives to find out?
Posted: 2022-12-17
Websites
UK Web Archive Technical Update - Autumn 2022
This is a summary of what’s been going on since the update at the start of the summer.
Posted: 2022-10-18
Reports
Web Archives
UK Web Archive Technical Update - Summer 2022
Following on from the last quarterly update, we’ve been able to make some good progress despite being understaffed during this period.
Posted: 2022-07-11
Reports
Web Archives
UK Web Archive Technical Update - Spring 2022
This is a summary of what’s been going on since the last update, at the start of the year.
Posted: 2022-05-17
Reports
Web Archives
UK Web Archive Technical Update - Winter 2022
During the last quarter of 2021, the technical services that make up the web archive underwent a lot of changes behind the scenes. These changes should help us to improve our services, so it’s worth explaining a little about what’s been going on.
Posted: 2022-01-06
Reports
Web Archives
Continuous, incremental, scalable, higher-quality web crawls with Heritrix
Abstract
Under Legal Deposit, our crawl capacity needs grew from a few hundred time-limited snapshot crawls to the continuous crawling of hundreds of sites every day, plus annual domain crawling. We have struggled to make this transition, as our Heritrix set-up was cumbersome to work with when running large numbers of separate crawl jobs, and the way it managed the crawl process and crawl state made it difficult to gain insight into what was going on and harder still to augment the process with automated quality checks. To address this, we have combined three main tactics: we have moved to containerised deployment, reduced the amount of crawl state exclusively managed by Heritrix, and switched to a continuous crawl model where hundreds of sites can be crawled independently in a single crawl. These changes have significantly improved the quality and robustness of our crawl processes, while requiring minimal changes to Heritrix itself. We will present some results from this improved crawl engine, and explore some of the lessons learned along the way.
Introduction
Since we shifted to crawling under Legal Deposit in 2013, the size and complexity of our crawling has massively increased. Instead of crawling a...
Posted: 2018-11-13
Web Archives
Digital Preservation