Blog

Revisiting Web Rings

Recently, for no particular reason I’m sure, there seems to have been a renewed interest in more distributed and community-oriented ways of finding good stuff on the web. In particular, the ancient idea of web rings seems to keep coming back (as illustrated by the webring page on the IndieWeb wiki).

I particularly like the look of this Webring Kit, but a recent Twitch stream by @whitep4nth3r about The Claw Webring made me realise I’d forgotten exactly how web rings used to look and function back in the 90s.

Can I use web archives to find out?

Posted: 2022-12-17

Websites

Continuous, incremental, scalable, higher-quality web crawls with Heritrix

Abstract

Under Legal Deposit, our crawl capacity needs grew from a few hundred time-limited snapshot crawls to the continuous crawling of hundreds of sites every day, plus annual domain crawling. We have struggled to make this transition: our Heritrix set-up was cumbersome to work with when running large numbers of separate crawl jobs, and the way it managed the crawl process and crawl state made it difficult to gain insight into what was going on, and harder still to augment the process with automated quality checks. To address this, we have combined three main tactics: we have moved to containerised deployment, reduced the amount of crawl state exclusively managed by Heritrix, and switched to a continuous crawl model where hundreds of sites can be crawled independently in a single crawl. These changes have significantly improved the quality and robustness of our crawl processes, while requiring minimal changes to Heritrix itself. We will present some results from this improved crawl engine, and explore some of the lessons learned along the way.

Introduction

Since we shifted to crawling under Legal Deposit in 2013, the size and complexity of our crawling has massively increased. Instead of crawling a...

Posted: 2018-11-13

Web Archives, Digital Preservation

 

Fighting entropy since 1993

© Dr Andrew N. Jackson — CC-BY
