What makes a large website large?

Posted: 2023-06-09 | My WARCs XL

Recently, the Library of Congress’s excellent Web Archive Team sent out a request to gather information about how web archives cope with large websites. One of the questions was about how we define what makes a website “large”. This is my attempt to answer that question, and outline how we’ve adapted our crawling to cope with large sites…

Of course, the primary factors that make a site “large” are the number of URLs to be archived, and the sizes of those resources. However, as we don’t archive sites that use a lot of audio and video, we find that size matters less than the number of URLs.

But the crawl is critically dependent on two additional factors. How fast can we crawl? And how quickly does the content change?

Usually, to be polite to websites, we restrict our crawlers to downloading one URL at a time, and at a rate of no more than one or two URLs a second. Therefore, a site becomes “large” when it is no longer possible to crawl it in a “reasonable” amount of time.
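To make that concrete, the core of a polite crawl is little more than the loop sketched below. This is an illustrative Python sketch with a placeholder delay and URL list, not our actual crawler (which is Heritrix):

```python
import time
import urllib.request

CRAWL_DELAY_SECONDS = 1.0  # illustrative politeness delay: roughly one URL per second

def polite_crawl(urls):
    """Fetch URLs one at a time, pausing between requests to stay polite."""
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=30) as response:
                body = response.read()
                # A real crawler would write the response to a WARC file and
                # extract further links here; this sketch just notes the size.
                print(f"{url}: {len(body)} bytes")
        except Exception as err:
            print(f"{url}: failed ({err})")
        time.sleep(CRAWL_DELAY_SECONDS)
```

At one URL per second, a site with five million URLs would take nearly two months of continuous crawling to work through, before allowing for retries or queue management.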

The classic example for us is a news site like BBC News. That site has a LOT of URLs, and as such a full crawl can take weeks or months. Which brings us to the last factor - the rate of change of the site itself.

Traditional crawling workflows are often based around setting overall crawl frequencies: e.g. we crawl this site once a day, that other site once a month, and so on. What frequency should we use for a news site?
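At its simplest, such a workflow is just a table of seeds and re-crawl intervals, along the lines of the hypothetical sketch below (the site names and intervals are invented for illustration):

```python
from datetime import datetime, timedelta

# Hypothetical crawl schedule: each seed is paired with a re-crawl interval.
SCHEDULE = {
    "https://example-news-site.example/": timedelta(days=1),
    "https://small-society.example/": timedelta(days=30),
    "https://gov-department.example/": timedelta(days=90),
}

def due_for_crawl(last_crawled: dict[str, datetime], now: datetime) -> list[str]:
    """Return the seeds whose re-crawl interval has elapsed."""
    return [
        seed
        for seed, interval in SCHEDULE.items()
        if now - last_crawled.get(seed, datetime.min) >= interval
    ]
```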

Clearly, at least once per day, because the site keeps changing! But we can’t possibly crawl the whole site in a day! We need to run the crawl for WEEKS!

In short, we know a site is “large” when we hit this conflict between freshness and completeness.

The simplest way to deal with the resulting risk of temporal incoherence is to run two crawls: a shallow, frequent crawl to get the most recent material, and a longer-running, deeper crawl to gather the rest. However, this does risk overloading the websites themselves, and the total volume of crawling may be considered impolite.
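In configuration terms, that might look like two job definitions against the same seed, with the shallow job capped by depth and page count (all the numbers here are invented, purely to illustrate the shape of the idea):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CrawlJob:
    seed: str
    interval_days: int
    max_depth: Optional[int] = None   # None = no depth limit
    max_pages: Optional[int] = None   # None = no page limit

SEED = "https://example-news-site.example/"

# A shallow, frequent crawl for freshness...
shallow = CrawlJob(SEED, interval_days=1, max_depth=2, max_pages=5_000)

# ...and a deep, slow crawl for completeness.
deep = CrawlJob(SEED, interval_days=90)

# The politeness problem: whenever both jobs are running, the target site
# sees the combined request rate of two independent crawlers.
```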

This is why, for the UK Web Archive, we modified the behaviour of the Heritrix crawler so that crawls can run for many weeks while seeds and sitemaps are re-fetched daily, keeping those URLs up to date and ensuring any new URLs are discovered. This keeps the overall crawl rate steady and controlled, but allows the ‘fresh’ URLs to be prioritised.
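Heritrix is Java and the real change was to its frontier and scheduling behaviour, but the essence of the idea can be sketched like this: a long-lived queue where seeds and sitemap URLs are re-injected once a day at a higher priority than ordinarily discovered URLs (an illustrative simplification, not the actual UK Web Archive code):

```python
import heapq
import time

SEED_PRIORITY = 0            # re-fetched daily, jumps to the front of the queue
DISCOVERED_PRIORITY = 10     # everything else waits its turn
RESEED_INTERVAL = 24 * 3600  # one day, in seconds

class LongRunningFrontier:
    """A long-lived crawl queue that re-prioritises seeds and sitemaps daily."""

    def __init__(self, seeds):
        self.seeds = list(seeds)
        self.queue = []          # heap of (priority, sequence, url)
        self.seen = set()
        self.counter = 0
        self.last_reseed = 0.0

    def _push(self, url, priority):
        self.counter += 1
        heapq.heappush(self.queue, (priority, self.counter, url))

    def add_discovered(self, url):
        """Queue a newly discovered URL at normal priority."""
        if url not in self.seen:
            self.seen.add(url)
            self._push(url, DISCOVERED_PRIORITY)

    def next_url(self):
        """Return the next URL to fetch, re-injecting seeds once a day."""
        now = time.time()
        if now - self.last_reseed >= RESEED_INTERVAL:
            self.last_reseed = now
            for seed in self.seeds:
                self._push(seed, SEED_PRIORITY)
        if self.queue:
            _, _, url = heapq.heappop(self.queue)
            return url
        return None
```

Because the same single, rate-limited fetch loop drains this queue, the load on the site stays constant; only the order in which URLs are fetched changes.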

With hindsight, it is less clear whether this elegance is worth the additional complexity and customisation of the crawler. As we look at how we might adopt Browsertrix Crawler/Cloud in the future, rather than make the tools more complex, perhaps it’s better to risk being impolite and just run parallel crawls?
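The ‘impolite but simple’ option might be little more than launching the shallow and deep jobs as separate Browsertrix Crawler containers at the same time. The sketch below is an assumption-laden illustration: the image name and CLI flags are written from memory and should be checked against the current Browsertrix Crawler documentation.

```python
import subprocess

SEED = "https://example-news-site.example/"  # hypothetical target site

def launch_crawl(collection, extra_args):
    """Start a Browsertrix Crawler container (flag names are assumptions)."""
    cmd = [
        "docker", "run", "--rm",
        "-v", f"{collection}-crawls:/crawls",
        "webrecorder/browsertrix-crawler", "crawl",
        "--url", SEED,
        "--collection", collection,
    ] + extra_args
    return subprocess.Popen(cmd)

# Shallow-but-frequent and deep-but-slow crawls of the same site, running in
# parallel: simpler than a customised frontier, but the site sees both at once.
shallow = launch_crawl("news-daily", ["--depth", "2"])
deep = launch_crawl("news-deep", [])
```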
