The process starts when we load the seeds in using the Strategy Worker:
python -m frontera.worker.strategy --config huntsman.config.sw --add-seeds --seeds-url file:///Users/andy/absolute-path/seeds.txt
This scores the seed URLs and places them on the `frontier-score` queue. A DB worker processes these incoming, scored URLs:
python -m frontera.worker.db --config huntsman.config.dbw --partitions 0 1 --no-batches
This reads the `frontier-score` queue and pushes the content into the `queue` table of the database. A separate DB worker:
python -m frontera.worker.db --config huntsman.config.dbw --partitions 0 1 --no-incoming
…reads the `queue` table and breaks the prioritised queue down into batches to be sent to the crawlers, posting them onto the `frontier-todo` queue. For each partition we have a crawler instance:
scrapy crawl crawl-test-site -L INFO -s SPIDER_PARTITION_ID=0
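To make the partitioning concrete, here is a minimal sketch of how URLs might be assigned to spider partitions by hashing the host name, so that all the URLs for one site end up with one crawler. This is an illustration of the general idea, not Frontera’s actual partitioning code.

```python
# A sketch (not Frontera's actual code) of host-based partitioning: hash the
# host name so that all URLs for one site land on the same spider partition.
from hashlib import sha1
from urllib.parse import urlparse

NUM_PARTITIONS = 2  # matches the "--partitions 0 1" used above

def partition_for(url: str) -> int:
    host = urlparse(url).netloc
    return int(sha1(host.encode("utf-8")).hexdigest(), 16) % NUM_PARTITIONS

print(partition_for("http://example.com/page/1"))  # 0 or 1
print(partition_for("http://example.org/"))        # 0 or 1
```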
The spiders download the URLs and extract the links. The results are posted onto the `frontier-done` queue as a stream of different events. There are `page-crawled` events, `links-extracted` events (where one message lists all the URLs from one response), and `offset` events that indicate where the spiders have got to in the queue partition they are processing.
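As a rough illustration of what consuming that mixed stream involves, each consumer ends up dispatching on the event type. The field names and handlers below are invented for the sketch, not Frontera’s actual message format:

```python
import json

# Toy handlers standing in for the real state updates (names invented):
def update_metadata(fingerprint, status):
    print(f"metadata: {fingerprint} -> {status}")

def score_and_enqueue(url):
    print(f"frontier-score: {url}")

def record_offset(partition_id, offset):
    print(f"offset: partition {partition_id} at {offset}")

def handle_frontier_done(raw_message: bytes):
    event = json.loads(raw_message)
    kind = event["type"]  # hypothetical discriminator field
    if kind == "page_crawled":
        update_metadata(event["fingerprint"], event["status_code"])
    elif kind == "links_extracted":
        for url in event["links"]:  # one message lists all URLs from one response
            score_and_enqueue(url)
    elif kind == "offset":
        record_offset(event["partition_id"], event["offset"])
    else:
        raise ValueError(f"unknown frontier-done event: {kind}")

handle_frontier_done(b'{"type": "offset", "partition_id": 0, "offset": 42}')
```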
(AFAICT) the DB workers and the Strategy Worker:
python -m frontera.worker.strategy --config huntsman.config.sw --partitions 0 1
…all read the `frontier-done` queue, and update the state they are responsible for accordingly. Tasks that get done are:

- The `page-crawled` events are used to update the `metadata` table to reflect that the URLs have been downloaded. (Incoming DB worker?)
- The `offset` events are used to keep track of where the spiders have got to. (Batching DB worker)
- The `links-extracted` events are scored by the Strategy Worker, and the new URLs are posted onto the `frontier-score` queue, and the cycle continues.

Overall, Frontera has lots of good ideas to learn from, but is also somewhat confusing, and the documentation appears to be out of date (probably just by one release). Using different message types in a single stream is rather clumsy – both Kafka’s design (e.g. keys and compaction) and my own preference lean towards having separate queues for different message types.
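For comparison, here is a minimal sketch of the alternative layout I have in mind: one Kafka topic per message type, with messages keyed (e.g. by URL fingerprint) so that compaction can work. The topic names, fields, and the use of kafka-python are my own illustration, assuming a broker on localhost; this is not anything Frontera provides.

```python
import hashlib
import json
from kafka import KafkaProducer  # kafka-python; assumes a broker on localhost:9092

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=str.encode,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

url = "http://example.com/"
fingerprint = hashlib.sha1(url.encode("utf-8")).hexdigest()

# One topic per message type (names invented for this sketch), keyed so that
# Kafka log compaction can keep only the latest state per URL or partition:
producer.send("crawl.pages-crawled", key=fingerprint, value={"url": url, "status": 200})
producer.send("crawl.links-extracted", key=fingerprint, value={"links": ["http://example.com/a"]})
producer.send("crawl.offsets", key="partition-0", value={"offset": 42})
producer.flush()
```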
Here at the UK Web Archive, we’re very fortunate to be able to work in the open, with almost all of our code on GitHub. Some of our work has been taken up and re-used by others, which is great. We’d like to encourage more collaboration, but we’ve had trouble dedicating time to open project management, and our overall management process and our future plans are unclear. For example, we’ve experimented with so many different technologies over the years that our list of repositories gives little insight into where we’re going next. There are also problems with how issues and pull requests have been managed, often languishing while they wait for us to get around to looking at them. This also applies to the IIPC repositories and other projects we are involved in, as well as the projects we lead.
I wanted to block out some time to deal with these things promptly, but also to find a way of making it a bit more, well, a bit more fun. A bit more social. Some forum where we can chat about our process and plans without the formality of having to write things up.
Taking inspiration from Jason Scott’s live-streamed CD-ripping sessions, we came up with the idea of Open Source Office Hours – some kind of open video conference or live stream, where we’ll share our process, discuss issues relating to open source projects, and have a forum where anyone can ask questions about what we’re up to. This should raise the profile of our open source work both inside and outside our organisation, and encourage greater adoption of, and collaboration around, open source tools.
All welcome, from lurkers to those brimming with burning questions. Just remember that being kind beats being right.
Anyone else who manages open source projects like ours is also welcome to join and take the lead for a while! I can only cover the projects we’re leading, but there’s many more.
The plan is to launch the first Office Hours session on the 22nd of May, and then hold regular weekly slots every Tuesday from then on. We may not manage to run it every single week, but if it’s regular and frequent that should mean we can cope more easily with missing the odd one or two.
On the 22nd, we will run two sessions – one in the morning (for the west-of-GMT time zones) and one in the evening (for the eastern half). Following that, we intend to switch between the two slots, making each a.m. and p.m. slot a fortnightly occurrence.
For more details, see the IIPC Trello Board card
At the UK Web Archive, we believe in working in the open, and that organisations like ours can achieve more by working together and pooling our knowledge through shared practices and open source tools. However, we’ve come to realise that simply working in the open is not enough – it’s relatively easy to share the technical details, but less clear how to build real collaborations (particularly when not everyone is able to release their work as open source).
To help us work together (and maintain some momentum in the long gaps between conferences or workshops) we were keen to try something new, and hit upon the idea of Online Hours. It’s simply a regular web conference slot (organised and hosted by the IIPC, but open to all) which can act as a forum for anyone interested in collaborating on open source tools for web archiving. We’ve been running for a while now, and have settled on a rough agenda:
This gives the meetings some structure, but is really just a starting point. If you look at the notes from the meetings, you’ll see we’ve talked about a wide range of technical topics, e.g.
The meeting is weekly, but we’ve attempted to make it more inclusive by alternating the specific time between 10am and 4pm (GMT). This doesn’t catch everyone who might like to attend, but at the moment I’m personally not able to run the call at a time that might tempt those of you on Pacific Standard Time. Of course, I’m more than happy to pass the baton if anyone else wants to run one or more calls at a more suitable time.
If you can’t make the calls, please consider:
My thanks go to everyone who has come along to the calls so far, and to the IIPC for supporting us while still keeping it open to non-members.
Maybe see you online?
Technology Strategy
https://www.iota.org/research/meet-the-tangle Blockchains etc
The CMX 600 of Digital Content Management
In 1971 the CMX 600 began a revolution by being the very first non-linear video editing system.
(see snapshots on desktop)
There are two hard things in computer science: cache invalidation, naming things, and off-by-one errors. [@codinghorror](https://twitter.com/codinghorror/status/506010907021828096?lang=en)
PDF encryption as a ‘lessons learned’ blog post
Lessons Learned
It starts with Save As…
Google Doc does not fit in OAIS
DigiPres and the big niches - 3D, maths, ?
Serialisation, marshalling, pickling, parsing,
What a good file format looks like (from a preservation perspective):
Web Packaging Exists!
Language as critical as the IT/DP boundary is tense
Format obsolescence is not as urgent and terrible as advertised
Bit preservation is not solved
We don’t need registries for truth
We don’t need registries for what to do, we need each other. It takes a village…
Prescriptive vs Descriptive Linguistics: If we only accept valid PDFs, we are saying: we know best; we understand PDF and we know what’s worth preserving. In this analogy: we know what’s best; we understand the Queen’s English, and any documents will need to be translated before we archive them HERE. Speculation on easy/‘preservable’ formats and correlated use with different social/economic subgroups.
Every format has two faces/parsing is hard
When you press save, what happens? Lots of decisions. It’s not just the document. What about the undo buffer? What about the window position? What about the language? These can be at installation, user, or file level; it depends.
Format standardisation is precisely a designated community coming together to determine which properties of a Digital (Performance) Object they want to preserve over time.
Compression is Fraught: I wonder if some of those opposed to compression also avoid using databases as master stores for critical metadata?
Sheridan Code, versus the Models. Man made, use the language of what it is. No (further) abstraction necessary and in fact it gets in the way. NLA model only good because it gets closer.
Herbert CNI talk. Scholarly communication fragments. Atmosphere at iDCC? How physics worked at different places. How biology worked. Thoughts on adoption. Thoughts on costs.
OAIS is for talking about digital preservation, not for doing it. I think OAIS is better suited to talking about doing digital preservation than helping get preservation done. It deliberately floats above
Flexibility versus the principle of least power
SIPs as DIPs etc https://twitter.com/euanc/status/922520776384962560 But it’s not a DIP if it’s not leaving your org
MPLP and layers?
But constant work is required to generate the illusion of a stable image. Fifty times a second, the ULA reads the screen memory and sends the result out to the television screen, and this process is interlaced with the data loading as the whole thing hums along at 3.5MHz (about a thousand times slower than a modern machine).
OODT is closest thing to ref impl of OAIS
A light-weight pre-premis/mets would be very useful. Extend bagit with orthogonal data in linewise files
Open source and digital preservation
Poor cohesion of reading lists
More automation possibilities, e.g. UI-scripting Acrobat Reader to test, i.e. simulating user interactions in
Validation, nope
Validation, fast and thorough
How to help practitioners help?! http://anjackson.net/2016/06/08/frontiers-in-format-identification/#comment-2723081343
Re-deriving significant properties
Note that OAIS Ingest is an odd notion
Normalisation as prejudice
PRONOM can be embedded in mime tree
It’s also been interesting to compare the web archiving community with the broader digital preservation community. There are many familiar faces due to the strong overlap between the fields, but there’s also a stronger sense of a unified vision, a preference for practical results, and a more constructive collaboration between researchers and content-holding organisations. On the flip side, there is something of a silo effect, where web archive material is often poorly integrated into the bigger picture, both in the abstract (e.g. the bigger picture of digital preservation) and the concrete (e.g. we’re only just learning how to integrate web archives with our other holdings).
PDF encryption is weird
# dd if=/dev/zero bs=1M count=1000 | openssl dgst -sha512
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 3.94019 s, 266 MB/s
(stdin)= a7d483bb9af2ca4b064420d1911d9116b6b609ca312fd7ed919fc1b8be7d1eb57c46f2a6f13380b6dc38f024d17442b4c7b8ecb8c121dc88227d588fc2e04297
# hdparm -tT /dev/sda1
/dev/sda1:
Timing cached reads: 17696 MB in 2.00 seconds = 8861.16 MB/sec
Timing buffered disk reads: 332 MB in 3.01 seconds = 110.42 MB/sec
[root@crawler06 python-shepherd]# dd if=/dev/zero bs=1M count=1000 > /dev/null
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 0.0997883 s, 10.5 GB/s
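The same comparison can be sketched in Python, if that is more convenient than shelling out. This assumes Python 3 and just hashes a gigabyte of zeros, mirroring the dd | openssl pipe above:

```python
import hashlib
import time

CHUNK = 1024 * 1024   # 1 MiB blocks, as in the dd example
COUNT = 1000          # 1000 blocks of zero bytes, ~1 GB

buf = bytes(CHUNK)    # a 1 MiB buffer of zeros
digest = hashlib.sha512()
start = time.perf_counter()
for _ in range(COUNT):
    digest.update(buf)
elapsed = time.perf_counter() - start

print(digest.hexdigest())  # should match the openssl result above
print(f"{CHUNK * COUNT / elapsed / 1e6:.0f} MB/s")
```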
MP3
Is it possible? i.e.
Even if it is possible, is it feasible?