Using Web Archives To Improve Preservation Tools

We use Web Archive data to drive tool development.

According to our experiments, there are at least two thousand distinct formats in our historical web archive, and at least three thousand if you distinguish between formats (e.g. HTML) and their specific versions and character sets (e.g. UTF-8). It’s worth pointing out that we don’t know exactly how precise our identification tools are, and that a small proportion of the content remains unidentified, so we don’t yet know exactly how many formats there are. However, the contents of the archives are dominated by HTML, web image formats, PDF, Office documents and so on, and in the ‘tail’ of the distribution, thousands of formats account for a very small fraction of the content (we are not sure exactly how small, but certainly less than 1%). Nevertheless, just because a format is rare, we cannot assume it is of little value, and I wonder if that is where the real preservation challenge for web archives lies. The ‘big’ formats look after themselves, and require little effort to preserve over moderate timescales. But in the tail of the format distribution, formats are less likely to survive without our intervention, and so we need to be able to work out where to invest our efforts.

“I ended up passing my snowballAnalyzer and standardAnalyzers as parameters to ShingleFilterWrappers and processing the outputs via a TermVectorMapper.”

n.b. the ‘long tail’ of file extensions in the JISC data set is a foul and hideous mess:

  • Need to strip %xx escapes from extensions, e.g. .php%3ffrom=1170
  • Seems #fragments are not being caught/stripped?
  • Also ‘@’, ‘=’, ‘$’ belong among the ‘unlikely to be valid extension characters’ posse.
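As a rough sketch, the cleanup described above might look like the following (the character list, the ten-character cap and the alphanumeric check are assumptions on my part, not a tested rule set):

```python
import re

# Drop everything from the first character that is unlikely to appear in a
# valid extension: %xx escapes, #fragments, query remnants, '@', '=', '$'.
UNLIKELY = re.compile(r'[%#?@=$].*$')

def clean_extension(ext):
    """Reduce a raw 'extension' string to a plausible one, or None."""
    ext = ext.lower().lstrip('.')
    # e.g. 'php%3ffrom=1170' -> 'php', 'html#top' -> 'html'
    ext = UNLIKELY.sub('', ext)
    # Keep only short, purely alphanumeric results (assumed heuristic):
    if ext and len(ext) <= 10 and ext.isalnum():
        return ext
    return None
```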

In the ‘null’ set:

As of 15:11 on 24 Sept 2013: 295,991,273; at 16:16: 295,991,273.

Also cf. Testing Software Tools of Potential Interest for Digital Preservation Activities at the National Library of Australia.


Mining for Signatures

Starting with unidentified formats (content_type:"application/octet-stream"), we can script a series of queries for different extensions that attempt to build plausible signatures for each, based on the FFB (first four bytes):

  • content_type:"application/octet-stream": 146,541
  • content_type:"null": 351,779, almost all OLE2 with a few ZIP.
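The field names (content_type, content_type_ext, content_ffb) are those used above; the Solr endpoint is a placeholder. A script along these lines could generate the per-extension facet queries:

```python
from urllib.parse import urlencode

# Placeholder Solr endpoint; adjust to taste.
SOLR = "http://localhost:8983/solr/warc/select"

def ffb_facet_query(extension):
    """Build a Solr query that facets on the first-four-bytes (FFB) of all
    unidentified resources with a given extension, so the commonest values
    can be inspected as candidate signatures."""
    params = {
        'q': 'content_type:"application/octet-stream"',
        'fq': 'content_type_ext:".%s"' % extension,
        'rows': 0,
        'facet': 'true',
        'facet.field': 'content_ffb',
        'facet.mincount': 10,
        'wt': 'json',
    }
    return SOLR + '?' + urlencode(params)
```

Looping this over the most frequent unknown extensions reproduces the listings below.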

  • 11,109 .s5, with P%000010 or L%000010 as FFB
  • 5,302 .ipx, with very similar FFB, sb%xx01/2/6
  • 4,394 .nwc, all with ‘[Not’ FFB, with a few ‘[NWZ’
  • 3,822 .dbf, with mostly ‘OPLD’ FFB and some other similar binary ones
  • 2,482 .adx, with largely no FFB?

Psion 5

Starting from ‘application/octet-stream’, the most common unknown extension was .s5; this had consistent magic (two FFB variants).

Firefox uses application/x-extension-EXT

An old 3D format: content_type_ext:".mus" with content_ffb:"52454d20" (‘REM ’).

MUS (Finale): content_type_ext:".mus" content_ffb:"454e4947" (‘ENIG’).

Psion s5: content_type:"application/octet-stream" AND content_type_ext:".s5".

Psion MC/HC/Series 3 data files.


dBASE earlier versions




ArcFS, Spark, etc.

vim /usr/share/file/magic/archive

    # Acorn archive formats (Disaster prone simpleton, [email protected])
    # I can't create either SPARK or ArcFS archives so I have not tested this stuff
    # [GRR:  the original entries collide with ARC, above; replaced with combined
    #  version (not tested)]
    #0  byte    0x1a    RISC OS archive (spark format)
    0       string          \032archive     RISC OS archive (ArcFS format)
    0       string          Archive\000     RISC OS archive (ArcFS format)
Spark: most match 0x1a (82 88 FF).
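For illustration, the magic(5) rules quoted above could be applied directly in code; like the magic file’s author, I haven’t tested this against real SPARK or ArcFS files:

```python
def identify_riscos_archive(data):
    """Apply the magic(5) rules quoted above to a file's leading bytes."""
    # '\032archive' or 'Archive\000' at offset 0:
    if data.startswith(b'\x1aarchive') or data.startswith(b'Archive\x00'):
        return 'RISC OS archive (ArcFS format)'
    if data[:1] == b'\x1a':
        # The commented-out rule: a bare 0x1a byte is too weak on its own
        # (it collides with ARC), hence the 0x1a 0x82/0x88/0xFF variants
        # seen for Spark in the archive -- flag as tentative.
        return 'RISC OS archive (spark format)?'
    return None
```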

Followed by:

“All these were taken from idarc, many could not be verified. Unfortunately, there were many low-quality sigs, i.e. easy to trigger false positives.”




MVR: content_type_ext:".mvr": %00%00%00%00 (498), sb%CC%01 (1; content_ffb:"7362cc01").


SPK: content_type_ext:".spk": MESSY: %1A%FF!! (81), %1A%82!C (46), %1A%82!S (25), Arch (24).

PGC: content_type_ext:".pgc": PG%01%FF (283), PG%01%82 (35), PG%01%88 (20), PG%01%8E (18), PG%01%86 (15) …LOTS…

SIB: content_type_ext:".sib": %0FSIB (586).

INC: MESS.


  • .58: weird
  • .uk: weird
  • .shp: pretty clean
  • .wve: clean, A-Law
  • .t64: nearly clean
  • .isl: clean


NULLS: content_ffb:"00000000": .mvr (498), .gsv (195), .sav (190), .ob1 (38) … LOTS …

Binary NGram Shingling

Along the same lines, we are experimenting with shingling the hex-encoded first few (32) bytes. We hex-encode the first 32 bytes of every resource as space-separated tokens and pass that to Solr, which treats each hex-encoded byte as a single token. Solr then ‘shingles’ the tokens into overlapping sequences of four to eight tokens, corresponding to all byte sequences between four and eight bytes long within the 32-byte header.

The total header size of 32 bytes, and the minimum and maximum shingle lengths of four and eight bytes, have been chosen in an attempt to balance reducing weak potential signatures (e.g. short byte sequences that might match too often by chance) against the significant storage requirement that arises from indexing all possible shingles. For smaller collections, it would be possible to extend this technique to much longer shingles throughout the whole length of the file.
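The analysis chain can be approximated outside Solr; this sketch reproduces the tokenisation and shingling steps, using a plain Python set rather than an inverted index:

```python
def header_shingles(data, header=32, min_n=4, max_n=8):
    """Approximate the Solr analysis chain described above: hex-encode the
    first `header` bytes as space-separated tokens, then emit every
    overlapping token sequence ('shingle') of min_n..max_n bytes."""
    tokens = ['%02x' % b for b in data[:header]]
    shingles = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            shingles.add(' '.join(tokens[i:i + n]))
    return shingles
```

For a full 32-byte header this yields at most 135 shingles per resource, which is where the storage cost comes from.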

Initial results, from a small corpus:

  • Long sequences of asterisks are notably indicative of .js!
  • HTML/PDF signatures bear a strong relation to the manual ones, but generally spot more possible ‘signals’.
  • Not terribly useful as a facet, as all shorter matching facets are presented even when longer facets (that encapsulate the shorter ones) exist. It may be possible to do some fancy facet filtering to make this more powerful.
  • Certainly, the field results could be mined and, if the offset is known, shingles concatenated into longer ones as appropriate.


Indexing/Similarity Note:

We are interested in identifying the work, not in making a substitutable copy available. Cryptographic hashes are one way of doing this, but less precise hashing methods and signals can be combined to be just as specific, while also telling us something about the content without ever giving the content away. Encrypted in plain sight.

Keyword spotting

Pull keywords out of source code formats? Just using the full-text index plus extensions. FUTURE: consider token frequencies, including punctuation.
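A crude sketch of the idea, with made-up keyword lists (a real version would derive them from the full-text index, and weigh in the extension):

```python
import re
from collections import Counter

# Hypothetical keyword lists for a few source formats:
KEYWORDS = {
    'c':      {'#include', 'struct', 'printf', 'void'},
    'python': {'def', 'import', 'self', 'elif'},
    'css':    {'font-size', 'margin', 'padding', 'color'},
}

def spot_language(text):
    """Score each candidate language by how often its keywords occur,
    and return the best guess (or None if nothing matches)."""
    # Tokenise keeping '#' and '-' so tokens like '#include' survive:
    tokens = Counter(re.findall(r'[#\w-]+', text))
    scores = {lang: sum(tokens[k] for k in kws)
              for lang, kws in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```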

Parse Error Analysis (content_type:"null")






Fuzzy Hash Analysis

Need to show

  • different content making the same hash?
  • degree of change of a site over time, e.g. homepages.
  • finding similar content across domains?

Data in warc-discovery/analysis-tools

This works well for catching small changes in the ‘wales’ and ‘york’ examples. However, we need to compare it with a raw crypto hash of the payload to see if the fuzzy hash has an advantage.

The case for the others is less clear, partially due to gaps in the record. We really need diffs/overlaps next to the change graph.

Could also experiment with profiling location of edits to determine if changes are at top or bottom of the page.
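As a first approximation, difflib can stand in for a proper diff of consecutive crawls, reporting where in the old version each change falls:

```python
import difflib

def edit_positions(old, new):
    """Return the relative positions (0.0 = top, 1.0 = bottom) of the
    regions of `old` that changed between two versions of a page."""
    sm = difflib.SequenceMatcher(None, old, new)
    positions = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag != 'equal':
            positions.append(i1 / max(len(old), 1))
    return positions
```

Clustering these positions over many crawls would show whether a site’s churn is concentrated at the top (e.g. headlines) or the bottom (e.g. footers) of the page.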

Future work

Do this with NIST

On Tools
Creative Commons Licence
Keeping Codes
by Andrew N. Jackson