We use Web Archive data to drive tool development.
According to our experiments, there are at least two thousand distinct formats in our historical web archive, and at least three thousand if you distinguish between formats (e.g. HTML) and their specific versions and character sets (e.g. UTF-8). It’s worth pointing out that we don’t know exactly how precise our identification tools are, and that a small proportion of the content remains unidentified. Therefore, we don’t yet know exactly how many formats there are. However, the contents of the archives are dominated by HTML, web image formats, PDF, Office documents and so on, and in the ‘tail’ of the distribution, thousands of formats account for a very small fraction of the content (not sure exactly how much, but certainly less than 1%). Nevertheless, just because a format is rare, we cannot assume it is of little value, and I wonder if that is where the real preservation challenge for web archives lies. The ‘big’ formats look after themselves, and require little effort to preserve over moderate timescales. But in the tail of the format distribution, formats are less likely to survive without our intervention, and so we need to be able to work out where to invest our efforts.
“I ended up passing my snowballAnalyzer and standardAnalyzers as parameters to ShingleFilterWrappers and processing the outputs via a TermVectorMapper.” http://searchhub.org/2009/05/26/accessing-words-around-a-positional-match-in-lucene/
n.b. the ‘long tail’ of file extensions in the JISC data set is a foul and hideous mess. http://192.168.1.151:8984/solr/#/jisc/schema-browser?field=content_type_ext Need to strip %xx from extensions, e.g. .php%3ffrom=1170. It seems #fragments are not being caught/stripped? Also ‘@’, ‘=’, ‘$’ are among the ‘unlikely to be valid extension characters’ posse.
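A minimal sketch of the kind of extension clean-up described above, assuming the extension is derived from the full URL. The function name `clean_extension`, the whitelist of allowed characters and the ten-character length limit are illustrative assumptions, not what the indexer currently does.

```python
import re
from urllib.parse import unquote, urlsplit

# Assumption: plausible extensions are short and use only these characters,
# which excludes '@', '=', '$' and similar.
_VALID_EXT = re.compile(r"^\.[a-z0-9_+-]{1,10}$")


def clean_extension(url):
    """Return a normalised extension such as '.php', or None if none is plausible."""
    path = urlsplit(url).path            # drops any real ?query and #fragment
    path = unquote(path)                 # decodes %xx, e.g. %3f becomes '?'
    path = re.split(r"[?#]", path, maxsplit=1)[0]  # drop decoded query/fragment
    name = path.rsplit("/", 1)[-1]       # last path segment
    if "." not in name:
        return None
    ext = "." + name.rsplit(".", 1)[-1].lower()
    return ext if _VALID_EXT.match(ext) else None
```

With this sketch, a URL ending in `.php%3ffrom=1170` yields `.php`, and an extension containing ‘=’ or ‘@’ is rejected rather than indexed.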
In the ‘null’ set: http://192.168.1.151:8984/solr/#/jisc/query?q=content_type_ext:%22.wps%22
As of 15:11 on 24 Sept 2013: 295,991,273 16:16 295991273
http://www.webarchive.org.uk/aadda-discovery/formats?f[0]=content_type_ext%3A%22.ico%22
http://www.webarchive.org.uk/aadda-discovery/formats?f[0]=content_type_ext%3A%22.sib%22 http://www.webarchive.org.uk/aadda-discovery/formats?f[0]=content_ffb%3A%220f534942%22 http://www.webarchive.org.uk/aadda-discovery/formats?f[0]=content_type_ext%3A%22.mus%22&f[1]=content_ffb%3A%2252454d20%22
Language: http://www.webarchive.org.uk/aadda-discovery/browse?f[0]=links_public_suffixes%3A%22fr%22&f[1]=content_language%3A%22fr%22
Mining for Signatures
Starting with unidentified formats: http://www.webarchive.org.uk/aadda-discovery/formats?f[0]=content_type%3A%22application/octet-stream%22, we can script a series of queries for different extensions that attempt to build plausible signatures for each, based on the FFB.
http://www.webarchive.org.uk/aadda-discovery/formats?f[0]=content_type%3A%22application/octet-stream%22 application/octet-stream: 146,541
http://www.webarchive.org.uk/aadda-discovery/formats?f[0]=content_type%3A%22null%22 null: 351,779, almost all OLE2 with a few ZIP.
- 11109 .s5, with P%000010 or L%000010 as FFB
- 5302 .ipx, with very similar FFB, sb%xx01/2/6
- 4394 .nwc, all with ‘[Not’ with a few ‘[NWZ’ FFB (http://www.noteworthysoftware.com/player/)
- 3822 .dbf, with mostly ‘OPLD’ FFB and some other similar binary ones.
- 2482 .adx, with largely no FFB?
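The tallying step behind counts like these could be sketched as follows. This assumes the query results have already been fetched as (extension, hex FFB) pairs; the function name `candidate_signatures` and the `min_share` threshold are hypothetical, and the sketch makes no attempt to judge whether a frequent FFB (e.g. all zeros) is actually a meaningful signature.

```python
from collections import Counter, defaultdict


def candidate_signatures(records, min_share=0.05):
    """Group first-four-byte (FFB) values by extension.

    records: iterable of (extension, ffb_hex) pairs.
    Returns {extension: [(ffb_hex, share), ...]}, most common first,
    keeping only FFB values seen in at least min_share of that extension.
    """
    by_ext = defaultdict(Counter)
    for ext, ffb in records:
        by_ext[ext][ffb] += 1

    result = {}
    for ext, counts in by_ext.items():
        total = sum(counts.values())
        result[ext] = [
            (ffb, n / total)
            for ffb, n in counts.most_common()
            if n / total >= min_share
        ]
    return result
```

An extension with one dominant FFB is a plausible candidate for a signature; an extension with many low-share FFB values is more likely a ‘messy’ one that needs manual inspection.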
Psion 5
Starting at ‘application/octet-stream’ the most common unknown extension was .s5 - this had consistent magic (two FFB variants).
- psiconv may be able to extract info/text.
- s5 Header Format
Firefox uses application/x-extension-EXT
An old 3D format. http://www.webarchive.org.uk/aadda-discovery/formats?f[0]=content_type_ext%3A%22.mus%22&f[1]=content_ffb%3A%2252454d20%22
MUS Finale content_type_ext:"mus" content_ffb:"454e4947"
Psion s5 http://www.webarchive.org.uk/aadda-discovery/formats?f[0]=content_type%3A%22application/octet-stream%22&f[1]=content_type_ext%3A%22.s5%22
Psion MC/HC/Series 3 data files. ftp://ftp.cs.tu-berlin.de/pub/palmtops/psion/src.doc.ic-mirror/Unsorted/mcfile.txt http://www.sat.dundee.ac.uk/~arb/psion/
DBF www.webarchive.org.uk/aadda-discovery/formats?f[0]=content_type%3A%22application/octet-stream%22&f[1]=content_type_ext%3A%22.dbf%22
dBASE http://www.dbase.com/Knowledgebase/INT/db7_file_fmt.htm earlier versions http://www.clicketyclick.dk/databases/xbase/format/dbf.html#DBF_STRUCT
IPIX http://en.wikipedia.org/wiki/IPIX http://www.ipix.com/support/downloads.cfm www.webarchive.org.uk/aadda-discovery/formats?f[0]=content_type%3A%22application/octet-stream%22&f[1]=content_type_ext%3A%22.ipx%22
AXD www.webarchive.org.uk/aadda-discovery/formats?f[0]=content_type%3A%22application/octet-stream%22&f[1]=content_type_ext%3A%22.axd%22
NWC www.webarchive.org.uk/aadda-discovery/formats?f[0]=content_type%3A%22application/octet-stream%22&f[1]=content_type_ext%3A%22.nwc%22
http://www.webarchive.org.uk/aadda-discovery/formats?f[0]=content_type_full%3A%22application/zip%22
ArcFS, Spark, etc.
vim /usr/share/file/magic/archive
> # Acorn archive formats (Disaster prone simpleton, [email protected])
> # I can't create either SPARK or ArcFS archives so I have not tested this stuff
> # [GRR: the original entries collide with ARC, above; replaced with combined
> # version (not tested)]
> #0 byte 0x1a RISC OS archive (spark format)
> 0 string \032archive RISC OS archive (ArcFS format)
> 0 string Archive\000 RISC OS archive (ArcFS format)
Spark, most match 0x1a(82 | 88 | FF).
Followed by:
All these were taken from idarc, many could not be verified. Unfortunately,
there were many low-quality sigs, i.e. easy to trigger false positives.
http://www.bttr-software.de/freesoft/arc2.htm#idarc
SPSS http://www.webarchive.org.uk/aadda-discovery/formats?f[0]=content_type%3A%22application/octet-stream%22&f[1]=content_type_served%3A%22application/spss%22
MatLab http://www.webarchive.org.uk/aadda-discovery/formats?f[0]=content_type%3A%22application/octet-stream%22&f[1]=content_type_served%3A%22application/matlab%22
VQF/TWIN http://www.webarchive.org.uk/aadda-discovery/formats?f[0]=content_type%3A%22application/octet-stream%22&f[1]=content_ffb%3A%225457494e%22
MVR http://www.webarchive.org.uk/aadda-discovery/formats?f[0]=content_type%3A%22application/octet-stream%22&f[1]=content_type_ext%3A%22.mvr%22
- %00%00%00%00 (498)
- sb%CC%01 (1)
http://www.webarchive.org.uk/aadda-discovery/formats?f[0]=content_type%3A%22application/octet-stream%22&f[1]=content_ffb%3A%227362cc01%22
SNP http://www.webarchive.org.uk/aadda-discovery/formats?f[0]=content_type%3A%22application/octet-stream%22&f[1]=content_type_ext%3A%22.snp%22
SPK http://www.webarchive.org.uk/aadda-discovery/formats?f[0]=content_type%3A%22application/octet-stream%22&f[1]=content_type_ext%3A%22.spk%22 MESSY
- %1A%FF!! (81)
- %1A%82!C (46)
- %1A%82!S (25)
- Arch (24)
PGC http://www.webarchive.org.uk/aadda-discovery/formats?f[0]=content_type%3A%22application/octet-stream%22&f[1]=content_type_ext%3A%22.pgc%22
- PG%01%FF (283)
- PG%01%82 (35)
- PG%01%88 (20)
- PG%01%8E (18)
- PG%01%86 (15)
- …LOTS…
SIB http://www.webarchive.org.uk/aadda-discovery/formats?f[0]=content_type%3A%22application/octet-stream%22&f[1]=content_type_ext%3A%22.sib%22 %0FSIB (586)
INC http://www.webarchive.org.uk/aadda-discovery/formats?f[0]=content_type%3A%22application/octet-stream%22&f[1]=content_type_ext%3A%22.inc%22 MESS
RPM http://www.webarchive.org.uk/aadda-discovery/formats?f[0]=content_type%3A%22application/octet-stream%22&f[1]=content_type_ext%3A%22.rpm%22
.58 http://www.webarchive.org.uk/aadda-discovery/formats?f[0]=content_type%3A%22application/octet-stream%22&f[1]=content_type_ext%3A%22.58%22 weird
.uk http://www.webarchive.org.uk/aadda-discovery/formats?f[0]=content_type%3A%22application/octet-stream%22&f[1]=content_type_ext%3A%22.uk%22 weird
.shp http://www.webarchive.org.uk/aadda-discovery/formats?f[0]=content_type%3A%22application/octet-stream%22&f[1]=content_type_ext%3A%22.shp%22 Pretty clean
.wve http://www.webarchive.org.uk/aadda-discovery/formats?f[0]=content_type%3A%22application/octet-stream%22&f[1]=content_type_ext%3A%22.wve%22 clean ALaw
.t64 http://www.webarchive.org.uk/aadda-discovery/formats?f[0]=content_type%3A%22application/octet-stream%22&f[1]=content_type_ext%3A%22.t64%22 nearly clean
.isl http://www.webarchive.org.uk/aadda-discovery/formats?f[0]=content_type%3A%22application/octet-stream%22&f[1]=content_type_ext%3A%22.isl%22 clean
Containers
NULLS http://www.webarchive.org.uk/aadda-discovery/formats?f[0]=content_type%3A%22application/octet-stream%22&f[1]=content_ffb%3A%2200000000%22
- .mvr (498)
- .gsv (195)
- .sav (190)
- .ob1 (38)
- … LOTS …
Binary NGram Shingling
Along the same lines, we are experimenting with shingling the hex-encoded first few (32) bytes. We hex-encode the first 32 bytes of every resource and separate the encoded bytes with spaces. We pass that to Solr, which treats each hex-encoded byte as a single token. Solr then ‘shingles’ the tokens into overlapping sequences of four to eight tokens, corresponding to every contiguous byte sequence between four and eight bytes long within the 32 bytes.
The total header size of 32 bytes, and the minimum and maximum shingle lengths of four and eight bytes, have been chosen in an attempt to balance the need to avoid weak potential signatures (e.g. short byte sequences that might match too often by chance) against the significant storage requirement that arises from indexing all possible shingles. For smaller collections, it would be possible to extend this technique to much longer shingles throughout the whole length of the file.
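To make the shingling concrete, here is a small stand-alone sketch of the same idea in Python. It is an illustration only, not the Solr analysis chain itself; the function name `byte_shingles` is hypothetical, and the defaults simply mirror the 32/4/8 values given above.

```python
def byte_shingles(data, header_len=32, min_len=4, max_len=8):
    """Return every contiguous run of min_len..max_len bytes from the header.

    Each shingle is a string of space-separated, hex-encoded bytes,
    so each byte corresponds to one token.
    """
    tokens = ["%02x" % b for b in data[:header_len]]
    shingles = []
    for size in range(min_len, max_len + 1):
        for start in range(len(tokens) - size + 1):
            shingles.append(" ".join(tokens[start:start + size]))
    return shingles
```

For a full 32-byte header this produces 29 + 28 + 27 + 26 + 25 = 135 shingles per resource, which gives a feel for why the storage cost grows quickly as the header length or the maximum shingle length is increased.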
Initial results from small corpus.
- Long sequences of asterisks are notably indicative of .js!
- HTML/PDF signatures bear strong relation to manual ones, but generally spot more possible ‘signals’.
- Not terribly useful as a Facet, due to presenting all shorter matching facets even when longer facets (that encapsulate the smaller ones) exist. May be possible to do some fancy facet filtering to make this more powerful.
- Certainly, the field results could be mined and, if the offset is known, shingles concatenated into longer ones as appropriate.
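The last point, concatenating shingles when their offsets are known, could look something like the sketch below. It assumes each matching shingle is available as an (offset, bytes) pair; that representation and the name `merge_shingles` are assumptions made for illustration.

```python
def merge_shingles(matches):
    """Merge overlapping or adjacent shingles into longer byte sequences.

    matches: iterable of (offset, bytes) pairs.
    Returns a list of (offset, bytes) pairs sorted by offset.
    Shingles are only merged when the overlapping bytes agree.
    """
    merged = []
    for offset, chunk in sorted(matches):
        if merged:
            start, current = merged[-1]
            end = start + len(current)
            if offset <= end:
                overlap = end - offset
                rel = offset - start
                if overlap >= len(chunk):
                    # Shingle lies entirely inside the current run.
                    if current[rel:rel + len(chunk)] == chunk:
                        continue
                elif current[rel:] == chunk[:overlap]:
                    # Consistent overlap (or adjacency): extend the run.
                    merged[-1] = (start, current + chunk[overlap:])
                    continue
        merged.append((offset, chunk))
    return merged
```

For example, the four-byte shingles `%PDF`, `PDF-` and `DF-1` at offsets 0, 1 and 2 collapse into the single six-byte sequence `%PDF-1` at offset 0.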
See also:
- http://www.forensicswiki.org/wiki/File_Format_Identification
- A New Approach to Content-based File Type Detection (2008)
- Predicting the types of file fragments (2008)
- Fast File-type Identification (2010)
- Fast Content-Based File Type Identification (2011)
Indexing/Similarity Note:
We are interested in identifying the work, not making a substitutable work available. Crypto hashes are one way of doing this, but less precise hashing methods and signals can be combined to be just as specific, while also telling us something about the content and never giving the content itself away. Encrypted in plain sight.
Keyword spotting
Pull keywords out of source code formats? Just using the full-text index PLUS extensions. FUTURE consider token frequencies including punctuation.
Parse Error Analysis
http://www.webarchive.org.uk/aadda-discovery/formats?f[0]=content_type%3A%22null%22
http://192.168.1.206:8990/solr/#/jisc/schema-browser?field=parse_error
Encryption:
PDF:
- http://192.168.1.182:8989/solr/jisc3/select?q=parse_error%3A%22org.apache.pdfbox.exceptions.CryptographyException%3A+Error%3A+The+supplied+password+does+not+match+either+the+owner+or+user+password+in+the+document.%22&wt=json&indent=true
Office:
- http://192.168.1.182:8989/solr/jisc3/select?q=parse_error%3A%22org.apache.poi.EncryptedDocumentException%3A+Cannot+process+encrypted+word+file%22&wt=json&indent=true
- http://192.168.1.182:8989/solr/jisc3/select?q=parse_error%3A%22org.apache.poi.EncryptedDocumentException%3A+Default+password+is+invalid+for+docId%2FsaltData%2FsaltHash%22&wt=json&indent=true
- http://192.168.1.182:8989/solr/jisc3/select?q=parse_error%3A%22org.apache.poi.hslf.exceptions.EncryptedPowerPointFileException%3A+The+CurrentUserAtom+specifies+that+the+document+is+encrypted%22&wt=json&indent=true
- http://192.168.1.182:8989/solr/jisc3/select?q=parse_error%3A%22org.apache.poi.hslf.exceptions.EncryptedPowerPointFileException%3A+Encrypted+PowerPoint+files+are+not+supported%22&wt=json&indent=true
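The queries above all key on the exception class at the start of the `parse_error` value, so the same grouping can be sketched in a few lines. The exception class names are taken from the queries; the mapping to ‘PDF’ and ‘Office’ labels and the function name `encrypted_family` are illustrative assumptions.

```python
# Exception classes seen in the parse_error field, mapped to a format family.
ENCRYPTION_ERRORS = {
    "org.apache.pdfbox.exceptions.CryptographyException": "PDF",
    "org.apache.poi.EncryptedDocumentException": "Office",
    "org.apache.poi.hslf.exceptions.EncryptedPowerPointFileException": "Office",
}


def encrypted_family(parse_error):
    """Return 'PDF' or 'Office' if the error indicates encryption, else None."""
    # The value has the form 'fully.qualified.ExceptionClass: message'.
    exception_class = parse_error.split(":", 1)[0].strip()
    return ENCRYPTION_ERRORS.get(exception_class)
```

Matching on the class name rather than the full message means the different Office messages (encrypted Word file, invalid default password, encrypted PowerPoint) all fall into one bucket.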
Fuzzy Hash Analysis
Need to show
- different content making the same hash?
- degree of change of a site over time, e.g. homepages.
- finding similar content across domains?
http://commons.apache.org/proper/commons-lang/javadocs/api-3.1/org/apache/commons/lang3/StringUtils.html#getLevenshteinDistance%28java.lang.CharSequence,%20java.lang.CharSequence,%20int%29
Data in warc-discovery/analysis-tools
www.york.ac.uk http://192.168.1.206:8990/solr/jisc/select?q=url%3A%22http%3A%2F%2Fwww.york.ac.uk%3A80%2F%22&sort=crawl_date+asc&rows=50&fl=ssdeep_hash_bs_*%2Ccrawl_date&wt=xml&indent=true
www.amazon.co.uk http://192.168.1.206:8990/solr/jisc/select?q=url%3A%22http%3A%2F%2Fwww.amazon.co.uk%3A80%2F%22&sort=crawl_date+asc&rows=50&fl=ssdeep_hash_bs_*%2Ccrawl_date&wt=xml&indent=true
www.bbc.co.uk http://192.168.1.206:8990/solr/jisc/select?q=url%3A%22http%3A%2F%2Fwww.bbc.co.uk%3A80%2F%22&sort=crawl_date+asc&rows=50&fl=ssdeep_hash_bs_*%2Ccrawl_date&wt=xml&indent=true
www.wales.gov.uk http://192.168.1.206:8990/solr/jisc/select?q=url%3A%22http%3A%2F%2Fwww.wales.gov.uk%3A80%2F%22&sort=crawl_date+asc&rows=50&fl=ssdeep_hash_bs_*%2Ccrawl_date&wt=xml&indent=true
Works well for catching small changes in wales and york. However, we need to compare it with a raw crypto hash of the payload to see if the fuzzy hash has an advantage.
Case for the others is less clear, partially due to gaps in the record. Really need diffs/overlaps next to change graph.
Could also experiment with profiling location of edits to determine if changes are at top or bottom of the page.
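As a rough illustration of the change-over-time analysis, the sketch below computes the Levenshtein distance between the fuzzy hashes of successive crawls of one URL. It is written in plain Python rather than with the commons-lang method linked above, and it assumes the hashes being compared were all computed at the same ssdeep block size; the function names are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def change_series(snapshots):
    """snapshots: list of (crawl_date, fuzzy_hash) pairs for one URL.

    Returns [(crawl_date, distance_from_previous_snapshot), ...].
    """
    ordered = sorted(snapshots)
    return [
        (date, levenshtein(prev_hash, this_hash))
        for (_, prev_hash), (date, this_hash) in zip(ordered, ordered[1:])
    ]
```

A distance of zero means the fuzzy hash did not change between crawls. A crypto hash of the payload can only say ‘same’ or ‘different’ here, so plotting both side by side would show whether the graded fuzzy distance adds anything.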
Future work
Do this with NIST
http://www.nsrl.nist.gov/nsrl-faqs.html#faq19
On Tools