The Zombie Stack Exchanges That Just Won't Die
What are some ways to automatically generate descriptive metadata for WARCs, or what are the best tools for parsing WARCs?
I'm looking to generate as much descriptive metadata (Dublin Core) as possible for a given crawl, to then be ingested into a repository.
I've come across the Internet Archive's warc Python library and warc-tools, another Python library.
warc looks like it can put out a fair bit of what could be used as descriptive metadata. But what about parsing some actual HTML tags (e.g., <title>foo</title>)?
ruebot
Refer to this discussion on library.stackexchange: Characterization of WARC files contents?
I played a bit with warc. With the following Python script (it's quick and dirty) you can analyse all response records with Tika and save the JSON output in a directory (files named as record-uuid.json).

For HTML content the result is good, but images are recognized as application/octet-stream. I guess that is because record.payload also includes the HTTP headers (see the header-stripping sketch after the script).
import warc
import subprocess
import sys

if len(sys.argv) < 2:
    sys.exit('Usage: %s warcfile' % sys.argv[0])

warcfile = sys.argv[1]

f = warc.open(warcfile)
for record in f:
    if record.header.type == "response":
        # Bare UUID from the WARC-Record-ID header (<urn:uuid:...>).
        uuid = record.header.record_id.split(":")[2][:-1]
        # Pipe the raw payload through Tika: -m extracts metadata only,
        # -j outputs it as JSON.
        process = subprocess.Popen(["tika", "-m", "-j"],
                                   stdin=subprocess.PIPE,
                                   stdout=subprocess.PIPE)
        process.stdin.write("{}\n".format(record.payload.read()))
        process.stdin.flush()
        process.stdin.close()
        out = open("metadata/{}.json".format(uuid), "w")
        out.write(process.stdout.read())
        print uuid
        out.close()
        process.wait()
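Note that the script expects a metadata/ directory to already exist and the tika command to be on your PATH, since it writes to metadata/ and shells out to Tika; invoke it as python script.py crawl.warc.gz (those file names are just placeholders).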
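If the application/octet-stream results are a problem, one workaround is to strip the HTTP headers off the payload before handing the bytes to Tika. A minimal sketch, not part of the original script, assuming the standard blank line separates headers from body (the input file name is a placeholder):

import warc

def split_http_payload(payload):
    # An HTTP response separates headers from body with a blank
    # line (\r\n\r\n); if none is found, treat it all as body.
    idx = payload.find("\r\n\r\n")
    if idx == -1:
        return "", payload
    return payload[:idx], payload[idx + 4:]

f = warc.open("example.warc.gz")  # placeholder file name
for record in f:
    if record.header.type == "response":
        headers, body = split_http_payload(record.payload.read())
        # hand `body` (not the full payload) to Tika as above
f.close()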
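And for the <title> part of the original question: a minimal sketch using Python 2's standard-library HTMLParser on the de-headered body. Everything beyond the warc calls is my own illustration, not something the warc library provides:

import warc
from HTMLParser import HTMLParser

class TitleParser(HTMLParser):
    # Collect the text inside <title> elements.
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

f = warc.open("example.warc.gz")  # placeholder file name
for record in f:
    if record.header.type == "response":
        payload = record.payload.read()
        # Drop the HTTP headers before parsing, as above.
        body = payload.split("\r\n\r\n", 1)[-1]
        if "<title" in body.lower():
            parser = TitleParser()
            try:
                parser.feed(body)
            except Exception:
                continue  # skip records with badly broken markup
            print parser.title.strip()
f.close()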