The Zombie Stack Exchanges That Just Won't Die
What are some ways to automatically generate descriptive metadata for WARCs, or what are the best tools for parsing WARCs?
I'm looking to generate as much descriptive metadata (Dublin Core) as possible for a given crawl, to then be ingested into a repository.
I've come across the Internet Archive's warc Python library and warc-tools, another Python library.
warc looks like it can put out a fair bit of what could be used as descriptive metadata. But what about parsing some actual HTML tags (e.g., <title>foo</title>)?
ruebot
Refer to this discussion on library.stackexchange: Characterization of WARC files contents?
I played a bit with warc. With the following Python script (it's quick and dirty) you can analyse all response records with Tika and save the JSON output in a directory (files named as record-uuid.json).

For HTML content the result is good, but images are recognized as application/octet-stream. I guess that is because record.payload also includes the HTTP headers (see the header-stripping sketch after the script).
import warc
import subprocess
import sys

if len(sys.argv) < 2:
    sys.exit('Usage: %s warcfile' % sys.argv[0])

warcfile = sys.argv[1]

f = warc.open(warcfile)
for record in f:
    if record.header.type == "response":
        # Bare UUID from the WARC-Record-ID header (<urn:uuid:...>).
        uuid = record.header.record_id.split(":")[2][:-1]
        # Pipe the raw payload through Tika: -m extracts metadata only,
        # -j outputs it as JSON.
        process = subprocess.Popen(["tika", "-m", "-j"],
                                   stdin=subprocess.PIPE,
                                   stdout=subprocess.PIPE)
        process.stdin.write("{}\n".format(record.payload.read()))
        process.stdin.flush()
        process.stdin.close()
        out = open("metadata/{}.json".format(uuid), "w")
        out.write(process.stdout.read())
        print uuid
        out.close()
        process.wait()
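Note that the script expects a metadata/ directory to already exist and the tika command to be on your PATH, since it writes to metadata/ and shells out to Tika; invoke it as python script.py crawl.warc.gz (those file names are just placeholders).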
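If the application/octet-stream results are a problem, one workaround is to strip the HTTP headers off the payload before handing the bytes to Tika. A minimal sketch, not part of the original script, assuming the standard blank line separates headers from body (the input file name is a placeholder):

import warc

def split_http_payload(payload):
    # An HTTP response separates headers from body with a blank
    # line (\r\n\r\n); if none is found, treat it all as body.
    idx = payload.find("\r\n\r\n")
    if idx == -1:
        return "", payload
    return payload[:idx], payload[idx + 4:]

f = warc.open("example.warc.gz")  # placeholder file name
for record in f:
    if record.header.type == "response":
        headers, body = split_http_payload(record.payload.read())
        # hand `body` (not the full payload) to Tika as above
f.close()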
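And for the <title> part of the original question: a minimal sketch using Python 2's standard-library HTMLParser on the de-headered body. Everything beyond the warc calls is my own illustration, not something the warc library provides:

import warc
from HTMLParser import HTMLParser

class TitleParser(HTMLParser):
    # Collect the text inside <title> elements.
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

f = warc.open("example.warc.gz")  # placeholder file name
for record in f:
    if record.header.type == "response":
        payload = record.payload.read()
        # Drop the HTTP headers before parsing, as above.
        body = payload.split("\r\n\r\n", 1)[-1]
        if "<title" in body.lower():
            parser = TitleParser()
            try:
                parser.feed(body)
            except Exception:
                continue  # skip records with badly broken markup
            print parser.title.strip()
f.close()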