What tools can we use to appraise content before digital preservation?

Short-term storage seems cheap, but long-term storage for digital preservation is expensive. Part of the solution to this problem is using archival appraisal to identify what content should be preserved and what content can be discarded, but how do we appraise gigabytes or terabytes of data?

Visualization tools like WinDirStat and SpaceSniffer let you scan a folder structure quickly to prioritize potentially redundant data (e.g. system files) or core content (e.g. My Documents). Other tools like C3PO let you survey technical metadata, allowing you to look at rough estimates of format types to see what formats a creator used the most. Are there other tools used to quickly appraise data?
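
As a rough illustration of the kind of survey these tools give, here is a minimal Python sketch that walks a folder and tallies file counts and total bytes per file extension. It is not how WinDirStat, SpaceSniffer, or C3PO work internally. The function names `survey_formats` and `print_report` are made up for this example, and a file extension is only a crude stand-in for real format identification.

```python
import os
from collections import defaultdict


def survey_formats(root):
    """Tally file count and total bytes per (lowercased) extension under root."""
    totals = defaultdict(lambda: {"files": 0, "bytes": 0})
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # unreadable or vanished file: skip it
            # Files without an extension are grouped under "(none)".
            ext = os.path.splitext(name)[1].lower() or "(none)"
            totals[ext]["files"] += 1
            totals[ext]["bytes"] += size
    return dict(totals)


def print_report(totals):
    """Print extensions ordered by total bytes, largest first."""
    ranked = sorted(totals.items(), key=lambda kv: kv[1]["bytes"], reverse=True)
    for ext, tally in ranked:
        print(f"{ext:10} {tally['files']:8d} files {tally['bytes']:14d} bytes")
```

Calling `print_report(survey_formats("/path/to/accession"))` (the path is a placeholder) gives a quick first look at which extensions dominate a transfer, which can help decide where to look more closely.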

Nick Krabbenhoeft

storage
born-digital
data-curation

Comments

mopennock: Could you clarify the purpose of the appraisal, other than reducing storage space? The purpose will make it easier for people to identify appropriate tools.
Nick Krabbenhoeft: I changed the wording to "identify what content should be preserved and what content can be discarded." Does that clarify it enough? I'm interested in how we decide what we want to preserve and what we don't want to.
mopennock: Sure, will add suggestion below.
Bill Lefurgy: Interesting question. Would like to see some use cases for applying these tools. Often hear that de-duping & selective deletion isn't cost effective, but unsure if this has been measured.

Answer by jweise

One possibility is Archivematica. "Archivematica uses a micro-services design pattern to provide an integrated suite of software tools that allows users to process digital objects from ingest to access in compliance with the ISO-OAIS functional model."

A second type of approach is described in a paper called "Automating Digital Processing at the Bentley Historical Library" that was presented by Michael Shallcross and Nancy Deromedi at iPRES 2012. They assembled a Windows-based processing workflow called "AutoPro" composed of numerous off-the-shelf tools and custom batch scripts to facilitate appraisal. Their second slide lists the tools they are using under "4. Digital Processing." I am not replicating the list here because the three brief documents they provide are very concise and it would be a shame to lose the context.

Comments

Ben Fino-Radin: As much as I am Archivematica's biggest cheerleader, it is not an appraisal tool. Archivematica is a processing pipeline, and in theory one would not pass materials through it until they were ready to enter one's repository.
Courtney C. Mumma: Thanks for the props for Archivematica, but I'll correct Ben insofar as we aren't an appraisal tool, yet. We actually have plans for appraisal, arrangement and description functionality on our roadmap for the 1.0 release in the fall. We are looking to mimic digital forensics tools and visualization tools (or use them when open source versions are available), and are especially interested in the open source tools being developed within the context of the BitCurator project.
jweise: Taking the full description into account, isn't the original question about both appraisal and processing?
Nick Krabbenhoeft: @CourtneyC.Mumma If it isn't too early to talk about your roadmap, can you describe some of those tools you're looking at?
Courtney C. Mumma: Hi Nick, We are still in the planning and testing phase, but you can find our overview here: https://www.archivematica.org/wiki/Transfer_and_SIP_creation and the BitCurator tools are described on their wiki: http://wiki.bitcurator.net/index.php?title=Main_Page
Ben Fino-Radin: @CourtneyC.Mumma more reasons to love you guys ;)

Answer by Greg Jansen

The Curator's Workbench is useful for appraisal in some scenarios. It doesn't report much technical metadata, but it does help you capture the file structure and make a new arrangement of items. It will stage files and calculate checksums while you work with the folder structure and names. For more information, see the link: http://www.lib.unc.edu/blogs/cdr/index.php/about-the-curators-workbench/
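
For readers unfamiliar with the checksum step, here is a minimal Python sketch of the general idea: compute a SHA-256 digest for every file under a folder and group files that share a digest, which flags exact duplicates during appraisal. This is an illustration only, not the Curator's Workbench implementation. The function names `sha256_of` and `find_duplicates` are hypothetical, and the choice of SHA-256 is an assumption.

```python
import hashlib
import os
from collections import defaultdict


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Map each digest shared by two or more files to their paths relative to root."""
    by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_digest[sha256_of(path)].append(os.path.relpath(path, root))
    # Keep only digests that occur more than once: these are byte-identical files.
    return {d: sorted(paths) for d, paths in by_digest.items() if len(paths) > 1}
```

Matching digests only show that files are byte-identical. Whether a duplicate should be discarded is still an appraisal decision, since the same file in two folders may carry different context.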

Comments

Answer by Andy Jackson

For the web archives I work with, we use Apache Tika, plus a few extensions of our own, to extract properties of interest along with text (for search indexing). This works well from Java and on streamed data, which suits our HDFS-hosted WARC files very well.

Comments

Answer by mopennock

A few prototypes have been hacked together at SPRUCE mashups along these lines and, like Andy's work with the web archive, have used Apache Tika. This one seems more pertinent: "Extracting and aggregating metadata with Apache Tika." The datasets in question were mainly text and PDF.

Comments

Nick Krabbenhoeft: Fantastic! I'd love to see a demo of the n-gram word cloud. Sounds like a great potential tool.
Peter Cliff: The tool @mopennock mentions received some cash for further development. Not sure where that got to, but I also did a little more work on it and you can run it from a Java GUI: https://github.com/openplanets/SPRUCE/tree/master/ioe_hwj_text_colls/appraisomatic Don't try it on anything important, though, as it hasn't been well tested and will gladly overwrite content, etc. Get in touch if you've any questions!