What tools can we use to appraise content before digital preservation?
Short-term storage seems cheap, but long-term storage for digital
preservation is
expensive.
Part of the solution to this problem is using archival appraisal to
identify what content should be preserved and what content can be
discarded, but how do we appraise gigabytes or terabytes of data?
Visualization tools like WinDirStat and
SpaceSniffer let
you scan a folder structure quickly to prioritize potentially redundant
data (e.g. system files) or core content (e.g. My Documents). Other
tools like C3PO let you survey
technical metadata, allowing you to look at rough estimates of format
types to see what formats a creator used the most. Are there other tools
used to quickly appraise data?
Nick Krabbenhoeft
Comments
- mopennock: Could you clarify the purpose of the appraisal, other than reducing
storage space? The purpose will make it easier for people to identify
appropriate tools.
- Nick Krabbenhoeft: I changed the wording to "identify what content should be preserved and
what content can be discarded." Does that clarify it enough? I'm
interested in how we decide what we want to preserve and what we don't
want to.
- mopennock: Sure, will add suggestion below.
- Bill Lefurgy: Interesting question. Would like to see some use cases for applying
these tools. Often hear that de-duping & selective deletion isn't cost
effective, but unsure if this has been measured.
Answer by jweise
One possibility is
Archivematica.
"Archivematica uses a micro-services design pattern to provide an
integrated suite of software tools that allows users to process digital
objects from ingest to access in compliance with the ISO-OAIS functional
model."
A second type of approach is described in a paper called "Automating
Digital Processing at the Bentley Historical
Library" that was presented by
Michael Shallcross and Nancy Deromedi at iPRES 2012. They assembled a
Windows-based processing workflow called "AutoPro" comprised of numerous
off-the-shelf tools and custom batch scripts to facilitate appraisal.
Their second slide lists the tools they are using under "4. Digital
Processing." I am not replicating the list here because the three brief
documents they provide are very concise and it would be a shame to lose
the context.
Comments
- Ben Fino-Radin: As much as I am Archivematica's biggest cheerleader, it is not an
appraisal tool. Archivematica is a processing pipeline, and in theory
one one would not pass materials until they were ready to enter one's
repository.
- Courtney C. Mumma: Thanks for the props for Archivematica, but I'll correct Ben insofar as
we aren't an appraisal tool, yet. We actually have plans for appraisal,
arrangement and description functionality on our roadmap for 1.0 release
in the fall. We are looking to mimic digital forensics tools and
visualization tools (or use them when open source versions are
available), and especially interested in the open source tools being
developed within the context of the BitCurator project.
- jweise: Taking the full description into account, isn't the original question
about both appraisal and processing?
- Nick Krabbenhoeft: @CourtneyC.Mumma If it isn't too early talk about your roadmap, can you
describe some of those tools you're looking at?
- Courtney C. Mumma: Hi Nick, We are still in the planning and testing phase, but you can
find our overview here:
https://www.archivematica.org/wiki/Transfer_and_SIP_creation and the
BitCurator tools are described on their wiki:
http://wiki.bitcurator.net/index.php?title=Main_Page
- Ben Fino-Radin: @CourtneyC.Mumma more reasons to love you guys ;)
Answer by Greg Jansen
The Curator's Workbench is useful for appraisal in some scenarios. It
doesn't report much technical metadata, but it does help you capture the
file structure and make a new arrangement of items. It will stage files
and calculate checksums while you work with the folder structure and
names. For more information, see the link:
http://www.lib.unc.edu/blogs/cdr/index.php/about-the-curators-workbench/
Comments
Answer by Andy Jackson
For the web archives I work with, we use Apache
Tika to extract properties of interest along
with text (for search indexing) along with a few extensions of our
own. This works well from Java
and on streamed data, which suites our HDFS-hosted WARC files very well.
Comments
Answer by mopennock
A few prototypes have been hacked together at SPRUCE
mashups along these
lines and, like Andy's work with the web archive, have used Apache Tika.
This one seems more pertinent - Extracting and aggregating metadata
with Apache
Tika.
The datasets in question were mainly text and PDF.
Comments
- Nick Krabbenhoeft: Fantastic! I'd love to see a demo of the n-gram word cloud. sounds like
a great potential tool.
- Peter Cliff: The tool @mopennock mentions received some cash for further development.
Not sure where that got to, but I also did a little more work on it and
you can run this from a Java GUI -
https://github.com/openplanets/SPRUCE/tree/master/ioe_hwj_text_colls/appraisomatic
- don't try it on anything important though as it hasn't been well
tested and will gladly overwrite content, etc. Get in touch if you've
any questions!