The Zombie Stack Exchanges That Just Won't Die
What measures, technologies, or techniques are applicable to guarding information against damage/loss during organisation?
Often when I'm archiving data, I first need to de-duplicate and organise it. I'm aware that this is a risky time for the data: human, software, or mechanical error might lead to some data being corrupted or lost without me even knowing about it.
So there are two questions:
How can I keep an eye on the data so that I can audit where it was versus where it is now, and see if any data has been deleted or changed during organisation?
How can I reduce human error while organising the data?
occulus
As you mention in your post, taking regular checksums should be the first step for ensuring continued integrity.
There are a number of tools that can help you manage files, check for duplicates and calculate checksums, including CINCH, the Duke Data Accessioner and Karen's Directory Printer. If you're concerned about getting rid of duplicate files without accidentally deleting non-duplicates, one strategy might be to output a file list with checksums, use it to detect and delete duplicates, and compare the resulting list with the original to confirm that no unique checksums were lost. You could save both outputs as evidence of the process. I don't know if any tools do this automatically but it shouldn't be difficult to script.
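The before/after comparison described above isn't hard to script. Here is a minimal sketch in Python; the directory names and the simulated clean-up mistake are purely illustrative, and a real run would replace the demo section with your actual organisation step:

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def manifest(root):
    """Map each SHA-256 checksum to every path under root holding that content."""
    index = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            index.setdefault(digest, []).append(str(path.relative_to(root)))
    return index

def lost_content(before_root, after_root):
    """Checksums present before organisation but missing afterwards."""
    before = manifest(before_root)
    after = manifest(after_root)
    return {digest: before[digest] for digest in set(before) - set(after)}

# tiny self-contained demo: "a.txt" and "copy.txt" are duplicates,
# "b.txt" is unique content that the (simulated) clean-up deletes by mistake
work = Path(tempfile.mkdtemp())
(work / "incoming").mkdir()
(work / "incoming" / "a.txt").write_text("same bytes")
(work / "incoming" / "copy.txt").write_text("same bytes")
(work / "incoming" / "b.txt").write_text("unique bytes")

shutil.copytree(work / "incoming", work / "organised")
(work / "organised" / "copy.txt").unlink()  # safe: a duplicate copy
(work / "organised" / "b.txt").unlink()     # mistake: unique content gone

for digest, paths in lost_content(work / "incoming", work / "organised").items():
    print("LOST:", digest[:12], paths)
```

Because duplicates share a checksum, deleting a genuine duplicate removes nothing from the checksum set; only the loss of unique content shows up, which is exactly the event you want to catch.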
Overall, I've had a number of devastating data losses during organisation myself, so what follows is empirically grounded. There are some emerging practices in archives that might be helpful:
The idea would be to capture a copy of what you want to preserve in its most pristine condition, document that, and then make decisions about what you actually want to keep.
I think the most concise info I've seen on this is in the technical section of the free OCLC report You've Got to Walk Before You Can Run: First Steps for Managing Born-Digital Content Received on Physical Media (PDF).
More broadly, the Open Archival Information System (OAIS) reference model suggests thinking in terms of submission information packages (what you start with or are given), archival information packages (what you keep for the long haul), and dissemination information packages (what you make available in a given context). What's useful about this framework, to my mind, is that it frames the organisation process as getting the submission together and then creating the archival copy. You want to do the things mentioned at the top to make sure you get the submission right before you start turning it into the archival package you plan to keep, and documenting what you got and what you kept is an important part of that.
As you suggest, checksums are definitely going to play a useful role in meeting this challenge. This Stack Q&A has some initial suggestions on approach and tools, although you may need to adapt them slightly for your specific needs here.
If the data changes as part of the organisation process, checksums suddenly become a lot less useful. This is a very simple but effective tool for matching source and destination filenames, and may be of some help here.
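Since a checksum can't pair a source file with a destination file whose contents were altered, matching on filename is one simple fallback. This is only an illustrative sketch (the function name and directory arguments are my own, not from any particular tool):

```python
from pathlib import Path

def match_by_name(source_root, dest_root):
    """Pair source files with destination files that share a filename.

    Useful when contents were deliberately changed during organisation,
    so checksum comparison can no longer link source to destination.
    Returns (matched, unmatched): matched maps each source path to the
    destination paths sharing its filename; unmatched lists source files
    with no same-named counterpart, i.e. candidates for accidental loss.
    """
    dest_index = {}
    for path in Path(dest_root).rglob("*"):
        if path.is_file():
            dest_index.setdefault(path.name, []).append(str(path))

    matched, unmatched = {}, []
    for path in Path(source_root).rglob("*"):
        if path.is_file():
            if path.name in dest_index:
                matched[str(path)] = dest_index[path.name]
            else:
                unmatched.append(str(path))
    return matched, unmatched
```

Filename matching is obviously weaker evidence than a checksum match (renames and name collisions confuse it), so it works best as a secondary report alongside the checksum audit rather than a replacement for it.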
There are many possible approaches for dealing with many different flavours of de-duplication. These are some descriptions of duplication challenges and some experiences of solving them, which might be of use.
As things stand, I think there is still a need for a more comprehensive organisation or curatorial tool to tackle this challenge effectively. It needs to enable all the potentially catastrophic changes (such as delete and rename) while allowing them to be undone, and it should capture a change log or detailed provenance metadata. In my experience of working with digital preservation practitioners from libraries and archives, this is a pretty common use case.
I would emphasize the importance of workflow planning and documentation.
A Submission Information Package (SIP), as described by the OAIS reference model, can develop out of multiple steps: initial transfer of data, possibly moving that transfer from a dropbox to a workspace, format conversion, metadata creation/extraction. Understanding each step of the process will help clarify at what point data is being manipulated and potentially damaged and thus at what point it makes sense to check for damage before the process has moved too far ahead.
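To make that step-by-step checking concrete, one lightweight pattern is to write a checksum manifest when the transfer arrives and re-verify it after every subsequent step. This is only a sketch; the function names and the sha256sum-style manifest format are my own choices, not a standard tool's:

```python
import hashlib
from pathlib import Path

def write_manifest(root, manifest_path):
    """Record 'checksum  relative-path' for every file under root."""
    lines = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            lines.append(f"{digest}  {path.relative_to(root)}")
    Path(manifest_path).write_text("\n".join(lines) + "\n")

def verify_manifest(root, manifest_path):
    """Return a report of files that changed or disappeared since the manifest."""
    problems = []
    for line in Path(manifest_path).read_text().splitlines():
        if not line:
            continue
        digest, name = line.split("  ", 1)
        path = Path(root) / name
        if not path.is_file():
            problems.append(f"missing: {name}")
        elif hashlib.sha256(path.read_bytes()).hexdigest() != digest:
            problems.append(f"changed: {name}")
    return problems

# run verify_manifest() after each step -- initial transfer, move from
# dropbox to workspace, format conversion -- and stop the workflow as
# soon as it reports problems, before the damage propagates further
```

Deliberate changes (such as format conversion) will of course show up as "changed"; at that point you write a fresh manifest, which doubles as documentation of what the step did.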
This is not the formal answer, but as [primarily] a software developer I feel the need to note that I commonly use Git for this, because it's a version control system which can be used completely locally and uses strong checksums to track file contents.
In practice, this means that I can work with incoming files like this (e.g. I've had to make technical corrections to partner-provided metadata, much of it in languages I cannot read):
git init the directory, then git add . and git commit to track the initial received version.

This is a very lightweight process which provides full history and integrity checks, and has the nice aspect that you can easily synchronize copies, with full history, to other locations. It works on Linux, Mac, Windows, etc. (GUI tools are available for all of these) and has very little friction once you've installed the software and learned a few basic operations.
The downside is that this approach is slow for very large binary files or for operations that alter many tens of thousands of files. Tools like bup use the same format but are optimized for binary content, and I would recommend investigating them if you are faced with that challenge.
Addressing this problem was a core design driver for the Curator's Workbench software at UNC. We needed to allow re-arrangement while leaving all original sources intact, i.e. read-only. This is done by capturing the structure in a METS manifest; all subsequent manipulation edits only the manifest, never the data.
This also has the benefit of allowing staging and checksums to proceed in the background, while you perform appraisal and arrangement.
For more information: http://www.lib.unc.edu/blogs/cdr/index.php/about-the-curators-workbench/