The Zombie Stack Exchanges That Just Won't Die
In the case of spreadsheets, databases, and other lengthy content printouts may not be practical - or the organization may be working with a remote patron online. Other born-digital formats just aren't conducive to printing no matter the size.
What's the recommended approach to digitally redacting info from digital documents in such a way that the removed information is not available at all in the bitstream? Most file formats do not exhibit a one to one correspondence between bytes and displayable characters, so my concern is with embedded metadata or data indicating previous states.
walker
Try the UK National Archive Redaction toolkit.
A summary of their recommendations for digitally redacting:
Conversion: convert the document to a format that contains displayable information only, e.g. ASCII or Windows Bitmap.
Roundtrip redaction:
The redacted record may be required to be made available in its original format, for example, to preserve complex formatting. In such cases, an extension of the conversion approach may be applicable. Roundtripping entails the conversion of the record to another format, followed by conversion back to the original format, such that the conversion process removes all evidence of the redacted information. Information deletion may be carried out either prior to conversion, or in the intermediary format.
Examples given are PDF > .bmp
> PDF, or in the case of
spreadsheets, Excel > .csv
> Excel.
Windows Bitmap and .csv
are selected as the intermediary formats
because they contain no provision for storing metadata - edits made
before or in these formats should be opaque once the document hits the
third or final format.
It depends on the motivation for redaction. Adrian's reference to the UK National Archives' redaction toolkit is great, but in our case we've been curious to say if there is a way we can only redact personally-identifiable information in things like databases or textual records.
I haven't tried it personally, but you may want to look at the MITRE Identification Scrubber Toolkit, which was originally written to redact electronic health records.
The commercial service available from Pingar seems quite effective. It can automatically redact digital documents. A demo is available here.
I am not a representative of theirs though and I recommend searching for alternatives also, there are a few out there.
The BitCurator toolkit includes a utility (currently bulk_extractor) for automatic identification of personally-identifiable information in documents.
Unfortunately I don't know if these tools would address your concern about previous states of information, but there's lots of information on the BitCurator wiki, so perhaps you'd find the answer there.