Best practice for digitally redacting information from born-digital documents?

For spreadsheets, databases, and other lengthy content, printouts may not be practical, or the organization may be working with a remote patron online. Other born-digital formats just aren't conducive to printing, no matter the size.

What's the recommended approach to digitally redacting info from digital documents in such a way that the removed information is not available at all in the bitstream? Most file formats do not have a one-to-one correspondence between bytes and displayable characters, so my concern is with embedded metadata or data that records previous states of the document.

walker

digital-collections
digital-archiving
privacy

Answer by Adrian Brown

Try the UK National Archives' Redaction Toolkit.

A summary of their recommendations for digitally redacting:

Conversion: convert the document to a format that contains displayable information only, e.g. ASCII or Windows Bitmap.
Roundtrip redaction:

The redacted record may be required to be made available in its original format, for example, to preserve complex formatting. In such cases, an extension of the conversion approach may be applicable. Roundtripping entails the conversion of the record to another format, followed by conversion back to the original format, such that the conversion process removes all evidence of the redacted information. Information deletion may be carried out either prior to conversion, or in the intermediary format.

Examples given are PDF > .bmp > PDF, or in the case of spreadsheets, Excel > .csv > Excel.

Windows Bitmap and .csv are selected as the intermediary formats because they contain no provision for storing metadata or edit history. Edits made before conversion, or in the intermediary format, should therefore leave no trace once the document reaches the final format.
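
As a rough illustration of the redaction step in the intermediary format, here is a minimal Python sketch that blanks out named columns in CSV text. It is not taken from the toolkit; the function name `redact_csv`, the placeholder text, and the sample data are all made up for the example, and it assumes well-formed rows with a header line. Exporting from Excel to .csv and importing back again would happen outside this snippet.

```python
import csv
import io

REDACTION_MARK = "[REDACTED]"  # hypothetical placeholder text


def redact_csv(src_text, columns_to_redact):
    """Return CSV text with the named columns overwritten in every row.

    Assumes the first line is a header and every row has the same
    number of fields as the header.
    """
    reader = csv.DictReader(io.StringIO(src_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames,
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        for col in columns_to_redact:
            if col in row:
                row[col] = REDACTION_MARK
        writer.writerow(row)
    return out.getvalue()


# Made-up sample standing in for a spreadsheet exported to .csv
sample = "name,ssn,city\nAda,123-45-6789,London\nBob,987-65-4321,Leeds\n"
redacted = redact_csv(sample, ["ssn"])
print(redacted)
```

Because the output is rebuilt cell by cell from plain text, nothing from the source file other than the displayed values is carried over.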

Answer by anarchivist

It depends on the motivation for redaction. Adrian's reference to the UK National Archives' redaction toolkit is great, but in our case we've been curious to see whether there is a way to redact only personally identifiable information in things like databases or textual records.

I haven't tried it personally, but you may want to look at the MITRE Identification Scrubber Toolkit (MIST), which was originally written to redact electronic health records.
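
To make the idea of redacting only personally identifiable information concrete, here is a very small Python sketch that replaces pattern matches in plain text with a label. This is not how MIST works internally and the two patterns are only illustrative assumptions (an email address and a US Social Security number shape); real records would need much broader coverage and human review.

```python
import re

# Illustrative patterns only; both are assumptions made for this example.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def scrub(text):
    """Replace every match of each pattern with a bracketed label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


scrubbed = scrub("Contact jane.doe@example.org, SSN 123-45-6789.")
print(scrubbed)  # Contact [EMAIL], SSN [US_SSN].
```

Note that this only changes the text it is given. Applied to a file format that keeps revision history or metadata, the original values could still survive elsewhere in the file.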

Answer by Euan

The commercial service available from Pingar seems quite effective. It can automatically redact digital documents. A demo is available here.

I am not a representative of theirs, though, and I also recommend searching for alternatives. There are a few out there.

Answer by Jenn Riley

The BitCurator toolkit includes a utility (currently bulk_extractor) for automatic identification of personally identifiable information in documents.

Unfortunately I don't know if these tools would address your concern about previous states of information, but there's lots of information on the BitCurator wiki, so perhaps you'd find the answer there.
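
On the previous-states concern, one crude check you can run yourself is to search the finished file for a string you know was redacted. The Python sketch below is only an illustration, not part of BitCurator: `find_residue` is a made-up name and the in-memory archive is a stand-in for a real file. It looks in the raw bytes and, if the file is a ZIP package (as .xlsx and .docx files are), inside each decompressed member, since compression hides text from a plain byte search.

```python
import io
import zipfile


def find_residue(data, needle):
    """Return the places where the byte string `needle` still occurs.

    Only matches the exact byte encoding given; text stored as UTF-16,
    split across markup, or in another compressed stream is not found.
    """
    hits = []
    if needle in data:
        hits.append("<raw bytes>")
    if zipfile.is_zipfile(io.BytesIO(data)):
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            for name in zf.namelist():
                if needle in zf.read(name):
                    hits.append(name)
    return hits


# Made-up stand-in for a "redacted" package that still carries old data
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("visible.xml", "<c>[REDACTED]</c>")
    zf.writestr("history.xml", "<old>123-45-6789</old>" * 20)

hits = find_residue(buf.getvalue(), b"123-45-6789")
print(hits)
```

An empty result is not proof that the value is gone, only that this particular search did not find it, so treat it as a sanity check alongside the conversion approach rather than a replacement for it.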