What types of data compression tactics have a place in digital preservation practice?
Various tactics for data compression can decrease the cost of long-term
preservation by reducing the amount of storage space required. At the
same time, each compression approach brings its own risks. How should
one weigh the risks and rewards for the following kinds of compression?
Three types of compression to consider (feel free to suggest others as a
comment):
- File compression: using a file compression algorithm suited to the
file type, considering both lossy and lossless compression (see the
sketch after this list)
- Hardware compression: usually compression performed by a tape drive
as the data is written to tape
- Disk compression: performed by many newer storage appliances,
typically using a combination of compression and de-duplication
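As a rough illustration of the fixity question raised by file-level compression, here is a minimal sketch (assuming Python with the standard-library gzip and hashlib modules, and a hypothetical file name) that checks whether a lossless compression round trip reproduces the original bytes. A lossy codec would fail this kind of bit-level check by design, which is exactly the trade-off a preservation policy has to weigh.

```python
import gzip
import hashlib

# Hypothetical file name; any master file in the preservation store would do.
path = "master_copy.tiff"

with open(path, "rb") as f:
    original = f.read()

# Record fixity (a checksum) before compressing.
checksum_before = hashlib.sha256(original).hexdigest()

# Lossless compression: the round trip must reproduce the exact bytes.
compressed = gzip.compress(original)
restored = gzip.decompress(compressed)
checksum_after = hashlib.sha256(restored).hexdigest()

print(f"original size:    {len(original)} bytes")
print(f"compressed size:  {len(compressed)} bytes")
print(f"fixity preserved: {checksum_before == checksum_after}")
```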
Trevor Owens
Comments
- Andy Jackson: Might it also be worth mentioning that file-level compression might be
lossy or lossless in the question?
- Trevor Owens: Good point Andy. I will make that edit.
- Bill Lefurgy: There is also signal compression that takes place in a camera or
other sensor. It might be a variant of the second type (hardware compression).
- Michael Kjörling: I think it's important to also consider the scope of the preservation
effort. Basically, *how much fidelity do you require?* In some cases,
perhaps JPEG or MP3 work because of their relative ubiquity and ability
to preserve what one wants to preserve, but in other uses, they might be
completely inadequate because of their lossy nature, patent encumbrance,
or some other reason. Such distinctions can often be drawn between
lossy and lossless compression, but that difference does not have to be
the defining line in every case.
Answer by Joe Atzberger
You will need further specifications, including:
- use patterns (Do we need to search? What services do we expose based
on this data?),
- performance expectations (When we retrieve files from the system,
how much time should it take? How small is small enough?),
- profiles of data being stored (formats, sizes, volume)
It may actually be the wrong approach to focus on disk- and file-level
representation as something you will personally improve by intervening
with compression and decompression. Many of the best storage platforms
deliberately hide this level of detail from the user, letting the
platform handle complications like replication, availability, indexing,
data integrity, and caching.
All significant compression gains involve some trade-off between read
and write costs. You may introduce performance-limiting processor costs
merely by repeatedly retrieving and decompressing a popular set of files
(without caching), or by repeatedly writing "just one more" file to an
already compressed set. Or worse! For example, attempting to
file-compress rich media that is already encoded in a compressing codec
frequently produces a larger file that now requires additional space and
serial processing.
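To make that last point concrete, here is a minimal sketch (assuming Python's standard-library gzip module; random bytes stand in for media already encoded with a compressing codec, since such data is statistically close to random) showing that a second layer of file compression typically produces a slightly larger file.

```python
import gzip
import os

# Already-encoded media (JPEG, MP3, H.264, ...) is statistically close to
# random data, so random bytes serve here as a rough stand-in; a real test
# would simply gzip an actual media file.
already_compressed = os.urandom(1_000_000)

recompressed = gzip.compress(already_compressed)

print(f"input size:        {len(already_compressed)} bytes")
print(f"after gzip on top: {len(recompressed)} bytes")
# gzip finds no redundancy to exploit, so the output carries the container
# overhead and typically comes out slightly larger than the input.
```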
Comments
- Joe: Read/write costs are actually counter-intuitive: if you're not already CPU
bound, you might actually speed up writes and reads because there's
less disk/tape I/O. You really have to know where you're bottlenecked.
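A rough way to act on that advice is to measure it. The sketch below (a minimal benchmark assuming Python's standard-library gzip and time modules, with a made-up payload and file names) times an uncompressed write against a compress-then-write of the same data; whether the compressed path wins depends entirely on whether the disk/tape or the CPU is the bottleneck on your system.

```python
import gzip
import os
import time

# Hypothetical payload: roughly 50 MB of fairly compressible, text-like data.
payload = (b"digital preservation " * 50) * 50_000

# Time a plain write...
start = time.perf_counter()
with open("plain.bin", "wb") as f:
    f.write(payload)
plain_seconds = time.perf_counter() - start

# ...against a compress-then-write of the same data.
start = time.perf_counter()
with gzip.open("compressed.bin.gz", "wb") as f:
    f.write(payload)
gzip_seconds = time.perf_counter() - start

print(f"uncompressed write: {plain_seconds:.2f} s ({len(payload)} bytes)")
print(f"gzip write:         {gzip_seconds:.2f} s "
      f"({os.path.getsize('compressed.bin.gz')} bytes on disk)")
```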