What types of data compression tactics have a place in digital preservation practice?
Various tactics for data compression can decrease the cost of long-term
preservation by reducing the amount of storage space required. At the
same time, each compression approach brings its own risks. How should
one weigh the risks and rewards for the following kinds of compression?
Three types of compression to consider (feel free to suggest others as a
comment):
- File compression: using a file compression algorithm suited to the
file type, considering both lossy and lossless compression (see the
sketch after this list)
- Hardware compression: usually compression performed by a tape drive
as the data is written to tape
- Disk compression: performed by many newer storage appliances,
typically using a combination of compression and de-duplication
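As a rough illustration of the fixity question raised by file-level compression, here is a minimal sketch (assuming Python with the standard-library gzip and hashlib modules, and a hypothetical file name) that checks whether a lossless compression round trip reproduces the original bytes. A lossy codec would fail this kind of bit-level check by design, which is exactly the trade-off a preservation policy has to weigh.

```python
import gzip
import hashlib

# Hypothetical file name; any master file in the preservation store would do.
path = "master_copy.tiff"

with open(path, "rb") as f:
    original = f.read()

# Record fixity (a checksum) before compressing.
checksum_before = hashlib.sha256(original).hexdigest()

# Lossless compression: the round trip must reproduce the exact bytes.
compressed = gzip.compress(original)
restored = gzip.decompress(compressed)
checksum_after = hashlib.sha256(restored).hexdigest()

print(f"original size:    {len(original)} bytes")
print(f"compressed size:  {len(compressed)} bytes")
print(f"fixity preserved: {checksum_before == checksum_after}")
```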
Trevor Owens
Comments
- Andy Jackson: Might it also be worth mentioning that file-level compression might be
lossy or lossless in the question?
- Trevor Owens: Good point Andy. I will make that edit.
- Bill Lefurgy: There is also signal compression that takes place in a camera or
other sensor. It might be a variant of the second type (hardware compression).
- Michael Kjörling: I think it's important to also consider the scope of the preservation
effort. Basically, *how much fidelity do you require?* In some cases,
perhaps JPEG or MP3 work because of their relative ubiquity and ability
to preserve what one wants to preserve, but in other uses, they might be
completely inadequate because of their lossy nature, patent encumbrance,
or some other reason. Such distinctions can often be drawn between
lossy and lossless compression, but that difference does not have to be
the defining line in every case.
Answer by Joe Atzberger
You will need further specifications, including:
- use patterns (Do we need to search? What services do we expose based
on this data?),
- performance expectations (When we retrieve files from the system,
how much time should it take? How small is small enough?),
- profiles of data being stored (formats, sizes, volume)
It may actually be the wrong approach to focus on disk- and file-level
representation as something you will personally improve by intervening
with compression and decompression. Many of the best storage platforms
deliberately hide this level of detail from the user, letting the
platform handle complications like replication, availability, indexing,
data integrity, and caching.
All significant compression gains involve some trade-off between read
and write costs. You may introduce performance-limiting processor costs
merely by repeatedly retrieving and decompressing a popular set of files
(without caching), or by repeatedly writing "just one more" file to an
already compressed set. Or worse! For example, attempting to
file-compress rich media that is already encoded in a compressing codec
frequently produces a larger file that now requires additional space and
serial processing.
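To make that last point concrete, here is a minimal sketch (assuming Python's standard-library gzip module; random bytes stand in for media already encoded with a compressing codec, since such data is statistically close to random) showing that a second layer of file compression typically produces a slightly larger file.

```python
import gzip
import os

# Already-encoded media (JPEG, MP3, H.264, ...) is statistically close to
# random data, so random bytes serve here as a rough stand-in; a real test
# would simply gzip an actual media file.
already_compressed = os.urandom(1_000_000)

recompressed = gzip.compress(already_compressed)

print(f"input size:        {len(already_compressed)} bytes")
print(f"after gzip on top: {len(recompressed)} bytes")
# gzip finds no redundancy to exploit, so the output carries the container
# overhead and typically comes out slightly larger than the input.
```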
Comments
- Joe: Read/write costs are actually counter-intuitive: if you're not already CPU
bound, you might actually speed up writes and reads because there's
less disk/tape I/O. You really have to know where you're bottlenecked.
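A rough way to act on that advice is to measure it. The sketch below (a minimal benchmark assuming Python's standard-library gzip and time modules, with a made-up payload and file names) times an uncompressed write against a compress-then-write of the same data; whether the compressed path wins depends entirely on whether the disk/tape or the CPU is the bottleneck on your system.

```python
import gzip
import os
import time

# Hypothetical payload: roughly 50 MB of fairly compressible, text-like data.
payload = (b"digital preservation " * 50) * 50_000

# Time a plain write...
start = time.perf_counter()
with open("plain.bin", "wb") as f:
    f.write(payload)
plain_seconds = time.perf_counter() - start

# ...against a compress-then-write of the same data.
start = time.perf_counter()
with gzip.open("compressed.bin.gz", "wb") as f:
    f.write(payload)
gzip_seconds = time.perf_counter() - start

print(f"uncompressed write: {plain_seconds:.2f} s ({len(payload)} bytes)")
print(f"gzip write:         {gzip_seconds:.2f} s "
      f"({os.path.getsize('compressed.bin.gz')} bytes on disk)")
```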