The Zombie Stack Exchanges That Just Won't Die
Given the probabilities of hash value collisions are MD5 hashes sufficient for ensuring file fixity? Or should SHA1 or SHA2 be used? Or, should folks be catching all three. I would be interested in both issues to consider for tamper resistance and for simply knowing if what I have is what I think I have.
Trevor Owens
If you only care about accidental data corruption: Any of those will do.
If there is a possibility of targeted attack: MD5 is broken, SHA1 weakened, SHA2 is doing very well. Note that SHA1 and SHA2 are completely different algorithms.
Also, if you're going to store LOTS of data (e.g. 2^70 files) then you need a hash algorithm which has at least 2X+20 bits due to birthday paradox.\ The X here is log2(number_of_files), so, for example, if the repo has 1 000 000 000 files (about 2^30), 2*30+20=80 bits of (good) hash is necessary to provide sufficient collision protection
Overall, I would use 256-bit SHA2 for next 100+ years.
For tamper resistance, using any two hash algorithms (even if one is broken) increases hack difficulty immensely. So calculating MD5 in addition to sha256 (because it is already widely used) should be more than enough.
MD5 is broken and SHA-1 will be crackable within five years (I especially like the speculation of using criminal organizations harnessing a botnet to crack SHA-1 cheaply.) Because archives help establish the historical records and have often be attacked/manipulated to alter that record, I don't doubt that digital archives will be the target of attacks to remove or alter that record as well.
Checksums will continue to be cracked over time, so digital preservation systems will need a graceful method to calculate new checksums for stored objects and store them in their database alongside past checksums. Retaining hashes from a variety of algorithms will also complicate the task of crafting a collision file.