A hash is cryptographic algorithm that attempts to uniquely describe inputted content by outputting a value that is unique for a given piece of inputted content. A good hash algorithm has several characteristics, including:
* It will output the same hash result for the same exact input, every time.
* It will output a different hash result for a different input.
* Given only the hash output result, it should be non-trivial to impossible to recreate the input that generated the output.
If two different inputs create the same output, it's called a hash collision. In 2004 and 2005, Chinese Professor Xiaoyun Wang and other colleagues demonstrated that the hash algorithms MD5 and SHA-1 had weaknesses that allowed previously unexpected collisions at iteration rounds deemed cryptographically weak. It sent a minor shockwave through the cryptographic community and nearly everyone else paying attention.
[ RogerGrimes's column is now a blog! Get the latest IT security news from the Security Adviser blog. ]
MD5 and SHA-1, although not alone in a crowded hash field, are easily the most used cryptographic algorithms used to compare two pieces of stand-alone content in the PC world. Name a popular operating system (e.g. Windows, Macintosh, Linux, Unix, Solaris, BSD, etc.) and you'll find it full of MD5 and SHA-1 cryptographic comparisons. For example SHA-1 is used in Apple's OS for password hashing.
It sent a shockwave, but it didn't freak out the cryptographic community. First, a community full of double mathematical doctorates is a staid crowd to begin with. Second, being cryptographers, they're predicting this sort of stuff all the time. The discovery of the MD5 and SHA1 collisions was just a confirmation of something cryptographers repeat in every statement, about "...no algorithm can be proved to be secure," and other pronouncements like that. Plus, there had been other previous MD5 and SHA-1 findings that pointed to potential collision weaknesses.
The cryptographic community and NIST responded by telling vendors and users to use better and stronger hash algorithms, including bigger variants of SHA-1 called SHA-256 and SHA-512. The increase in bit size theoretically makes the collisions more difficult to generate (although, interestingly, the July 2007 draft recommendation still includes the smaller 160-bit SHA-1 algorithm).
Many vendors (i.e. any with common sense) have started to abandon MD-5 and SHA-1. For example, Microsoft made a formal announcement during the Windows Vista beta program to begin removing all instances of MD-5 and SHA-1 from Windows. Although MD-5 and SHA-1 remnants remain in use, many cryptographic functions have been upgraded, and what remains is on the chopping block for future Windows versions. Most other popular vendors are doing, or have plans to do, the same.
One of the biggest fear scenarios from a hash collision is that two programs, one legitimate and one malicious, could end up with the same hash output value (i.e. the bogus program could be substituted for the legitimate one with no one the wiser). But in the real world, it would be non-trivial for an attacker to find a legitimate target program and then set about making a malicious clone that would produce precisely the same hash value. Making a second program with the same hash is hard enough, but making a second malicious program that mimics the legitimate program well enough that people run it while it performs malicious instructions the attacker wants instead is orders of magnitude harder to pull off. Even I, a crypto-hobbyist, not a crypto expert, felt like the theoretical collisions of MD5 and SHA-1 were more of a mathematical issue and not a real problem.
Maybe there's more to this...
Then come along a few demonstrations that make me realize that I'm just a crypto hobbyist and I should really just stay out of making cryptographic pronouncements. One site that reminded of this is http://www.win.tue.nl/hashclash/Nostradamus. I've seen the demonstration before, but this site is the most recent one I've come across and it explains the involved issues pretty well, and does so with humor.
Essentially, the site contains 12 different PDF documents, each with different information, but with each having the same MD5 hash. This demo makes the topic of hash collisions more relevant by "predicting" which Democratic or Republican candidate will win the Presidency of the United States in 2008. To do so, they created 12 different documents, each one with a different candidate's name. Each of the 12 documents has the same hash result, along with the same file size and checksum, of course. You can check their results with any MD5 hash checker, although one of my favorite Windows programs is DigestIT 2004 because it makes generating hashes a right-click action on any file.
The hash collisions can be created because of a combination of MD5 hash weakness and the way certain document types construct and hide content. Part of the weakness is in the hash algorithm, which doesn't consider all content bits when constructing a particular output.
Of course, in this demonstration the content is just different names. No harm, no foul. But it is a ready example of how hash collisions can be performed with forethought and end up creating content that is different than the original. There have been many other demonstrations, including ones done with asymmetric keys and executables.
The next time you hear someone say that theoretical hash collisions aren't a big deal, send them to a web site that shows the practical realities -- and then ask them who's going to win the next election.