MD5? SHA1? – Some facts and mis-conceptions about the checksum value

A checksum is mathematically calculated value that is used to detect data integrity. There are a few well known checksum algorithms in common use, Cyclic Redundancy Check (CRC), Message Digest 5 (MD5), and Secure Hash Algorithm 1 (SHA-1). While there are more than these three checksum algorithms, let’s just focus on these three for the moment.

Checksum algorithms take digital data and spit out a number. For example, let’s calculate the checksum value for the work “Hello” using the CRC algorithm. Using a simple Linux system, we can generate a checksum of the word “Hello” using the following command.

$ echo “Hello” | sum
36978 1

(In the above, the 36978 is the checksum value, and the “1” is the size of the input in blocks. We can ignore the trailing one.) If we change the capital H to a lowercase h and recalculate the checksum value, we will get a different result.

$ echo “hello” | sum
36979 1

Let’s add a space to the end of the input.

$ echo “Hello ” | sum
18510 1

This is what makes the checksum valuable. The output value is different when the input is different. A good checksum algorithm will produce the same value on the same input, and different values on different input. And, it will produce the same value on the same input on any computer.

The table below shows the checksum values for the three different variations of the word hello calculated with the three different algorithms.

Sample checksum values for CRC, MD5 and SHA1

“Hello” “hello” “Hello “ (with space)
CRC 36978 36979 18510
MD5 09f7e02f1290be211da707a266f153b3 b1946ac92492d2347c6235b4d2611184 adb3f07f896745a101145fc3c1c7b2ea
SHA1 1d229271928d3f9e2bb0375bd6ce5db6c6d348d9 f572d396fae9206628714fb2ce00f72e94f2258f a83f9352aa642ceec0a03b126e453a5984cf68ab

Notice that the checksum values for the same word are different when using a different algorithm. CRC does not produce the same value on the same input as MD5. And, MD5 produces a different value than SHA1. This means that you cannot use MD5 to verify a checksum value calculated with SHA1.

Myth – Knowing the checksum value, I can regenerate the input.

Checksum values are not easily reversible because the checksum algorithm throws away information during the calculation. Because of this, the checksum value of “36978” can’t be converted back into “Hello”, because Hello is one of many different possible inputs that could create that value. This leads to another myth…

Myth- A good checksum algorithm prevents collision.

A checksum collision happens is when two different values return the same checksum value. For example, the CRC checksum value for “Hello” is 36978, as is the CRC checksum value for “Jdll0”. (With the CRC algorithm, a collision can be easily generated by lowering the first letter of the word by one character while raising the second letter by one character.) A checksum collision is always possible no matter how good the checksum algorithm is. This is because a checksum has to take a file of some arbitrary size and reduce it to a number. A good checksum algorithm will just make it difficult to predictable manipulate the input to create a known hash value. MD5 and SHA1, since they are cryptographic hash functions, make it more difficult to manipulate input to produce a predictable checksum value.

Myth – A checksum value can be used to prove that data has been read correctly.

Since checksums can be used to detect alterations in digital input, they can be very useful in computer forensics. Checksum values can help to establish a very low probability of alteration of digital evidence once it has been captured. A checksum is extremely effective when it is declared after the acquisition of electronic evidence. The declaration of the checksum should be printed or otherwise stored to prevent potential alteration or tampering. Should the checksum of the evidence later be found to not match the declared checksum, there is a possibility that the evidence or evidence container has been altered. (Note carefully that it is possible, not definite.) Factors such as disk errors and errors in the checksum implementation can also result in a checksum mismatch.
What a checksum cannot do is prove that the correct digital evidence was acquired. Here is an example to consider. My company makes forensic imagers, and forensic imagers undergo validation testing by neutral third parties. Basically, these third parties are checking that the product does not alter the data that it was copying and that it copies all of the data.

During a couple of the different validations by different groups, we were contacted by the testers. The testers had told us that they had noticed that the checksum values that we produced were occasionally different than the ones that they produced using their equipment. Follow-up up investigation revealed that the checksums were indeed different, and in all cases it was because our system was capturing more disk data than their test system was. Well, that is good news for us, but why were they different? It turns out that when you capture more data, you need to run the additional data into the checksum algorithm. That, in turn, changed the checksum value and led to the difference.

This highlights that the checksum algorithm cannot be used to determine if the original disk drive was read correctly. As happened here, the validation team’s checksum values had matched when they did not read all of the data. The checksum value was very useful to determine that the source disk was not modified, but was not useful in determining that the source disk was read completely.