October 8, 2018
Code & Design
A generic hash function is a special type of programming function that is used to map data of arbitrary size to data of a fixed size. Hash functions originated from a need to compress data in order to reduce the amount of memory required to store large files. The most popular use case for a hash function is for another specific data structure called a hash table, which is widely used for rapid data lookup. Hash functions help speed up a table or database lookup by detecting any two exact same hashes; they also help minify tags for enormous files such as mp3s, PDFs or images in order to make working with these rather large file types manageable. For rapid identification, a key requirement of hash functions is that they output a fixed-length string of alphanumeric characters.
While the core reason for the inception of a hash function came from a need compress content, a secondary benefit soon became a staple of hashing: singularly-unique identifiers. Ideally, when hashing multiple messages, no two different messages should ever return the same hash. Two different hashed messages resulting in the same output hash is called a collision. From a database management perspective, this would mean that two different objects end up being stored in the same cell — no good if one is looking to define singularly-unique identifiers. If we consider a hash function with infinite inputs (meaning we can hash any string), we can derive just exactly why collisions are in fact unavoidable.
Within cryptography math there exists a concept called the pigeonhole principle which states that if we fit (n) elements into (m) spaces where n > m, then, by principle, there exists at least one space (m) occupied by more than two elements (n).
For example, five individuals check their coats in one of three available coat cubbies. By the pigeonhole principle, since the number of coats stored (n) is greater than the available cubbies (m), it’s guaranteed that at least one cubby holds more than a single coat.
Typically software engineers are interested in hash functions with an infinite domain (that’s to say they take as input strings of all possible lengths) and a finite range. Again following the pigeonhole principle, since our range (n) is smaller than our domain (m), it follows that at least one collision must exist. An effective hash function therefore only looks to minimize the number of collisions — why this matters will become more clear in a moment, but for now, let’s go return to the history of hashes.
While hash functions originated strictly from database upkeep & management needs, which favored speed above all, their utility quickly evolved. A special branch of hash functions which favored privacy, security, & transparency soon entered the field; a branch of has functions that will remain the focus of this article: cryptographic hashing functions.
Cryptographic hashing functions, as the name aptly implies, favor the preservation of absolutely undisrupted messages. While minimizing collisions is good practice for other hash functions, for specifically cryptographic functions, minimizing collisions is a requirement. Instead of maximizing the utility for a rapid database or table lookup scenario, constructing cryptographic hashing functions are built with an adversarial scenario in mind: one in which a code-breaker (cryptoanalyst) is actively trying to cause a collision. We’ll now go define standard hash function notations & establish hash function principles within the cryptography perspective.
A generic cryptographic hash function has two inputs: the message it’s going to compress or hash ( x) & a public key (s) that represents the fixed-length output of our hash in alphanumeric characters. Our hashed result is termed the message digest or simply digest (x*). This looks like the following:
H(s,x) = x*
Let’s gloss over this notation by walking through a real-life example hashing a string using a previously-standard hashing function named MD5. Say we want to use MD5 to hash a “Hello World!” string. We also know that by default MD5 always outputs a string of 128 bits (0’s & 1's). This notation would look as follows:
H(128, x) = ed076287532e86365e841e92bfc50d8c
In fact, if you go ahead & try providing the MD5 hash function “Hello World!” yourself you should see the exact same resulting hash. Awesome. Now let’s move forward to setting the notation for a collision; in addition to previous variables H, s, x, & x* we now introduce a second message (x’). Numerically, a collision occurs when hashing two distinct messages (x & x’) results in the exact same message digest (x*):
If H(128,x) = H(128,x’), our hash function (H) is said to have a collision at x & x’.
We’ve now set the notation for the current hallmark standard of hash function cryptography; if an adversary can feasibly (computationally-speaking) cause a collision, a hash function is no longer considered practically secure.
The last mathematical definition is where the fascinating catch-22 for hash function practicality lives. Hash functions originated from a need to compress & output standardized uniform data for storage convenience, which means they spit out pseudorandom strings of a fixed length. Yet in yet in order to create a completely collision-resistant hash function, every single message (x) would have to have a hashed output of the same length as the input. Without hashes of a fixed length, we lose our ability to use as a convenient data structure, yet by assigning a fixed length, we render our hash function not completely collision-free.
PS — I’m sure some of you smart cookies noticed that in our MD5 example we notated a hash function that returns a string of length 128, yet our “Hello World!” hash returned a 32 alphanumeric character string. Come back next time & we’ll go deeper hash functions to explain where this difference stemmed.
References