Hashing
As a programmer, you use hash functions every day. They're used in databases to optimise queries, they're used in data structures to make things faster, they're used in security to keep data safe. Almost every interaction you have with technology will involve hash functions in one way or another.
Hash functions are foundational, and they are everywhere.
But what is a hash function, and how does it work?
In this post, we're going to demystify hash functions. We're going to start by looking at a simple hash function, then we're going to learn how to test if a hash function is good or not, and then we're going to look at a real-world use of hash functions: the hash map.
This article has visualisations that can be clicked.
# What is a hash function?
Hash functions are functions that take an input, usually a string, and produce a number. If you call a hash function multiple times with the same input, it will always return the same number, and that number will always be within a promised range. What that range is depends on the hash function: some use 32-bit integers (so 0 to about 4 billion), others go much larger.
If we were to write a dummy hash function in JavaScript, it might look like this:
function hash(input) {
  return 0;
}
Even without knowing how hash functions are used, it's probably no surprise that this hash function is useless. Let's see how we can measure how good a hash function is, and after that we'll do a deep dive on how they're used within hash maps.
# What makes a hash function good?
Because input can be any string, but the number returned is within some promised range, it's possible that two different inputs can return the same number. This is called a "collision," and good hash functions try to minimise how many collisions they produce.
It's not possible to completely eliminate collisions, though. If we wrote a hash function that returned a number in the range 0 to 7, and we gave it 9 unique inputs, we'd be guaranteed at least 1 collision.
hash("to") == 3
hash("the") == 2
hash("café") == 0
hash("de") == 6
hash("versailles") == 4
hash("for") == 5
hash("coffee") == 0
hash("we") == 7
hash("had") == 1
Output values from a well-known hash function, modulo 8. No matter what 9 values we pass, there are only 8 unique numbers and so collisions are inevitable. The goal is to have as few as possible.
To visualise collisions, I'm going to use a grid. Each square of the grid is going to represent a number output by a hash function. Here's an example 8x2 grid. Click on the grid to increment the example hash output value and see how we map it to a grid square. See what happens when you get a number larger than the number of grid squares.
13 % 16 == 13
Every time we hash a value, we're going to make its corresponding square on the grid a bit darker. The idea is to create an easy way to see how well a hash function avoids collisions. What we're looking for is a nice, even distribution. We'll know that the hash function isn't good if we have clumps or patterns of dark squares.
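If you prefer to see that in code, here's a rough sketch of the bookkeeping behind the grid. The `hashFn` parameter is a stand-in for whichever hash function we're testing; we haven't met a real one yet, but we will shortly.

```javascript
// A sketch of the grid bookkeeping: hash each input, map the hash onto one
// of the grid's squares with modulo, and count the hits. The more hits a
// square gets, the darker we draw it.
const GRID_SQUARES = 8 * 2; // the example 8x2 grid above
let hits = new Array(GRID_SQUARES).fill(0);

function record(hashFn, input) {
  let square = hashFn(input) % GRID_SQUARES;
  hits[square] += 1;
}
```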
You said that when a hash function outputs the same value for 2 different inputs, that's a collision. But if we have a hash function that outputs values in a big range, and we mapped those to a small grid, aren't we going to create lots of collisions on the grid that aren't actually collisions? On our 8x2 grid, 1 and 17 both map to the 2nd square.
This is a great observation. You're absolutely right, we're going to be creating "pseudo-collisions" on our grid. It's okay, though, because if the hash function is good we will still see an even distribution. Incrementing every square by 100 is just as good a distribution as incrementing every square by 1. If we have a bad hash function that collides a lot, that will still stand out. We'll see this shortly.
Let's take a larger grid and hash 1,000 randomly-generated strings. You can click on the grid to hash a new set of random inputs, and the grid will animate to show you each input being hashed and placed on the grid.
The values are nice and evenly distributed because we're using a good, well-known hash function called murmur3. This hash is widely used in the real world because it has great distribution while also being really, really fast.
What would our grid look like if we used a bad hash function?
function hash(input) {
  let hash = 0;
  for (let c of input) {
    hash += c.charCodeAt(0);
  }
  return hash % 1000000;
}
This hash function loops through the string that we're given and sums the numeric values of each character. It then keeps the value between 0 and 999,999 using the modulo operator (%). Let's call this hash function stringSum.
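Before putting it on the grid, one thing worth noticing: because stringSum only adds character codes together, the order of the characters doesn't matter at all. Any two anagrams collide, no brute force required.

```javascript
// stringSum ignores character order, so anagrams always collide.
stringSum("listen"); // 655
stringSum("silent"); // 655
stringSum("enlist"); // 655
```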
Here it is on the grid. Reminder, this is 1,000 randomly generated strings that we're hashing.
This doesn't look all that different from murmur3. What gives?
The problem is that the strings we're giving to be hashed are random. Let's see how each function performs when given input that is not random: the numbers from 1 to 1000 converted to strings.
Now the problem is clearer. When the input isn't random, the output of stringSum forms a pattern. Our murmur3 grid, however, looks the same as it did with random values.
How about if we hash the top 1,000 most common English words:
It's more subtle, but we do see a pattern on the stringSum grid. As usual, murmur3 looks the same as it always does.
This is the power of a good hash function: no matter the input, the output is evenly distributed. Let's talk about one more way to visualise this and then talk about why it matters.
# The avalanche effect
Another way hash functions get evaluated is on something called the "avalanche effect." This refers to how many bits in the output value change when just a single bit of the input changes. For a hash function to have a good avalanche effect, a single bit flip in the input should result in an average of 50% of the output bits flipping.
It's this property that helps hash functions avoid forming patterns in the grid. If small changes in the input result in small changes in the output, you get patterns. Patterns indicate poor distribution, and a higher rate of collisions.
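If you'd like to measure the avalanche effect yourself, here's a hedged sketch. It treats the hash as a function from a 32-bit integer to a 32-bit integer; `hash32` is a hypothetical stand-in for whatever function you want to test. It flips each input bit in turn and averages how many output bits change.

```javascript
// Count how many bits differ between two 32-bit values.
function bitsChanged(a, b) {
  let diff = (a ^ b) >>> 0;
  let count = 0;
  while (diff !== 0) {
    count += diff & 1;
    diff >>>= 1;
  }
  return count;
}

// Flip every bit of `samples` random 32-bit inputs and measure how many of
// the 32 output bits change on average. A good hash averages close to 16.
function averageAvalanche(hash32, samples = 1000) {
  let total = 0;
  let flips = 0;
  for (let i = 0; i < samples; i++) {
    let input = Math.floor(Math.random() * 2 ** 32) >>> 0;
    let before = hash32(input);
    for (let bit = 0; bit < 32; bit++) {
      let after = hash32((input ^ (1 << bit)) >>> 0);
      total += bitsChanged(before, after);
      flips += 1;
    }
  }
  return total / flips;
}
```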
Below, we are visualising the avalanche effect by showing two 8-bit binary numbers. The top number is the input value, and the bottom number is the murmur3 output value. Click on it to flip a single bit in the input. Bits that change in the output will be green, bits that stay the same will be red.
murmur3 does well, though you will notice that sometimes fewer than 50% of the bits flip and sometimes more. This is okay, provided that it is 50% on average.
Let's see how stringSum performs.
Well, this is embarrassing. The output is equal to the input, and so only a single bit flips each time. This does make sense, because stringSum just sums the numeric value of each character in the string. This example only hashes the equivalent of a single character, which means the output will always be the same as the input.
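You can see this directly. For a one-character string, stringSum's output is just that character's code, so the "hash" is the identity function on those bits.

```javascript
// For one-character inputs, stringSum just returns the character's code.
stringSum("a"); // 97, which is 0b01100001
stringSum("c"); // 99, which is 0b01100011: one input bit flipped, one output bit flipped
```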
# Why all of this matters
We've taken the time to understand some of the ways to determine if a hash function is good, but we've not spent any time talking about why it matters. Let's fix that by talking about hash maps.
To understand hash maps, we first must understand what a map is. A map is a data structure that allows you to store key-value pairs. Here's an example in JavaScript:
let map = new Map();
map.set("hello", "world");
console.log(map.get("hello")); // world
Here we take a key-value pair ("hello" → "world") and store it in the map. Then we print out the value associated with the key "hello", which will be "world".
A more fun real-world use-case would be to find anagrams. An anagram is when two different words contain the same letters, for example "antlers" and "rentals" or "article" and "recital." If you have a list of words and you want to find all of the anagrams, you can sort the letters in each word alphabetically and use that as a key in a map.
let words = [
  "antlers",
  "rentals",
  "sternal",
  "article",
  "recital",
  "flamboyant",
];

let map = new Map();

for (let word of words) {
  let key = word.split("").sort().join("");
  if (!map.has(key)) {
    map.set(key, []);
  }
  map.get(key).push(word);
}
This code results in a map with the following structure:
{
  "aelnrst": ["antlers", "rentals", "sternal"],
  "aceilrt": ["article", "recital"],
  "aabflmnoty": ["flamboyant"]
}
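With the map built, finding the anagrams of any word is just a lookup: sort its letters to recreate the key, then ask the map. The anagramsOf helper below is just a name I'm using for illustration.

```javascript
// Sort a word's letters to recreate its key, then look it up in the map.
function anagramsOf(word) {
  let key = word.split("").sort().join("");
  return map.get(key) || [];
}

anagramsOf("rentals"); // ["antlers", "rentals", "sternal"]
```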
# Implementing our own simple hash map
Hash maps are one of many map implementations, and there are many ways to implement hash maps. The simplest way, and the way we're going to demonstrate, is to use a list of lists. The inner lists are often referred to as "buckets" in the real world, so that's what we'll call them here. A hash function is used on the key to determine which bucket to store the key-value pair in, then the key-value pair is added to that bucket.
Let's walk through a simple hash map implementation in JavaScript. We're going to go through it bottom-up, so we'll see some utility methods before getting to the set and get implementations.
class HashMap {
  constructor() {
    this.bs = [[], [], []];
  }
}
We start off by creating a HashMap class with a constructor that sets up 3 buckets. We use 3 buckets and the short variable name bs so that this code displays nicely on devices with smaller screens. In reality, you could have however many buckets you want (and better variable names).
class HashMap {
  // ...
  bucket(key) {
    let h = murmur3(key);
    return this.bs[h % this.bs.length];
  }
}
The bucket method uses murmur3 on the key passed in to find a bucket to use. This is the only place in our hash map code that a hash function is used.
class HashMap {
  // ...
  entry(bucket, key) {
    for (let e of bucket) {
      if (e.key === key) {
        return e;
      }
    }
    return null;
  }
}
The entry method takes a bucket and a key and scans the bucket until it finds an entry with the given key. If no entry is found, null is returned.
class HashMap {
  // ...
  set(key, value) {
    let b = this.bucket(key);
    let e = this.entry(b, key);
    if (e) {
      e.value = value;
      return;
    }
    b.push({ key, value });
  }
}
The set method is the first one we should recognise from our earlier JavaScript Map examples. It takes a key-value pair and stores it in our hash map. It does this by using the bucket and entry methods we created earlier. If an entry is found, its value is overwritten. If no entry is found, the key-value pair is added to the map. In JavaScript, { key, value } is shorthand for { key: key, value: value }.
class HashMap {
  // ...
  get(key) {
    let b = this.bucket(key);
    let e = this.entry(b, key);
    if (e) {
      return e.value;
    }
    return null;
  }
}
The get method is very similar to set. It uses bucket and entry to find the entry related to the key passed in, just like set does. If an entry is found, its value is returned. If one isn't found, null is returned.
That was quite a lot of code. What you should take away from it is that our hash map is a list of lists, and a hash function is used to know which of the lists to store and retrieve a given key from.
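Putting it all together, our HashMap behaves just like the built-in Map from earlier (assuming a murmur3 function is available for the bucket method to call):

```javascript
let map = new HashMap();
map.set("hello", "world");
console.log(map.get("hello"));   // world
console.log(map.get("goodbye")); // null, no entry with that key
```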
Here's a visual representation of this hash map in action. Click anywhere on the buckets to add a new key-value pair using our set method. To keep the visualisation simple, if a bucket were to "overflow", the buckets are all reset.
Because we're using murmur3 as our hash function, you should see good distribution between the buckets. It's expected you'll see some imbalance, but it should generally be quite even.
To get a value out of the hash map, we first hash the key to figure out which bucket the value will be in. Then we have to compare the key we're searching for against all of the keys in the bucket.
It's this search step that we minimise through hashing, and why murmur3 is optimised for speed. The faster the hash function, the faster we find the right bucket to search, the faster our hash map is overall.
This is also why reducing collisions is so crucial. If we decided to use that dummy hash function from the start of this article, the one that always returns 0, we'd put all of our key-value pairs into the first bucket. Finding anything could mean we have to check all of the values in the hash map. With a good hash function, with good distribution, we reduce the amount of searching we have to do to 1/N, where N is the number of buckets. With our 3 buckets, that means only ever searching roughly a third of the entries.
Let's see how stringSum does.
Interestingly, stringSum seems to distribute values quite well here. You can see a pattern, but the overall distribution looks good.
Finally! A win for stringSum. I knew it would be good for something.
Not so fast, Haskie. We need to talk about a serious problem. The distribution looks okay on these sequential numbers, but we've seen that stringSum doesn't have a good avalanche effect. This doesn't end well.
# Real-world collisions
Let's look at 2 real-world data sets: IP addresses and English words. What I'm going to do is take 100,000,000 random IP addresses and 466,550 English words, hash all of them with both murmur3 and stringSum, and see how many collisions we get.
IP Addresses

|  | murmur3 | stringSum |
| --- | --- | --- |
| Collisions | 1,156,959 | 99,999,566 |
| Percentage | 1.157% | 99.999% |
English words

|  | murmur3 | stringSum |
| --- | --- | --- |
| Collisions | 25 | 464,220 |
| Percentage | 0.005% | 99.5% |
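If you want to reproduce numbers like these, counting collisions is straightforward: hash every input, remember which hash values you've seen, and count the repeats. A rough sketch, with the hash function passed in as a parameter:

```javascript
// Count how many inputs hash to a value we've already seen.
function countCollisions(inputs, hashFn) {
  let seen = new Set();
  let collisions = 0;
  for (let input of inputs) {
    let h = hashFn(input);
    if (seen.has(h)) {
      collisions += 1;
    } else {
      seen.add(h);
    }
  }
  return collisions;
}

// e.g. countCollisions(words, murmur3) vs countCollisions(words, stringSum)
```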
When we use hash maps for real, we aren't usually storing random values in them. We can imagine counting the number of times we've seen an IP address in rate limiting code for a server. Or code that counts the occurrences of words in books throughout history to track their origin and popularity. stringSum sucks for these applications because of its extremely high collision rate.
# Manufactured collisions
Now it's murmur3's turn for some bad news. It's not just collisions caused by similarity in the input we have to worry about. Check this out.
What's happening here? Why do all of these gibberish strings hash to the same number?
I hashed 141 trillion random strings to find values that hash to the number 1228476406 when using murmur3. Hash functions have to always return the same output for a specific input, so it's possible to find collisions by brute force.
I'm sorry, 141 trillion? Like... 141 and then 12 zeroes?
Yes, and it only took me 25 minutes. Computers are fast.
Bad actors having easy access to collisions can be devastating if your software builds hash maps out of user input. Take HTTP headers, for example. An HTTP request looks like this:
GET / HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: google.com
You don't have to understand all of the words, just that the first line is the path being requested and all of the other lines are headers. Headers are Key: Value pairs, so HTTP servers tend to use maps to store them. Nothing stops us from passing any headers we want, so we can be really mean and pass headers we know will cause collisions. This can significantly slow down the server.
This isn't theoretical, either. If you search "HashDoS" you'll find a lot more examples of this. It was a really big deal in the mid-2000s.
There are a few ways to mitigate this specific to HTTP servers: ignoring gibberish header keys and limiting the number of headers you store, for example. But modern hash functions like murmur3 offer a more generalised solution: randomisation.
Earlier in this post we showed some examples of hash function implementations. Those implementations took a single argument: input. Lots of modern hash functions take a 2nd parameter: seed (sometimes called salt). In the case of murmur3, this seed is a number.
So far, we've been using 0 as the seed. Let's see what happens with the collisions I've collected when we use a seed of 1.
Just like that, 0 to 1, the collisions are gone. This is the purpose of the seed: it randomises the output of the hash function in an unpredictable way. How it achieves this is beyond the scope of this article; every hash function does it in its own way.
The hash function still returns the same output for the same input; it's just that what gets hashed is a combination of input and seed. Things that collide with one seed shouldn't collide when using another. Programming languages often generate a random number to use as the seed when the process starts, so that every time you run your program the seed is different. As a bad actor who doesn't know the seed, I now have no way to reliably cause harm.
If you look closely at the above visualisation and the one before it, they're the same values being hashed but they produce different hash values. The implication of this is that if you hash a value with one seed, and want to be able to compare against it in the future, you need to make sure you use the same seed.
Having different values for different seeds doesn't affect the hash map use-case, because hash maps only live for as long as the program is running. Provided you use the same seed for the lifetime of the program, your hash maps will continue to work just fine. If you ever store hash values outside of your program, in a file for example, you need to be careful that you know what seed has been used.
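Tying this back to our HashMap, here's a hedged sketch of how it could be seeded. SeededHashMap is a hypothetical class added for illustration, and it assumes a two-argument murmur3(key, seed) is available.

```javascript
// A sketch of a seeded hash map: pick a random seed when the map is created
// and mix it into every hash. An attacker who doesn't know the seed can no
// longer precompute keys that all land in the same bucket.
class SeededHashMap extends HashMap {
  constructor() {
    super();
    this.seed = Math.floor(Math.random() * 2 ** 32); // different every run
  }

  bucket(key) {
    let h = murmur3(key, this.seed); // assumes murmur3 accepts a seed argument
    return this.bs[h % this.bs.length];
  }
}
```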
# Playground
As is tradition, I've made a playground for you to write your own hash functions and see them visualised with the grids seen in this article. Click here to try it!
# Conclusion
We've covered what a hash function is, some ways to measure how good it is, what happens when it's not good, and some of the ways they can be broken by bad actors.
The universe of hash functions is a large one, and we've really only scratched the surface in this post. We haven't spoken about cryptographic vs non-cryptographic hashing, we've touched on only 1 of the thousands of use-cases for hash functions, and we haven't talked about how exactly modern hash functions actually work.
Some further reading I recommend if you're really enthusiastic about this topic and want to learn more:
- https://github.com/rurban/smhasher this repository is the gold standard for testing how good hash functions are. They run a tonne of tests against a wide range of hash functions and present the results in a big table. It will be difficult to understand what all of the tests are for, but this is where the state of the art of hash testing lives.
- https://djhworld.github.io/hyperloglog/ this is an interactive piece on a data structure called HyperLogLog. It's used to efficiently count the number of unique elements in very, very large sets. It uses hashing to do it in a really clever way.
- https://www.gnu.org/software/gperf/ is a piece of software that, when given the expected set of things you want to hash, can generate a "perfect" hash function automatically.
Feel free to join the discussion on Hacker News!
# Acknowledgements
Thanks to everyone who read early drafts and provided invaluable feedback.
And everyone who helped me find murmur3 hash collisions.
# Patreon
After the success of Load Balancing and Memory Allocation, I have decided to set up a Patreon page: https://patreon.com/samwho. For all of these articles going forward, I am going to post a Patreon-exclusive behind-the-scenes post talking about decisions, difficulties, and lessons learned from each post. It will give you a deep look into how these articles evolve, and I'm really stoked about the one I've written for this one.
If you enjoy my writing, and want to support it going forward, I'd really appreciate you becoming a patron. ❤️