🚀 Phases 1–4 are live — Days 1–13 + Day 17 are fully written. See the roadmap →

Collisions & Load Factor

Two keys hash to the same bucket. Now what?

There are two main strategies — and most stdlib hash maps use one of them, occasionally switching between them based on the situation.

Strategy 1: Separate Chaining

Each bucket holds a linked list (or a small dynamic array) of all the entries that hash to it. On insert, append to the list. On lookup, walk the list and compare keys.

Chaining at high load factor — multiple entries per bucket
[0]
→
apple:1
fig:5
[1]
→
kiwi:7
[2]
→
grape:6
[3]
→
cherry:3
[4]
→
banana:2
date:4
7 entries across 5 buckets — load factor 1.40

Code: a tiny hash map with chaining

Pros & cons of chaining

ProsCons
Simple to implement and reason aboutExtra memory per entry (the list node overhead)
Degrades gracefully under high loadCache-unfriendly — buckets are scattered
Load factor can safely exceed 1.0Worst case is O(n) if everything chains

Java’s HashMap uses chaining (and famously upgrades long chains to red-black trees once they exceed 8 entries).

Strategy 2: Open Addressing

Instead of a list per bucket, everything lives in the main array. On collision, probe for the next available slot.

The simplest variant — linear probing — works like this:

put(k, v):
    i = hash(k) % M
    while buckets[i] is occupied with a different key:
        i = (i + 1) % M
    buckets[i] = (k, v)

To look up, hash to the starting slot, then walk forward until you either find the key (return its value) or hit an empty slot (key not present).

                     [   ][ A:1 ][ B:2 ][ C:3 ][   ][   ][   ]
hash("D") → 2 (taken)
hash("D") → 3 (taken)
hash("D") → 4 (empty!)
                     [   ][ A:1 ][ B:2 ][ C:3 ][ D:4 ][   ][   ]

Pros & cons of open addressing

ProsCons
Excellent cache locality — everything in one arrayLoad factor must stay well below 1.0 (~0.5–0.75)
Lower memory overhead — no list-node pointersDeletion is tricky (need tombstones)
Faster in practice when sized correctlyClustering can hurt performance

Variants try to spread probes better:

  • Quadratic probing — try i + 1, i + 4, i + 9, …. Better distribution than linear.
  • Double hashing — use a second hash function to determine the probe step. Best distribution; used by Python’s dict.

Python’s dict and C++ unordered_map use open addressing variants. Most modern hash table implementations have moved this way.

The load factor

The load factor α = n / M (entries / buckets) is the single number that controls how fast a hash table is.

  • α small (< 0.5): lots of empty buckets, hash collisions rare, operations close to O(1) — wastes memory.
  • α around 0.5–0.75: sweet spot for most implementations.
  • α near 1.0 with chaining: still works but slows down — buckets get long.
  • α near 1.0 with open addressing: catastrophic — every probe might need to scan most of the table.

When α exceeds a threshold (typically 0.75), the hash table resizes: allocate a new array of double the size, then rehash every existing entry into the new array using the new modulus.

Old: 8 buckets, 6 entries — α = 0.75 → resize!
New: 16 buckets, 6 entries — α = 0.375 → roomy again

Resizing is O(n) work. But it happens roughly every time the table doubles in size, so it’s amortized to O(1) per insert:

  • After resizing to size 2M, the next resize needs n = 2M × 0.75 = 1.5M new entries.
  • We do M work for the resize, but we add 1.5M − 0.75M = 0.75M new entries between resizes.
  • Cost per insert: O(M) ÷ O(M new inserts) = O(1) amortized.

This is the same amortized argument as the dynamic array (vector, list, ArrayList) doubling strategy from Day 2. A rare expensive operation, spread across many cheap ones, averages out to O(1) per operation.

What if you implement a HashMap from scratch?

A common interview question. The skeleton:

class MyHashMap:
    def __init__(self):
        self.cap = 8
        self.n = 0
        self.buckets = [[] for _ in range(self.cap)]
 
    def _idx(self, key):
        return hash(key) % self.cap
 
    def _resize(self):
        old = self.buckets
        self.cap *= 2
        self.buckets = [[] for _ in range(self.cap)]
        self.n = 0
        for bucket in old:
            for k, v in bucket:
                self.put(k, v)        # re-insert under new capacity
 
    def put(self, key, value):
        bucket = self.buckets[self._idx(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value); return
        bucket.append((key, value))
        self.n += 1
        if self.n / self.cap > 0.75:
            self._resize()
 
    def get(self, key, default=None):
        for k, v in self.buckets[self._idx(key)]:
            if k == key: return v
        return default
 
    def remove(self, key):
        bucket = self.buckets[self._idx(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket.pop(i); self.n -= 1; return

Six core methods. Implement them once, understand them forever, and use the stdlib version everywhere else.

Common pitfalls

The five things people get wrong:

  1. Forgetting that keys can be None / null. Most stdlib hash maps allow null keys (one slot reserved). Hand-rolled implementations often crash.
  2. Mutating a key after insertion. If a key’s hash changes between put and get, you’ll never find it again. Use immutable keys — Python tuples but not lists, strings but not bytearrays.
  3. Comparing hashes instead of keys. Two different keys can have the same hash. Always do hash(k1) == hash(k2) AND k1 == k2 (and the stdlib already does this for you).
  4. Ignoring the load factor. If you build a hash map by repeated insertion without resizing, the table degrades to O(n) per operation.
  5. Using a slow equals / __eq__ on custom keys. Hash maps call equality every time you probe. A multi-millisecond equality is a problem.

Now that you understand the machinery, let’s see how you actually use a hash map in real code — and the standard library patterns that solve 90% of problems.