Collisions & Load Factor

Two keys want the same bucket — and how you resolve that one conflict decides everything about your hash table’s speed, memory, and cache behavior. There are two answers (a list per bucket, or one shared array you probe through), each with a subtle trap: one wastes cache lines, the other makes deletion shockingly easy to get wrong. And a single number — the load factor — quietly governs when the whole thing must stop and rebuild itself.

Two keys hash to the same bucket. Now what?

There are two main strategies — and most stdlib hash maps use one of them, occasionally switching between them based on the situation.

Strategy 1: Separate Chaining

Each bucket holds a linked list (or a small dynamic array) of all the entries that hash to it. On insert, append to the list. On lookup, walk the list and compare keys.

Chaining at high load factor — multiple entries per bucket

[0]

→

apple:1

fig:5

[1]

→

kiwi:7

[2]

→

grape:6

[3]

→

cherry:3

[4]

→

banana:2

date:4

7 entries across 5 buckets — load factor 1.40

Code: a tiny hash map with chaining

Pros & cons of chaining

Pros	Cons
Simple to implement and reason about	Extra memory per entry (the list node overhead)
Degrades gracefully under high load	Cache-unfriendly — buckets are scattered
Load factor can safely exceed 1.0	Worst case is O(n) if everything chains

Java’s HashMap uses chaining (and famously upgrades long chains to red-black trees once they exceed 8 entries).

Strategy 2: Open Addressing

Instead of a list per bucket, everything lives in the main array. On collision, probe for the next available slot.

The simplest variant — linear probing — works like this:

put(k, v):
    i = hash(k) % M
    while buckets[i] is occupied with a different key:
        i = (i + 1) % M
    buckets[i] = (k, v)

Recall the probe step — what do you do when the slot you want is taken?

python · fill in the blanks0/1 hints

def put(self, key, val):
  i = self.hash(key) % self.M
  while self.slots[i] is not None and self.slots[i].key != key:
      # ??? move to the next slot, wrapping around
  self.slots[i] = Entry(key, val)

To look up, hash to the starting slot, then walk forward until you either find the key (return its value) or hit an empty slot (key not present).

                     [   ][ A:1 ][ B:2 ][ C:3 ][   ][   ][   ]
hash("D") → 2 (taken)
hash("D") → 3 (taken)
hash("D") → 4 (empty!)
                     [   ][ A:1 ][ B:2 ][ C:3 ][ D:4 ][   ][   ]

Pros & cons of open addressing

Pros	Cons
Excellent cache locality — everything in one array	Load factor must stay well below 1.0 (~0.5–0.75)
Lower memory overhead — no list-node pointers	Deletion is tricky (need tombstones)
Faster in practice when sized correctly	Clustering can hurt performance

Predict first

In open addressing, you delete a key by simply emptying its slot (set it back to None). What breaks?

Variants try to spread probes better:

Quadratic probing — try i + 1, i + 4, i + 9, …. Better distribution than linear.
Double hashing — use a second hash function to determine the probe step. Best distribution; used by Python’s dict.

Python’s dict and C++ unordered_map use open addressing variants. Most modern hash table implementations have moved this way.

The load factor

The load factor α = n / M (entries / buckets) is the single number that controls how fast a hash table is.

α small (< 0.5): lots of empty buckets, hash collisions rare, operations close to O(1) — wastes memory.
α around 0.5–0.75: sweet spot for most implementations.
α near 1.0 with chaining: still works but slows down — buckets get long.
α near 1.0 with open addressing: catastrophic — every probe might need to scan most of the table.

When α exceeds a threshold (typically 0.75), the hash table resizes: allocate a new array of double the size, then rehash every existing entry into the new array using the new modulus.

Old: 8 buckets, 6 entries — α = 0.75 → resize!
New: 16 buckets, 6 entries — α = 0.375 → roomy again

Resizing is O(n) work. But it happens roughly every time the table doubles in size, so it’s amortized to O(1) per insert:

After resizing to size 2M, the next resize needs n = 2M × 0.75 = 1.5M new entries.
We do M work for the resize, but we add 1.5M − 0.75M = 0.75M new entries between resizes.
Cost per insert: O(M) ÷ O(M new inserts) = O(1) amortized.

This is the same amortized argument as the dynamic array (vector, list, ArrayList) doubling strategy from Day 2. A rare expensive operation, spread across many cheap ones, averages out to O(1) per operation.

What if you implement a HashMap from scratch?

A common interview question. The skeleton:

class MyHashMap:
    def __init__(self):
        self.cap = 8
        self.n = 0
        self.buckets = [[] for _ in range(self.cap)]
 
    def _idx(self, key):
        return hash(key) % self.cap
 
    def _resize(self):
        old = self.buckets
        self.cap *= 2
        self.buckets = [[] for _ in range(self.cap)]
        self.n = 0
        for bucket in old:
            for k, v in bucket:
                self.put(k, v)        # re-insert under new capacity
 
    def put(self, key, value):
        bucket = self.buckets[self._idx(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value); return
        bucket.append((key, value))
        self.n += 1
        if self.n / self.cap > 0.75:
            self._resize()
 
    def get(self, key, default=None):
        for k, v in self.buckets[self._idx(key)]:
            if k == key: return v
        return default
 
    def remove(self, key):
        bucket = self.buckets[self._idx(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket.pop(i); self.n -= 1; return

Six core methods. Implement them once, understand them forever, and use the stdlib version everywhere else.

Common pitfalls

The five things people get wrong:

Forgetting that keys can be None / null. Most stdlib hash maps allow null keys (one slot reserved). Hand-rolled implementations often crash.
Mutating a key after insertion. If a key’s hash changes between put and get, you’ll never find it again. Use immutable keys — Python tuples but not lists, strings but not bytearrays.
Comparing hashes instead of keys. Two different keys can have the same hash. Always do hash(k1) == hash(k2) AND k1 == k2 (and the stdlib already does this for you).
Ignoring the load factor. If you build a hash map by repeated insertion without resizing, the table degrades to O(n) per operation.
Using a slow equals / __eq__ on custom keys. Hash maps call equality every time you probe. A multi-millisecond equality is a problem.

Quick check

When the load factor crosses ~0.75, the table doubles and re-hashes every entry — O(n) work. Doesn't that make insertion O(n)?

You now know what’s under the hood — collisions, probing, load factor, resizing. Next: using hash maps — the everyday stdlib patterns (frequency counts, grouping, seen-sets, two-sum) that solve a huge share of interview problems in a few lines.

Introduction Using Hash Maps

Finished this page?