Rabin-Karp Algorithm

Comparing a length-1000 window against the pattern costs 1000 character comparisons. Do that at every position in a million-character text and you’re back to the naive O(nm) wall. What if you could compare a whole window against the pattern in a single operation — O(1), regardless of window length? You can’t, if you’re comparing strings. So stop comparing strings: compare numbers.

KMP beats the naive matcher by remembering structural information about the pattern. Rabin-Karp takes a completely different route: instead of comparing character-by-character, it converts each window of the text into a number (a hash) and compares numbers. If the pattern’s hash doesn’t match the window’s hash, the window can’t match — skip it without any character comparisons. Only when hashes match do we verify character-by-character.

The magic is in the rolling hash: updating the hash for window T[i+1..i+m] from the hash of T[i..i+m-1] takes O(1), so we scan the entire text in O(n) time on average.

The polynomial hash

The standard choice is a polynomial hash over the characters:

hash(s[0..m-1]) = s[0]·B^(m-1) + s[1]·B^(m-2) + ... + s[m-1]·B^0  (mod Q)

where B is a base (typically a small prime like 31 for lowercase letters, or 131 for the full ASCII range) and Q is a large prime modulus (to keep numbers manageable and reduce collisions).

Rolling the hash

Predict first

Hashing one window of length m costs O(m). If we just re-hash every window from scratch, there are about n windows — what's the total? Is comparing numbers actually buying us anything yet?

When the window slides right by one position (dropping T[i] and adding T[i+m]):

hash(T[i+1..i+m]) = (hash(T[i..i+m-1]) - T[i]·B^(m-1)) · B + T[i+m]   (mod Q)

Three operations: subtract the leftmost character’s contribution, multiply by B (shift everything left), and add the new rightmost character. O(1) per slide.

The term B^(m-1) mod Q is precomputed once before the search loop.

The algorithm

Precompute the pattern hash

def poly_hash(s, B, Q):
    h = 0
    for ch in s:
        h = (h * B + ord(ch)) % Q
    return h

O(m) upfront.

Compute the first window hash

window_hash = poly_hash(T[:m], B, Q)
pattern_hash = poly_hash(P, B, Q)
high = pow(B, m - 1, Q)          # B^(m-1) mod Q — the "leading coefficient"

Slide and compare

for i in range(n - m + 1):
    if window_hash == pattern_hash:
        if T[i:i+m] == P:         # verify to rule out hash collision
            matches.append(i)
    if i < n - m:
        # roll the hash: drop T[i], add T[i+m]
        window_hash = (window_hash - ord(T[i]) * high) % Q
        window_hash = (window_hash * B + ord(T[i + m])) % Q

Everything hinges on the two-line roll. From memory: when the window slides one step, you remove the leftmost character’s contribution, shift the rest up by one power of B, and append the new character. Rebuild those two lines:

python · fill in the blanks0/2 hints

def rabin_karp(T, P, B=131, Q=10**9 + 7):
  n, m = len(T), len(P)
  pat = win = 0
  high = pow(B, m - 1, Q)          # B^(m-1) mod Q, precomputed once
  for i in range(m):
      pat = (pat * B + ord(P[i])) % Q
      win = (win * B + ord(T[i])) % Q
  matches = []
  for i in range(n - m + 1):
      if win == pat and T[i:i+m] == P:
          matches.append(i)
      if i < n - m:
          # ??? remove old leftmost char, shift, add new rightmost char (mod Q)
          # ??? remove old leftmost char, shift, add new rightmost char (mod Q)
  return matches

Line 13 subtracts the leftmost character’s leading contribution; line 14 multiplies by B (shifting every surviving character up one position) and adds the incoming character. Two arithmetic ops — O(1) per slide.

Full implementation

Hash collisions and correctness

A false positive occurs when the window hash equals the pattern hash but the strings differ. The character-level comparison T[i:i+m] == P catches this and rejects the false match. So the algorithm is always correct — the hash is only a filter.

A false negative is impossible: if T[i:i+m] == P then their hashes must match (hash functions are deterministic). The algorithm never misses a real match.

Collision probability: with a random prime Q ≈ n², the probability that any particular window gives a false positive is 1/Q ≈ 1/n². Over n windows, expected false positives ≈ n × 1/n² = 1/n → 0. In practice, collisions are very rare.

⚠️

Negative modulo trap. When rolling the hash, (winHash - T[i] * high) % Q can be negative in languages where % can return negatives (C++, Java). Always add Q before taking the modulus: (winHash - T[i] * high % Q + Q) % Q. Python’s % always returns non-negative, so this isn’t needed there — but it bites in every other language.

Where Rabin-Karp shines: duplicate substring detection

The polynomial hash is a content fingerprint — windows with the same content have the same hash (with high probability). This makes Rabin-Karp the natural choice for problems that ask about substrings of a specific length that appear more than once.

Longest Duplicate Substring (binary search + Rabin-Karp)

Find the longest substring that appears at least twice.

The length of the answer is monotone: if a substring of length k appears twice, any shorter substring also appears (at least) twice. This means we can binary search on the length and check existence.

def longest_duplicate_substring(s):
    def has_dup(length):
        if length == 0: return True
        B, Q = 131, 10**9 + 7
        h = 0
        high = pow(B, length - 1, Q)
        seen = set()
        for i in range(length):
            h = (h * B + ord(s[i])) % Q
        seen.add(h)
        for i in range(1, len(s) - length + 1):
            h = (h - ord(s[i - 1]) * high) % Q
            h = (h * B + ord(s[i + length - 1])) % Q
            if h in seen:
                return True
            seen.add(h)
        return False
 
    lo, hi = 0, len(s) - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if has_dup(mid): lo = mid
        else: hi = mid - 1
    return lo

Time: O(n log n) — O(log n) binary search iterations, each scanning O(n) with rolling hash. Space: O(n) for the hash set.

This is faster than the O(n²) suffix array construction (though suffix arrays can also solve it in O(n log n) with more implementation effort).

With double hashing (using two independent (B, Q) pairs), the collision probability drops from 1/Q to 1/(Q1 × Q2) ≈ 1/10^18. For competitive programming where reliability is critical, double hashing is the standard approach. For interviews, single hashing with a large prime is usually enough.

Multi-pattern search with Rabin-Karp

The standard Rabin-Karp finds one pattern. With a set of patterns, hash all patterns into a set S and check if each window’s hash is in S:

def multi_pattern_search(T, patterns):
    B, Q = 131, 10**9 + 7
    m = len(patterns[0])  # assume all same length
    pat_hashes = {poly_hash(p, B, Q): p for p in patterns}
    # ... slide window, check hash in pat_hashes, verify on hit

For patterns of different lengths, group by length and run one pass per length. This is O(n × num_lengths + total_pattern_length), which beats running KMP separately for each pattern.

For many patterns of varying lengths simultaneously, Aho-Corasick is even better.

Complexity

Operation	Time	Notes
Hash the pattern	O(m)	one-time
Hash first window	O(m)	one-time
Each slide	O(1)	rolling hash
Total slides	O(n - m)
Verification on match	O(m)	amortized rare
Total (average)	O(n + m)	with few collisions
Total (worst case)	O(nm)	pathological collisions

Quick check

How does rolling the hash when sliding the window by one position work?

Rabin-Karp always does a character comparison when a hash match is found. Why is this necessary?

Why is binary search + Rabin-Karp a valid approach for the Longest Duplicate Substring problem?

Where KMP remembered the pattern’s structure, Rabin-Karp remembers its fingerprint — and that fingerprint is reusable across many patterns and many windows, which is why it owns multi-pattern search and duplicate detection. Next: Z-algorithm — back to the deterministic, worst-case-linear world, but with the cleanest correctness proof of the three and a direct line to suffix arrays.

KMP Algorithm Z-Algorithm

Finished this page?