🚀 Phases 1–5 are live — Days 1–17 cover the foundations and the algorithmic patterns. See the roadmap →

Rabin-Karp Algorithm

KMP beats the naive matcher by remembering structural information about the pattern. Rabin-Karp takes a completely different route: instead of comparing character-by-character, it converts each window of the text into a number (a hash) and compares numbers. If the pattern’s hash doesn’t match the window’s hash, the window can’t match — skip it without any character comparisons. Only when hashes match do we verify character-by-character.

The magic is in the rolling hash: updating the hash for window T[i+1..i+m] from the hash of T[i..i+m-1] takes O(1), so we scan the entire text in O(n) time on average.

The polynomial hash

The standard choice is a polynomial hash over the characters:

hash(s[0..m-1]) = s[0]·B^(m-1) + s[1]·B^(m-2) + ... + s[m-1]·B^0  (mod Q)

where B is a base (typically a small prime like 31 for lowercase letters, or 131 for the full ASCII range) and Q is a large prime modulus (to keep numbers manageable and reduce collisions).

Rolling the hash

When the window slides right by one position (dropping T[i] and adding T[i+m]):

hash(T[i+1..i+m]) = (hash(T[i..i+m-1]) - T[i]·B^(m-1)) · B + T[i+m]   (mod Q)

Three operations: subtract the leftmost character’s contribution, multiply by B (shift everything left), and add the new rightmost character. O(1) per slide.

The term B^(m-1) mod Q is precomputed once before the search loop.

The algorithm

Precompute the pattern hash

def poly_hash(s, B, Q):
    h = 0
    for ch in s:
        h = (h * B + ord(ch)) % Q
    return h

O(m) upfront.

Compute the first window hash

window_hash = poly_hash(T[:m], B, Q)
pattern_hash = poly_hash(P, B, Q)
high = pow(B, m - 1, Q)          # B^(m-1) mod Q — the "leading coefficient"

Slide and compare

for i in range(n - m + 1):
    if window_hash == pattern_hash:
        if T[i:i+m] == P:         # verify to rule out hash collision
            matches.append(i)
    if i < n - m:
        # roll the hash: drop T[i], add T[i+m]
        window_hash = (window_hash - ord(T[i]) * high) % Q
        window_hash = (window_hash * B + ord(T[i + m])) % Q

Full implementation

Hash collisions and correctness

A false positive occurs when the window hash equals the pattern hash but the strings differ. The character-level comparison T[i:i+m] == P catches this and rejects the false match. So the algorithm is always correct — the hash is only a filter.

A false negative is impossible: if T[i:i+m] == P then their hashes must match (hash functions are deterministic). The algorithm never misses a real match.

Collision probability: with a random prime Q ≈ n², the probability that any particular window gives a false positive is 1/Q ≈ 1/n². Over n windows, expected false positives ≈ n × 1/n² = 1/n → 0. In practice, collisions are very rare.

⚠️

Negative modulo trap. When rolling the hash, (winHash - T[i] * high) % Q can be negative in languages where % can return negatives (C++, Java). Always add Q before taking the modulus: (winHash - T[i] * high % Q + Q) % Q. Python’s % always returns non-negative, so this isn’t needed there — but it bites in every other language.

Where Rabin-Karp shines: duplicate substring detection

The polynomial hash is a content fingerprint — windows with the same content have the same hash (with high probability). This makes Rabin-Karp the natural choice for problems that ask about substrings of a specific length that appear more than once.

Longest Duplicate Substring (binary search + Rabin-Karp)

Find the longest substring that appears at least twice.

The length of the answer is monotone: if a substring of length k appears twice, any shorter substring also appears (at least) twice. This means we can binary search on the length and check existence.

def longest_duplicate_substring(s):
    def has_dup(length):
        if length == 0: return True
        B, Q = 131, 10**9 + 7
        h = 0
        high = pow(B, length - 1, Q)
        seen = set()
        for i in range(length):
            h = (h * B + ord(s[i])) % Q
        seen.add(h)
        for i in range(1, len(s) - length + 1):
            h = (h - ord(s[i - 1]) * high) % Q
            h = (h * B + ord(s[i + length - 1])) % Q
            if h in seen:
                return True
            seen.add(h)
        return False
 
    lo, hi = 0, len(s) - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if has_dup(mid): lo = mid
        else: hi = mid - 1
    return lo

Time: O(n log n) — O(log n) binary search iterations, each scanning O(n) with rolling hash. Space: O(n) for the hash set.

This is faster than the O(n²) suffix array construction (though suffix arrays can also solve it in O(n log n) with more implementation effort).

With double hashing (using two independent (B, Q) pairs), the collision probability drops from 1/Q to 1/(Q1 × Q2) ≈ 1/10^18. For competitive programming where reliability is critical, double hashing is the standard approach. For interviews, single hashing with a large prime is usually enough.

Multi-pattern search with Rabin-Karp

The standard Rabin-Karp finds one pattern. With a set of patterns, hash all patterns into a set S and check if each window’s hash is in S:

def multi_pattern_search(T, patterns):
    B, Q = 131, 10**9 + 7
    m = len(patterns[0])  # assume all same length
    pat_hashes = {poly_hash(p, B, Q): p for p in patterns}
    # ... slide window, check hash in pat_hashes, verify on hit

For patterns of different lengths, group by length and run one pass per length. This is O(n × num_lengths + total_pattern_length), which beats running KMP separately for each pattern.

For many patterns of varying lengths simultaneously, Aho-Corasick is even better.

Complexity

OperationTimeNotes
Hash the patternO(m)one-time
Hash first windowO(m)one-time
Each slideO(1)rolling hash
Total slidesO(n - m)
Verification on matchO(m)amortized rare
Total (average)O(n + m)with few collisions
Total (worst case)O(nm)pathological collisions

Quick check

How does rolling the hash when sliding the window by one position work?
Rabin-Karp always does a character comparison when a hash match is found. Why is this necessary?
Why is binary search + Rabin-Karp a valid approach for the Longest Duplicate Substring problem?

Next: Z-algorithm — a clean linear matcher with a simpler correctness proof and a direct relationship to suffix arrays.