Rabin-Karp Algorithm
KMP beats the naive matcher by remembering structural information about the pattern. Rabin-Karp takes a completely different route: instead of comparing character-by-character, it converts each window of the text into a number (a hash) and compares numbers. If the pattern’s hash doesn’t match the window’s hash, the window can’t match — skip it without any character comparisons. Only when hashes match do we verify character-by-character.
The magic is in the rolling hash: updating the hash for window T[i+1..i+m] from the hash of T[i..i+m-1] takes O(1), so we scan the entire text in O(n) time on average.
The polynomial hash
The standard choice is a polynomial hash over the characters:
hash(s[0..m-1]) = s[0]·B^(m-1) + s[1]·B^(m-2) + ... + s[m-1]·B^0 (mod Q)where B is a base (typically a small prime like 31 for lowercase letters, or 131 for the full ASCII range) and Q is a large prime modulus (to keep numbers manageable and reduce collisions).
Rolling the hash
When the window slides right by one position (dropping T[i] and adding T[i+m]):
hash(T[i+1..i+m]) = (hash(T[i..i+m-1]) - T[i]·B^(m-1)) · B + T[i+m] (mod Q)Three operations: subtract the leftmost character’s contribution, multiply by B (shift everything left), and add the new rightmost character. O(1) per slide.
The term B^(m-1) mod Q is precomputed once before the search loop.
The algorithm
Precompute the pattern hash
def poly_hash(s, B, Q):
h = 0
for ch in s:
h = (h * B + ord(ch)) % Q
return hO(m) upfront.
Compute the first window hash
window_hash = poly_hash(T[:m], B, Q)
pattern_hash = poly_hash(P, B, Q)
high = pow(B, m - 1, Q) # B^(m-1) mod Q — the "leading coefficient"Slide and compare
for i in range(n - m + 1):
if window_hash == pattern_hash:
if T[i:i+m] == P: # verify to rule out hash collision
matches.append(i)
if i < n - m:
# roll the hash: drop T[i], add T[i+m]
window_hash = (window_hash - ord(T[i]) * high) % Q
window_hash = (window_hash * B + ord(T[i + m])) % QFull implementation
Hash collisions and correctness
A false positive occurs when the window hash equals the pattern hash but the strings differ. The character-level comparison T[i:i+m] == P catches this and rejects the false match. So the algorithm is always correct — the hash is only a filter.
A false negative is impossible: if T[i:i+m] == P then their hashes must match (hash functions are deterministic). The algorithm never misses a real match.
Collision probability: with a random prime Q ≈ n², the probability that any particular window gives a false positive is 1/Q ≈ 1/n². Over n windows, expected false positives ≈ n × 1/n² = 1/n → 0. In practice, collisions are very rare.
Negative modulo trap. When rolling the hash, (winHash - T[i] * high) % Q can be negative in languages where % can return negatives (C++, Java). Always add Q before taking the modulus: (winHash - T[i] * high % Q + Q) % Q. Python’s % always returns non-negative, so this isn’t needed there — but it bites in every other language.
Where Rabin-Karp shines: duplicate substring detection
The polynomial hash is a content fingerprint — windows with the same content have the same hash (with high probability). This makes Rabin-Karp the natural choice for problems that ask about substrings of a specific length that appear more than once.
Longest Duplicate Substring (binary search + Rabin-Karp)
Find the longest substring that appears at least twice.
The length of the answer is monotone: if a substring of length k appears twice, any shorter substring also appears (at least) twice. This means we can binary search on the length and check existence.
def longest_duplicate_substring(s):
def has_dup(length):
if length == 0: return True
B, Q = 131, 10**9 + 7
h = 0
high = pow(B, length - 1, Q)
seen = set()
for i in range(length):
h = (h * B + ord(s[i])) % Q
seen.add(h)
for i in range(1, len(s) - length + 1):
h = (h - ord(s[i - 1]) * high) % Q
h = (h * B + ord(s[i + length - 1])) % Q
if h in seen:
return True
seen.add(h)
return False
lo, hi = 0, len(s) - 1
while lo < hi:
mid = (lo + hi + 1) // 2
if has_dup(mid): lo = mid
else: hi = mid - 1
return loTime: O(n log n) — O(log n) binary search iterations, each scanning O(n) with rolling hash. Space: O(n) for the hash set.
This is faster than the O(n²) suffix array construction (though suffix arrays can also solve it in O(n log n) with more implementation effort).
With double hashing (using two independent (B, Q) pairs), the collision probability drops from 1/Q to 1/(Q1 × Q2) ≈ 1/10^18. For competitive programming where reliability is critical, double hashing is the standard approach. For interviews, single hashing with a large prime is usually enough.
Multi-pattern search with Rabin-Karp
The standard Rabin-Karp finds one pattern. With a set of patterns, hash all patterns into a set S and check if each window’s hash is in S:
def multi_pattern_search(T, patterns):
B, Q = 131, 10**9 + 7
m = len(patterns[0]) # assume all same length
pat_hashes = {poly_hash(p, B, Q): p for p in patterns}
# ... slide window, check hash in pat_hashes, verify on hitFor patterns of different lengths, group by length and run one pass per length. This is O(n × num_lengths + total_pattern_length), which beats running KMP separately for each pattern.
For many patterns of varying lengths simultaneously, Aho-Corasick is even better.
Complexity
| Operation | Time | Notes |
|---|---|---|
| Hash the pattern | O(m) | one-time |
| Hash first window | O(m) | one-time |
| Each slide | O(1) | rolling hash |
| Total slides | O(n - m) | |
| Verification on match | O(m) | amortized rare |
| Total (average) | O(n + m) | with few collisions |
| Total (worst case) | O(nm) | pathological collisions |
Quick check
Next: Z-algorithm — a clean linear matcher with a simpler correctness proof and a direct relationship to suffix arrays.