Introduction to Advanced String Matching
You already know the string basics from Day 2 — character arrays, two-pointer reversal, anagram checking. Today is about a different problem: finding a pattern inside a much larger text. This is the core operation behind grep, search engines, DNA sequence alignment, and the String.indexOf call you’ve used without thinking.
The problem
Given a text string
Tof lengthnand a pattern stringPof lengthm, find all positions inTwherePoccurs.
Formally: find all indices i such that T[i..i+m-1] == P[0..m-1].
Why naive matching is slow
The naive algorithm slides P across T one position at a time, checking character by character:
def naive_search(T, P):
n, m = len(T), len(P)
for i in range(n - m + 1):
j = 0
while j < m and T[i + j] == P[j]:
j += 1
if j == m:
print(f"Match at {i}")On typical inputs this is fast. But consider:
T = "AAAAAAAAAB" (n = 10)
P = "AAAAB" (m = 5)At every position i = 0, 1, 2, 3, 4, 5, the first four characters match and only the fifth fails. The inner loop runs 4 times before the restart. For a string of n A’s followed by a B and a pattern of m-1 A’s followed by a B, the total comparisons are (n - m + 1) × m — that’s O(nm).
For n = 100,000 and m = 1,000, that’s 100 million character comparisons. For genome matching (n = 3 billion, m = 100), it’s a problem.
Notice what the naive algorithm wastes: at position 1, it starts comparing T[1] and P[0]. But we already know T[1] = 'A' from the previous pass. We compared it with P[1] and it matched. Why compare it again?
That is the core insight every fast matcher exploits. The question each algorithm answers differently is: given that we just failed on T[i+j] != P[j], how much text can we skip before the next attempt?
The mismatch-restart problem, stated precisely
When the naive matcher fails at position j in the pattern (after j successes), it:
- Forgets that it matched
T[i+1..i+j-1]againstP[1..j-1]. - Restarts comparison of
P[0]againstT[i+1].
The waste is in step 1. After matching j characters, we know the text window T[i..i+j-1] equals P[0..j-1]. If we’re smart, we can use that knowledge to skip ahead.
The key question: what is the longest prefix of P that is also a suffix of P[0..j-1]? If such a prefix exists, that’s exactly where we can resume matching — we don’t need to move back in T at all.
The three strategies
All three algorithms answer the skip question, but in different ways:
| Algorithm | What it remembers | How it skips |
|---|---|---|
| KMP | The longest proper prefix-suffix for each prefix of P | On mismatch at P[j], jump to P[fail[j]] without moving T |
| Rabin-Karp | A rolling hash of the current text window | Compare hashes first; only do character comparison on hash match |
| Z-algorithm | For each position, the longest string starting there that matches a prefix of T (or P concatenated with T) | Extend the Z-box; never re-compare inside it |
They have the same O(n + m) worst case, but different constants and different strengths:
- KMP is the canonical teaching algorithm; its failure function reappears in Aho-Corasick.
- Rabin-Karp generalizes to multi-pattern search and duplicate detection (because the hash is the same for both the pattern and any window with that content).
- Z-algorithm is simpler to prove correct and has a clean relationship to suffix arrays.
A mental model for all three
Think of sliding a window of width m across the text. At each window position, you want to know: does this window equal the pattern?
- Naive: recompute the comparison from scratch each slide. O(m) per slide, O(nm) total.
- KMP: know exactly where in the pattern you got stuck and resume there; guaranteed progress on every character read.
- Rabin-Karp: compute a hash for each window in O(1) by rolling the previous hash; full comparison only when hashes match.
- Z-algorithm: precompute a map of “prefix match lengths” for every position and read off the answer.
All four views describe the same mathematical fact — once you’ve read a character, you don’t need to read it again for pattern matching purposes — but each turns that fact into a different piece of data structure.
String algorithm complexity landscape
Before diving into each algorithm, here’s the full picture:
| Algorithm | Preprocessing | Search | Space | Best for |
|---|---|---|---|---|
| Naive | O(1) | O(nm) worst | O(1) | Short patterns, unlikely worst case |
| KMP | O(m) | O(n) | O(m) | Single pattern, worst-case guarantee |
| Rabin-Karp | O(m) | O(n) avg / O(nm) worst | O(1) | Multi-pattern, duplicate detection |
| Z-algorithm | O(n+m) | O(n+m) | O(n+m) | Clean linear guarantee, many applications |
| Aho-Corasick | O(m_total) | O(n + matches) | O(m_total) | Matching many patterns simultaneously |
| Suffix array | O(n log n) | O(m log n) | O(n) | Many different queries on the same text |
For a standard interview “find pattern in text” problem, any of KMP, Rabin-Karp, or Z works. The choice matters more when you’re doing something fancier: multi-pattern search (Rabin-Karp or Aho-Corasick), palindrome detection (Z-array or Manacher’s), or repeated-query substring problems (suffix array).
Quick check
Next: KMP — the failure function that makes pattern matching linear, guaranteed.