Introduction to Advanced String Matching

Search for "AAAAB" inside a text of a million As. The textbook “slide the pattern and compare” does about five million character comparisons — and almost every one re-checks a character it already looked at one position ago. The fast matchers on this and the next three pages all share one refusal: never read a text character twice. Figuring out how to keep that promise is the entire subject.

You already know the string basics from Day 2 — character arrays, two-pointer reversal, anagram checking. Today is about a different problem: finding a pattern inside a much larger text. This is the core operation behind grep, search engines, DNA sequence alignment, and the String.indexOf call you’ve used without thinking.

The problem

Given a text string T of length n and a pattern string P of length m, find all positions in T where P occurs.

Formally: find all indices i such that T[i..i+m-1] == P[0..m-1].

Why naive matching is slow

The naive algorithm slides P across T one position at a time, checking character by character:

def naive_search(T, P):
    n, m = len(T), len(P)
    for i in range(n - m + 1):
        j = 0
        while j < m and T[i + j] == P[j]:
            j += 1
        if j == m:
            print(f"Match at {i}")

The whole inefficiency of naive matching is hidden in one harmless-looking line. After a partial match fails, the matcher slides forward by one and resets — throwing away everything it just learned about the text. Which line is that?

python · fill in the blanks0/1 hints

def naive_search(T, P):
  n, m = len(T), len(P)
  for i in range(n - m + 1):
      # ??? slide by 1, forget the j chars we just matched
      while j < m and T[i + j] == P[j]:
          j += 1
      if j == m:
          print("Match at", i)

j = 0 is the culprit: each new start position re-reads characters the previous pass already compared. Every fast matcher in this chapter is, at heart, a way to avoid resetting to zero — to carry forward what the last partial match told us.

On typical inputs this is fast. But consider:

T = "AAAAAAAAAB"    (n = 10)
P = "AAAAB"         (m = 5)

Predict first

For this exact T (10 chars) and P (5 chars), how many character comparisons does the naive matcher make before it finds the match? Ball-park it before reading on.

At every position i = 0, 1, 2, 3, 4, 5, the first four characters match and only the fifth fails. The inner loop runs 4 times before the restart. For a string of n A’s followed by a B and a pattern of m-1 A’s followed by a B, the total comparisons are (n - m + 1) × m — that’s O(nm).

For n = 100,000 and m = 1,000, that’s 100 million character comparisons. For genome matching (n = 3 billion, m = 100), it’s a problem.

Pattern Matching: "AAB" in "AAAAB"

Step 1 of 9

Text:

A

A

A

A

B

Pattern:

A

A

B

Start at position 0. Compare T[0]='A' with P[0]='A'. Match.

Notice what the naive algorithm wastes: at position 1, it starts comparing T[1] and P[0]. But we already know T[1] = 'A' from the previous pass. We compared it with P[1] and it matched. Why compare it again?

That is the core insight every fast matcher exploits. The question each algorithm answers differently is: given that we just failed on T[i+j] != P[j], how much text can we skip before the next attempt?

The mismatch-restart problem, stated precisely

When the naive matcher fails at position j in the pattern (after j successes), it:

Forgets that it matched T[i+1..i+j-1] against P[1..j-1].
Restarts comparison of P[0] against T[i+1].

The waste is in step 1. After matching j characters, we know the text window T[i..i+j-1] equals P[0..j-1]. If we’re smart, we can use that knowledge to skip ahead.

The key question: what is the longest prefix of P that is also a suffix of P[0..j-1]? If such a prefix exists, that’s exactly where we can resume matching — we don’t need to move back in T at all.

The three strategies

All three algorithms answer the skip question, but in different ways:

Algorithm	What it remembers	How it skips
KMP	The longest proper prefix-suffix for each prefix of `P`	On mismatch at `P[j]`, jump to `P[fail[j]]` without moving `T`
Rabin-Karp	A rolling hash of the current text window	Compare hashes first; only do character comparison on hash match
Z-algorithm	For each position, the longest string starting there that matches a prefix of `T` (or `P` concatenated with `T`)	Extend the Z-box; never re-compare inside it

They have the same O(n + m) worst case, but different constants and different strengths:

KMP is the canonical teaching algorithm; its failure function reappears in Aho-Corasick.
Rabin-Karp generalizes to multi-pattern search and duplicate detection (because the hash is the same for both the pattern and any window with that content).
Z-algorithm is simpler to prove correct and has a clean relationship to suffix arrays.

A mental model for all three

Think of sliding a window of width m across the text. At each window position, you want to know: does this window equal the pattern?

Naive: recompute the comparison from scratch each slide. O(m) per slide, O(nm) total.
KMP: know exactly where in the pattern you got stuck and resume there; guaranteed progress on every character read.
Rabin-Karp: compute a hash for each window in O(1) by rolling the previous hash; full comparison only when hashes match.
Z-algorithm: precompute a map of “prefix match lengths” for every position and read off the answer.

All four views describe the same mathematical fact — once you’ve read a character, you don’t need to read it again for pattern matching purposes — but each turns that fact into a different piece of data structure.

String algorithm complexity landscape

Before diving into each algorithm, here’s the full picture:

Algorithm	Preprocessing	Search	Space	Best for
Naive	O(1)	O(nm) worst	O(1)	Short patterns, unlikely worst case
KMP	O(m)	O(n)	O(m)	Single pattern, worst-case guarantee
Rabin-Karp	O(m)	O(n) avg / O(nm) worst	O(1)	Multi-pattern, duplicate detection
Z-algorithm	O(n+m)	O(n+m)	O(n+m)	Clean linear guarantee, many applications
Aho-Corasick	O(m_total)	O(n + matches)	O(m_total)	Matching many patterns simultaneously
Suffix array	O(n log n)	O(m log n)	O(n)	Many different queries on the same text

For a standard interview “find pattern in text” problem, any of KMP, Rabin-Karp, or Z works. The choice matters more when you’re doing something fancier: multi-pattern search (Rabin-Karp or Aho-Corasick), palindrome detection (Z-array or Manacher’s), or repeated-query substring problems (suffix array).

Quick check

The naive string search algorithm runs in O(nm) in the worst case. What input pattern triggers the worst case?

After matching T[i..i+j-1] == P[0..j-1] and failing on T[i+j] != P[j], why can we sometimes skip ahead by more than 1 position?

Rabin-Karp achieves O(n) average time by comparing hashes first. What's its worst-case time?

This extends the string fundamentals from Day 2 — character arrays and two-pointer scans — into the harder problem of finding a pattern in a huge text, and names the one idea (never re-read a character) that all three fast matchers share. Next: KMP — the failure function that turns “don’t reset j to zero” into a precomputed table and makes matching linear, guaranteed.

Overview KMP Algorithm

Finished this page?