Z-Algorithm
For every position in a string, ask: “how many characters starting here match the beginning of the string?” Computing that naively is
O(n²)— re-scan from each position. But here’s the leverage: once you’ve matched a long stretch starting at some position, you already know what the characters inside it are — they equal a prefix you’ve seen. So for positions inside that stretch, you can read the answer off a mirror instead of re-comparing. That one observation collapsesO(n²)toO(n), and it’s the cleanest such argument in all of string matching.
The Z-algorithm constructs an array Z where Z[i] is the length of the longest substring starting at position i that is also a prefix of the whole string. With this array in hand, finding a pattern in a text becomes a single linear scan.
It’s arguably the cleanest of the three linear matchers: one array, one linear scan, a clear loop invariant. Many competitive programmers prefer it over KMP precisely because the code is shorter and easier to remember.
The Z-array
For a string S of length n, define:
Z[0] = n (by convention, or sometimes left as 0)
Z[i] = length of the longest string starting at S[i] that matches a prefix of SExample: S = "AABXAABXCAABXAABXAY"
| i | S[i..] | Matches prefix of length | Z[i] |
|---|---|---|---|
| 0 | AABXAABXCAABXAABXAY | — | 19 |
| 1 | ABXAABXCAABXAABXAY | A? AA? No — “A” but then “B”≠“A” | 1 |
| 2 | BXAABXCAABXAABXAY | ”B” ≠ “A” | 0 |
| 3 | XAABXCAABXAABXAY | ”X” ≠ “A” | 0 |
| 4 | AABXCAABXAABXAY | ”AABX” then “C”≠“A” | 4 |
| 5 | ABXCAABXAABXAY | ”A” matches, then “B”≠“A” | 1 |
| … | … | … |
The Z-array is [19, 1, 0, 0, 4, 1, 0, 0, 0, 8, 1, 0, 0, 4, 1, 0, 0, 1, 0].
Notice Z[9] = 8: the suffix starting at position 9 shares an 8-character prefix with the full string. That means S[9..16] == S[0..7] — a useful structural fact.
Constructing the Z-array: the Z-box trick
The naive approach — for each i, count matching characters from position i and from 0 — is O(n²) in the worst case. The Z-algorithm maintains a Z-box: the interval [l, r] corresponding to the rightmost Z-value computed so far. Specifically, S[l..r] is the window that was matched to produce the current rightmost known prefix extension.
The loop invariant: S[l..r] == S[0..r-l] (i.e., the substring S[l..r] matches the prefix S[0..r-l]).
For each position i:
- Case 1:
i > r—iis outside the current Z-box. We have no free information. Start comparingS[i]withS[0],S[i+1]withS[1], etc. Update[l, r]if we find a match. - Case 2:
i <= r—iis inside the Z-box. SinceS[l..r] == S[0..r-l], positionicorresponds to positioni - linside the prefix. We knowZ[i-l]characters match from there.- Sub-case 2a:
Z[i-l] < r - i + 1(the known match doesn’t reach the Z-box boundary). ThenZ[i] = Z[i-l]— exactly as much match as the interior position, no extension needed. - Sub-case 2b:
Z[i-l] >= r - i + 1(the known match reaches or exceeds the boundary). ThenZ[i] >= r - i + 1, and we try to extend further fromr + 1.
- Sub-case 2a:
In both sub-cases of Case 2, we either directly read off Z[i] or start extending from a position we’ve already reached. The r pointer never decreases. Since r increases by at most n total across all iterations, the total work is O(n).
Implementation
The entire Z-box optimization is one line. When i falls inside the current box [l, r], you don’t compare from scratch — you copy the mirror value Z[i - l], capped so it can’t claim more than the box guarantees. Rebuild that line:
min(r - i, Z[i - l]): the mirror position i - l already told us its match length, but we can only trust it up to the box edge r - i — past that, the box invariant says nothing, so we cap and then extend by hand in the while. Drop the min and you’d over-claim matches the box never verified.
Why the concatenation trick works
When we form S = P + "#" + T, the Z-values in the T portion tell us exactly how many characters of T[i..] match a prefix of S — which is the same as a prefix of P (since "#" can’t appear in either, it caps all Z-values at m). If Z[m + 1 + i] == m, then T[i..i+m-1] == P[0..m-1] — a full match.
The "#" separator is critical: without it, the pattern characters might blend into the text characters and inflate Z-values spuriously.
Applications
String periodicity
A string s has period p if s[i] == s[i+p] for all valid i. The Z-array directly exposes this:
def find_period(s):
Z = z_array(s)
n = len(s)
for p in range(1, n):
if Z[p] >= n - p: # S[p..n-1] matches S[0..n-p-1]
return p
return n # period is the full length (no repetition)If Z[p] >= n - p, then every suffix of s starting at a multiple of p matches the prefix — the string has period p.
Counting distinct substrings with suffix arrays
The Z-array on each suffix can count how many of its characters are “new” (not overlapping with any shorter suffix match). This is the core idea behind the LCP array in suffix arrays — but that’s covered in Advanced Patterns.
Z vs KMP vs Rabin-Karp
| Property | KMP | Z-algorithm | Rabin-Karp |
|---|---|---|---|
| Preprocessing | Failure fn O(m) | Z-array O(n+m) | Hash O(m) |
| Search | O(n) | O(n) | O(n) avg |
| Space | O(m) | O(n+m) | O(1) |
| Code complexity | Medium | Low | Low |
| Multiple patterns | No (use A-C) | No (use A-C) | Yes (hash set) |
| Worst case guarantee | Yes | Yes | No |
| Natural extension | Aho-Corasick | Suffix arrays | Double hashing |
For a clean interview implementation of single-pattern search, Z-algorithm is often the quickest to write correctly. For the structural application (failure function, Aho-Corasick, palindrome tricks), KMP’s failure function is more useful.
Complexity
| Phase | Time | Space |
|---|---|---|
| Build Z-array | O(n) | O(n) |
| Search via concatenation | O(n + m) | O(n + m) |
Quick check
The Z-box’s “trust the mirror, but only inside the verified window” is the same move you’ll see again immediately — Manacher’s palindrome algorithm is the Z-box with palindromes instead of prefixes. KMP, Rabin-Karp, and Z now give you three complete answers to single-pattern search. Next: Advanced Patterns — Manacher’s O(n) palindromes, the Aho-Corasick multi-pattern machine, and a sketch of suffix arrays, for the problems one pattern can’t cover.