Z-Algorithm
The Z-algorithm constructs an array Z where Z[i] is the length of the longest substring starting at position i that is also a prefix of the whole string. With this array in hand, finding a pattern in a text becomes a single linear scan.
It’s arguably the cleanest of the three linear matchers: one array, one linear scan, a clear loop invariant. Many competitive programmers prefer it over KMP precisely because the code is shorter and easier to remember.
The Z-array
For a string S of length n, define:
Z[0] = n (by convention, or sometimes left as 0)
Z[i] = length of the longest string starting at S[i] that matches a prefix of SExample: S = "AABXAABXCAABXAABXAY"
| i | S[i..] | Matches prefix of length | Z[i] |
|---|---|---|---|
| 0 | AABXAABXCAABXAABXAY | — | 19 |
| 1 | ABXAABXCAABXAABXAY | A? AA? No — “A” but then “B”≠“A” | 1 |
| 2 | BXAABXCAABXAABXAY | ”B” ≠ “A” | 0 |
| 3 | XAABXCAABXAABXAY | ”X” ≠ “A” | 0 |
| 4 | AABXCAABXAABXAY | ”AABX” then “C”≠“A” | 4 |
| 5 | ABXCAABXAABXAY | ”A” matches, then “B”≠“A” | 1 |
| … | … | … |
The Z-array is [19, 1, 0, 0, 4, 1, 0, 0, 0, 8, 1, 0, 0, 4, 1, 0, 0, 1, 0].
Notice Z[9] = 8: the suffix starting at position 9 shares an 8-character prefix with the full string. That means S[9..16] == S[0..7] — a useful structural fact.
Constructing the Z-array: the Z-box trick
The naive approach — for each i, count matching characters from position i and from 0 — is O(n²) in the worst case. The Z-algorithm maintains a Z-box: the interval [l, r] corresponding to the rightmost Z-value computed so far. Specifically, S[l..r] is the window that was matched to produce the current rightmost known prefix extension.
The loop invariant: S[l..r] == S[0..r-l] (i.e., the substring S[l..r] matches the prefix S[0..r-l]).
For each position i:
- Case 1:
i > r—iis outside the current Z-box. We have no free information. Start comparingS[i]withS[0],S[i+1]withS[1], etc. Update[l, r]if we find a match. - Case 2:
i <= r—iis inside the Z-box. SinceS[l..r] == S[0..r-l], positionicorresponds to positioni - linside the prefix. We knowZ[i-l]characters match from there.- Sub-case 2a:
Z[i-l] < r - i + 1(the known match doesn’t reach the Z-box boundary). ThenZ[i] = Z[i-l]— exactly as much match as the interior position, no extension needed. - Sub-case 2b:
Z[i-l] >= r - i + 1(the known match reaches or exceeds the boundary). ThenZ[i] >= r - i + 1, and we try to extend further fromr + 1.
- Sub-case 2a:
In both sub-cases of Case 2, we either directly read off Z[i] or start extending from a position we’ve already reached. The r pointer never decreases. Since r increases by at most n total across all iterations, the total work is O(n).
Implementation
Why the concatenation trick works
When we form S = P + "#" + T, the Z-values in the T portion tell us exactly how many characters of T[i..] match a prefix of S — which is the same as a prefix of P (since "#" can’t appear in either, it caps all Z-values at m). If Z[m + 1 + i] == m, then T[i..i+m-1] == P[0..m-1] — a full match.
The "#" separator is critical: without it, the pattern characters might blend into the text characters and inflate Z-values spuriously.
Applications
String periodicity
A string s has period p if s[i] == s[i+p] for all valid i. The Z-array directly exposes this:
def find_period(s):
Z = z_array(s)
n = len(s)
for p in range(1, n):
if Z[p] >= n - p: # S[p..n-1] matches S[0..n-p-1]
return p
return n # period is the full length (no repetition)If Z[p] >= n - p, then every suffix of s starting at a multiple of p matches the prefix — the string has period p.
Counting distinct substrings with suffix arrays
The Z-array on each suffix can count how many of its characters are “new” (not overlapping with any shorter suffix match). This is the core idea behind the LCP array in suffix arrays — but that’s covered in Advanced Patterns.
Z vs KMP vs Rabin-Karp
| Property | KMP | Z-algorithm | Rabin-Karp |
|---|---|---|---|
| Preprocessing | Failure fn O(m) | Z-array O(n+m) | Hash O(m) |
| Search | O(n) | O(n) | O(n) avg |
| Space | O(m) | O(n+m) | O(1) |
| Code complexity | Medium | Low | Low |
| Multiple patterns | No (use A-C) | No (use A-C) | Yes (hash set) |
| Worst case guarantee | Yes | Yes | No |
| Natural extension | Aho-Corasick | Suffix arrays | Double hashing |
For a clean interview implementation of single-pattern search, Z-algorithm is often the quickest to write correctly. For the structural application (failure function, Aho-Corasick, palindrome tricks), KMP’s failure function is more useful.
Complexity
| Phase | Time | Space |
|---|---|---|
| Build Z-array | O(n) | O(n) |
| Search via concatenation | O(n + m) | O(n + m) |
Quick check
Next: Advanced Patterns — Manacher’s O(n) palindrome algorithm, the Aho-Corasick multi-pattern machine, and a high-level sketch of suffix arrays.