Z-Algorithm

For every position in a string, ask: “how many characters starting here match the beginning of the string?” Computing that naively is O(n²) — re-scan from each position. But here’s the leverage: once you’ve matched a long stretch starting at some position, you already know what the characters inside it are — they equal a prefix you’ve seen. So for positions inside that stretch, you can read the answer off a mirror instead of re-comparing. That one observation collapses O(n²) to O(n), and it’s the cleanest such argument in all of string matching.

The Z-algorithm constructs an array Z where Z[i] is the length of the longest substring starting at position i that is also a prefix of the whole string. With this array in hand, finding a pattern in a text becomes a single linear scan.

It’s arguably the cleanest of the three linear matchers: one array, one linear scan, a clear loop invariant. Many competitive programmers prefer it over KMP precisely because the code is shorter and easier to remember.

The Z-array

For a string S of length n, define:

Z[0] = n   (by convention, or sometimes left as 0)
Z[i] = length of the longest string starting at S[i] that matches a prefix of S

Example: `S = "AABXAABXCAABXAABXAY"`

i	S[i..]	Matches prefix of length	Z[i]
0	AABXAABXCAABXAABXAY	—	19
1	ABXAABXCAABXAABXAY	A? AA? No — “A” but then “B”≠“A”	1
2	BXAABXCAABXAABXAY	”B” ≠ “A”	0
3	XAABXCAABXAABXAY	”X” ≠ “A”	0
4	AABXCAABXAABXAY	”AABX” then “C”≠“A”	4
5	ABXCAABXAABXAY	”A” matches, then “B”≠“A”	1
…	…		…

Predict first

Warm up on a trivial case first. For S = 'aaaaa' (five a's), write out the full Z-array Z[0..4] before reading on.

The Z-array is [19, 1, 0, 0, 4, 1, 0, 0, 0, 8, 1, 0, 0, 4, 1, 0, 0, 1, 0].

Notice Z[9] = 8: the suffix starting at position 9 shares an 8-character prefix with the full string. That means S[9..16] == S[0..7] — a useful structural fact.

Constructing the Z-array: the Z-box trick

The naive approach — for each i, count matching characters from position i and from 0 — is O(n²) in the worst case. The Z-algorithm maintains a Z-box: the interval [l, r] corresponding to the rightmost Z-value computed so far. Specifically, S[l..r] is the window that was matched to produce the current rightmost known prefix extension.

The loop invariant: S[l..r] == S[0..r-l] (i.e., the substring S[l..r] matches the prefix S[0..r-l]).

For each position i:

Case 1: i > r — i is outside the current Z-box. We have no free information. Start comparing S[i] with S[0], S[i+1] with S[1], etc. Update [l, r] if we find a match.
Case 2: i <= r — i is inside the Z-box. Since S[l..r] == S[0..r-l], position i corresponds to position i - l inside the prefix. We know Z[i-l] characters match from there.
- Sub-case 2a: Z[i-l] < r - i + 1 (the known match doesn’t reach the Z-box boundary). Then Z[i] = Z[i-l] — exactly as much match as the interior position, no extension needed.
- Sub-case 2b: Z[i-l] >= r - i + 1 (the known match reaches or exceeds the boundary). Then Z[i] >= r - i + 1, and we try to extend further from r + 1.

In both sub-cases of Case 2, we either directly read off Z[i] or start extending from a position we’ve already reached. The r pointer never decreases. Since r increases by at most n total across all iterations, the total work is O(n).

Implementation

The entire Z-box optimization is one line. When i falls inside the current box [l, r], you don’t compare from scratch — you copy the mirror value Z[i - l], capped so it can’t claim more than the box guarantees. Rebuild that line:

python · fill in the blanks0/1 hints

def z_array(S):
  n = len(S)
  Z = [0] * n
  Z[0] = n
  l = r = 0
  for i in range(1, n):
      # ??? copy the mirror's Z-value, but never past the box edge (r - i)
      while i + Z[i] < n and S[Z[i]] == S[i + Z[i]]:
          Z[i] += 1
      if i + Z[i] > r:
          l, r = i, i + Z[i]
  return Z

min(r - i, Z[i - l]): the mirror position i - l already told us its match length, but we can only trust it up to the box edge r - i — past that, the box invariant says nothing, so we cap and then extend by hand in the while. Drop the min and you’d over-claim matches the box never verified.

Why the concatenation trick works

When we form S = P + "#" + T, the Z-values in the T portion tell us exactly how many characters of T[i..] match a prefix of S — which is the same as a prefix of P (since "#" can’t appear in either, it caps all Z-values at m). If Z[m + 1 + i] == m, then T[i..i+m-1] == P[0..m-1] — a full match.

The "#" separator is critical: without it, the pattern characters might blend into the text characters and inflate Z-values spuriously.

Applications

String periodicity

A string s has period p if s[i] == s[i+p] for all valid i. The Z-array directly exposes this:

def find_period(s):
    Z = z_array(s)
    n = len(s)
    for p in range(1, n):
        if Z[p] >= n - p:    # S[p..n-1] matches S[0..n-p-1]
            return p
    return n                 # period is the full length (no repetition)

If Z[p] >= n - p, then every suffix of s starting at a multiple of p matches the prefix — the string has period p.

Counting distinct substrings with suffix arrays

The Z-array on each suffix can count how many of its characters are “new” (not overlapping with any shorter suffix match). This is the core idea behind the LCP array in suffix arrays — but that’s covered in Advanced Patterns.

Z vs KMP vs Rabin-Karp

Property	KMP	Z-algorithm	Rabin-Karp
Preprocessing	Failure fn O(m)	Z-array O(n+m)	Hash O(m)
Search	O(n)	O(n)	O(n) avg
Space	O(m)	O(n+m)	O(1)
Code complexity	Medium	Low	Low
Multiple patterns	No (use A-C)	No (use A-C)	Yes (hash set)
Worst case guarantee	Yes	Yes	No
Natural extension	Aho-Corasick	Suffix arrays	Double hashing

For a clean interview implementation of single-pattern search, Z-algorithm is often the quickest to write correctly. For the structural application (failure function, Aho-Corasick, palindrome tricks), KMP’s failure function is more useful.

Complexity

Phase	Time	Space
Build Z-array	O(n)	O(n)
Search via concatenation	O(n + m)	O(n + m)

Quick check

In the Z-algorithm, what does the sentinel '#' in the concatenation P + '#' + T prevent?

In the Z-box maintenance, when i <= r and Z[i - l] < r - i + 1, why can we set Z[i] = Z[i - l] without any extension?

Why does the Z-algorithm run in O(n) even though the inner while loop can repeat multiple times per iteration?

The Z-box’s “trust the mirror, but only inside the verified window” is the same move you’ll see again immediately — Manacher’s palindrome algorithm is the Z-box with palindromes instead of prefixes. KMP, Rabin-Karp, and Z now give you three complete answers to single-pattern search. Next: Advanced Patterns — Manacher’s O(n) palindromes, the Aho-Corasick multi-pattern machine, and a sketch of suffix arrays, for the problems one pattern can’t cover.

Rabin-Karp Advanced Patterns

Finished this page?