🚀 Phases 1–5 are live — Days 1–17 cover the foundations and the algorithmic patterns. See the roadmap →

Z-Algorithm

The Z-algorithm constructs an array Z where Z[i] is the length of the longest substring starting at position i that is also a prefix of the whole string. With this array in hand, finding a pattern in a text becomes a single linear scan.

It’s arguably the cleanest of the three linear matchers: one array, one linear scan, a clear loop invariant. Many competitive programmers prefer it over KMP precisely because the code is shorter and easier to remember.

The Z-array

For a string S of length n, define:

Z[0] = n   (by convention, or sometimes left as 0)
Z[i] = length of the longest string starting at S[i] that matches a prefix of S

Example: S = "AABXAABXCAABXAABXAY"

iS[i..]Matches prefix of lengthZ[i]
0AABXAABXCAABXAABXAY19
1ABXAABXCAABXAABXAYA? AA? No — “A” but then “B”≠“A”1
2BXAABXCAABXAABXAY”B” ≠ “A”0
3XAABXCAABXAABXAY”X” ≠ “A”0
4AABXCAABXAABXAY”AABX” then “C”≠“A”4
5ABXCAABXAABXAY”A” matches, then “B”≠“A”1

The Z-array is [19, 1, 0, 0, 4, 1, 0, 0, 0, 8, 1, 0, 0, 4, 1, 0, 0, 1, 0].

Notice Z[9] = 8: the suffix starting at position 9 shares an 8-character prefix with the full string. That means S[9..16] == S[0..7] — a useful structural fact.

Constructing the Z-array: the Z-box trick

The naive approach — for each i, count matching characters from position i and from 0 — is O(n²) in the worst case. The Z-algorithm maintains a Z-box: the interval [l, r] corresponding to the rightmost Z-value computed so far. Specifically, S[l..r] is the window that was matched to produce the current rightmost known prefix extension.

The loop invariant: S[l..r] == S[0..r-l] (i.e., the substring S[l..r] matches the prefix S[0..r-l]).

For each position i:

  • Case 1: i > ri is outside the current Z-box. We have no free information. Start comparing S[i] with S[0], S[i+1] with S[1], etc. Update [l, r] if we find a match.
  • Case 2: i <= ri is inside the Z-box. Since S[l..r] == S[0..r-l], position i corresponds to position i - l inside the prefix. We know Z[i-l] characters match from there.
    • Sub-case 2a: Z[i-l] < r - i + 1 (the known match doesn’t reach the Z-box boundary). Then Z[i] = Z[i-l] — exactly as much match as the interior position, no extension needed.
    • Sub-case 2b: Z[i-l] >= r - i + 1 (the known match reaches or exceeds the boundary). Then Z[i] >= r - i + 1, and we try to extend further from r + 1.

In both sub-cases of Case 2, we either directly read off Z[i] or start extending from a position we’ve already reached. The r pointer never decreases. Since r increases by at most n total across all iterations, the total work is O(n).

Implementation

Why the concatenation trick works

When we form S = P + "#" + T, the Z-values in the T portion tell us exactly how many characters of T[i..] match a prefix of S — which is the same as a prefix of P (since "#" can’t appear in either, it caps all Z-values at m). If Z[m + 1 + i] == m, then T[i..i+m-1] == P[0..m-1] — a full match.

The "#" separator is critical: without it, the pattern characters might blend into the text characters and inflate Z-values spuriously.

Applications

String periodicity

A string s has period p if s[i] == s[i+p] for all valid i. The Z-array directly exposes this:

def find_period(s):
    Z = z_array(s)
    n = len(s)
    for p in range(1, n):
        if Z[p] >= n - p:    # S[p..n-1] matches S[0..n-p-1]
            return p
    return n                 # period is the full length (no repetition)

If Z[p] >= n - p, then every suffix of s starting at a multiple of p matches the prefix — the string has period p.

Counting distinct substrings with suffix arrays

The Z-array on each suffix can count how many of its characters are “new” (not overlapping with any shorter suffix match). This is the core idea behind the LCP array in suffix arrays — but that’s covered in Advanced Patterns.

Z vs KMP vs Rabin-Karp

PropertyKMPZ-algorithmRabin-Karp
PreprocessingFailure fn O(m)Z-array O(n+m)Hash O(m)
SearchO(n)O(n)O(n) avg
SpaceO(m)O(n+m)O(1)
Code complexityMediumLowLow
Multiple patternsNo (use A-C)No (use A-C)Yes (hash set)
Worst case guaranteeYesYesNo
Natural extensionAho-CorasickSuffix arraysDouble hashing

For a clean interview implementation of single-pattern search, Z-algorithm is often the quickest to write correctly. For the structural application (failure function, Aho-Corasick, palindrome tricks), KMP’s failure function is more useful.

Complexity

PhaseTimeSpace
Build Z-arrayO(n)O(n)
Search via concatenationO(n + m)O(n + m)

Quick check

In the Z-algorithm, what does the sentinel '#' in the concatenation P + '#' + T prevent?
In the Z-box maintenance, when i <= r and Z[i - l] < r - i + 1, why can we set Z[i] = Z[i - l] without any extension?
Why does the Z-algorithm run in O(n) even though the inner while loop can repeat multiple times per iteration?

Next: Advanced Patterns — Manacher’s O(n) palindrome algorithm, the Aho-Corasick multi-pattern machine, and a high-level sketch of suffix arrays.