🚀 Phases 1–5 are live — Days 1–17 cover the foundations and the algorithmic patterns. See the roadmap →
Day 27 - Advanced StringsAdvanced Patterns

Advanced String Patterns

The three algorithms on the previous pages (KMP, Rabin-Karp, Z-algorithm) cover single-pattern search in linear time. This page introduces three more techniques that come up in harder problems:

  1. Manacher’s algorithm — longest palindromic substring in O(n). The palindrome version of KMP’s “don’t restart from scratch.”
  2. Aho-Corasick — multi-pattern search in O(n + total pattern length + matches). KMP on a trie.
  3. Suffix arrays — a sorted array of all suffixes, enabling O(m log n) search for any pattern and O(n log n) construction. The data structure behind many “hard” string problems.

1. Manacher’s Algorithm

The palindrome problem

Find the longest palindromic substring of s.

The brute-force expand-around-center approach tries every center, expands while characters match, and is O(n²) (n centers × n expansion per center worst case). Manacher’s does it in O(n) using the same Z-box principle.

The mirror trick

The key insight: if you’re inside a palindrome, your mirror position around the palindrome’s center gives you a free match length. Specifically, if we’re computing the radius p[i] at center i, and i is inside a known palindrome centered at c with radius r (so c - r <= i <= c + r), then p[i] >= min(p[2*c - i], c + r - i).

  • 2*c - i is the mirror of i around c.
  • p[2*c - i] is the known palindrome radius at the mirror.
  • c + r - i is how much room we have inside the known palindrome.

We take the minimum because if the mirror’s palindrome extends beyond the current palindrome’s edge, we can only guarantee matches up to the edge.

The odd/even unification trick

To handle both odd-length and even-length palindromes with one pass, insert a separator character (e.g., '#') between every character and at both ends:

s = "abacaba"
T = "#a#b#a#c#a#b#a#"

Now every palindrome in T is odd-length (since we inserted characters between every pair), and an even-length palindrome in s corresponds to a palindrome centered on a '#'.

Implementation

Time: O(n). Space: O(n).

The argument is the same as the Z-algorithm: the r pointer never decreases, every expansion increments r by 1, and r is bounded by |T| = 2n + 1. Total expansions across all i: O(n).

Manacher’s is the string algorithm most frequently cited as “magic” in interviews. It’s not. Once you see the Z-box analogy — “if I’m inside a known palindrome, my mirror tells me a free radius” — the code follows directly. The #-insertion trick is just bookkeeping to avoid separate even/odd cases.

The problem

Given a text T and a set of patterns P1, P2, ..., Pk, find all occurrences of all patterns in T.

Running KMP separately for each pattern takes O(n × k + total_pattern_length) — linear in the text for each pattern. If k = 10,000, that’s expensive. Aho-Corasick does it in O(n + total_pattern_length + number_of_matches) — one pass over the text, finding all pattern occurrences simultaneously.

The idea

Build a trie of all patterns. Then build failure links on the trie (exactly KMP’s failure function, but on the trie structure). Each failure link points from a node back to the longest proper suffix of the current path that is also a prefix of some pattern.

During the search, run the text through the automaton: on each character, follow a trie edge if one exists, or follow failure links until you find a matching edge or reach the root. Every time you reach a node that marks a complete pattern, record a match.

The automaton processes each character in O(1) amortized (same argument as KMP). The total traversal is O(n).

from collections import deque
 
def build_aho_corasick(patterns):
    # Trie: each node is {char: child_idx}, plus fail and output
    trie = [{'fail': 0, 'output': []}]
    goto = [{}]
 
    for pid, p in enumerate(patterns):
        node = 0
        for ch in p:
            if ch not in goto[node]:
                goto.append({})
                trie.append({'fail': 0, 'output': []})
                goto[node][ch] = len(trie) - 1
            node = goto[node][ch]
        trie[node]['output'].append(pid)
 
    q = deque()
    for ch, child in goto[0].items():
        trie[child]['fail'] = 0
        q.append(child)
 
    while q:
        u = q.popleft()
        for ch, v in goto[u].items():
            fail = trie[u]['fail']
            while fail and ch not in goto[fail]:
                fail = trie[fail]['fail']
            trie[v]['fail'] = goto[fail].get(ch, 0) if fail else goto[0].get(ch, 0)
            if trie[v]['fail'] == v:
                trie[v]['fail'] = 0
            trie[v]['output'] += trie[trie[v]['fail']]['output']
            q.append(v)
 
    return trie, goto
 
def aho_corasick_search(T, patterns):
    trie, goto = build_aho_corasick(patterns)
    node, results = 0, []
    for i, ch in enumerate(T):
        while node and ch not in goto[node]:
            node = trie[node]['fail']
        node = goto[node].get(ch, 0)
        for pid in trie[node]['output']:
            results.append((i - len(patterns[pid]) + 1, pid))
    return results

Aho-Corasick is the algorithm behind Unix grep (especially fgrep) and many intrusion detection systems.

3. Suffix Arrays: a conceptual sketch

What is a suffix array?

For a string s of length n, the suffix array SA is a permutation of [0, 1, ..., n-1] such that s[SA[0]..] < s[SA[1]..] < ... < s[SA[n-1]..] — the suffixes sorted in lexicographic order.

For s = "banana":

RankSuffixStarting index
0a5
1ana3
2anana1
3banana0
4na4
5nana2

SA = [5, 3, 1, 0, 4, 2].

Pattern search with a suffix array

To find all occurrences of pattern P in s, binary search on the suffix array. Any occurrence of P in s is a prefix of some suffix — so the occurrences form a contiguous range in the sorted suffix array. Two binary searches (lower bound and upper bound) find that range in O(m log n).

Construction

Naïve construction (sort all suffixes by comparison) is O(n² log n) due to O(n) comparison cost. The standard O(n log n) algorithm uses prefix doubling: sort by pairs of length-1, then length-2, then length-4, etc., doubling the comparison key each round. After log n rounds, the full sort is done.

An O(n) construction (SA-IS algorithm) exists but is complex; the O(n log n) version is standard in competitive programming.

LCP array

The Longest Common Prefix (LCP) array LCP[i] stores the length of the longest common prefix between SA[i-1] and SA[i] (adjacent suffixes in sorted order). It can be computed in O(n) from the suffix array using the Kasai algorithm.

Together, the suffix array and LCP array answer many “substring queries” in O(n log n) total preprocessing and O(m log n) per query:

  • Find pattern in text
  • Count distinct substrings: n*(n+1)/2 - sum(LCP)
  • Longest repeated substring: max(LCP)
  • Longest common substring of two strings

For interview purposes, Rabin-Karp + binary search covers most suffix array applications. The suffix array itself appears in competitive programming more than in FAANG interviews.

Summary: when to use what

ProblemBest algorithm
Find one pattern in textKMP or Z-algorithm
Find pattern, need multi-pattern generalizationRabin-Karp
Find all patterns simultaneouslyAho-Corasick
Longest palindromic substringManacher’s
Duplicate substring detectionBinary search + Rabin-Karp
Many queries on the same textSuffix array + LCP
Competitive programming stringSuffix array + Aho-Corasick

Quick check

In Manacher's algorithm, what does the p[i] array represent?
What is Aho-Corasick's key advantage over running KMP independently for each of k patterns?
How does a suffix array enable O(m log n) pattern search (instead of the naïve O(nm))?

Next: Basic Questions — warm-up exercises to lock the string patterns into muscle memory.