22 / 22 · Tier C
Patterns / 22

String matching

Three tools, one job. Naive substring search is O(nm); for any non-trivial input that's a TLE. KMP and Z give you O(n + m) with a precomputed failure function. Rolling hash gives you O(1) substring comparison — fast on average, sometimes catastrophic on adversarial inputs. Knowing when to reach for which is the senior-string-problem skill.


1 · Why this pattern matters

The naive "for every position in text, compare against pattern" approach is O(nm). Fine for n, m ≤ 100. Anything larger and you're paying for repeated comparisons of overlapping prefixes. The whole point of KMP, Z, and rolling hash is that those repetitions can be precomputed.

Common LC shapes where these tools earn their keep:

  • LC 28 strStr. The textbook "find pattern in text" — both KMP and rolling hash are clean answers.
  • LC 459 Repeated Substring Pattern. "Is the string a repetition of a smaller substring?" The KMP failure function decides this in one expression.
  • LC 214 Shortest Palindrome. Build the shortest palindrome by prepending characters. Solved by KMP on s + '#' + reverse(s).
  • LC 1392 Longest Happy Prefix. "Longest proper prefix that is also a suffix." This is literally what the KMP failure function computes.
  • LC 1316 Distinct Echo Substrings. Find all substrings that are concatenations of a string with itself. Rolling hash for O(1) substring comparison.
The signal. Whenever the problem mentions "find pattern in text", "longest prefix that is also a suffix", "longest match starting at every position", or "compare many substrings", these tools are the answer. The naive approach passes small tests, dies on the hidden long-string case.

2 · How to recognise it

  • Naive substring search is too slow. Constraints say |s|, |t| ≤ 10⁵. O(nm) is 10¹⁰ ops — TLE.
  • You need ALL occurrences of pattern in text. Not just "does it appear" — KMP / Z find every match in O(n + m).
  • "Longest proper prefix that is also a suffix". Anything phrased this way is a direct KMP failure-function readout.
  • Compare many substrings of the same string. Rolling hash gives O(1) per comparison after O(n) preprocessing.
  • Palindrome anywhere in linear time. Manacher's algorithm — a string-matching cousin — finds all palindromic substrings in O(n).
  • Multiple patterns vs one text. Aho-Corasick generalises KMP to a trie of patterns; one pass through the text finds them all.
What it is not. If the question is "does X appear as a subsequence" (not contiguous), these tools don't apply. Subsequence matching is two-pointer / DP territory. KMP, Z, hash all assume contiguous matches.

3 · KMP — the failure function

Knuth–Morris–Pratt's insight: when a mismatch happens at position j in the pattern, you don't need to back up in the text. Some suffix of pattern[0..j−1] is already a prefix of pattern — slide the pattern forward by enough to align that prefix, and continue from there. The failure function fail[i] stores the length of the longest proper prefix of pattern[0..i] that is also a suffix.

"Proper" means strictly shorter than the substring itself. For the pattern "abab": fail = [0, 0, 1, 2]. The "ab" prefix is also a suffix of the whole thing — length 2.

// Build KMP failure function for pattern p.
// fail[i] = length of longest proper prefix of p[0..i] that is also a suffix.
func kmpFail(p string) []int {
    n := len(p)
    fail := make([]int, n)
    k := 0                          // length of current matched prefix
    for i := 1; i < n; i++ {
        // Roll back via fail-chain until either p[k] == p[i] or k == 0
        for k > 0 && p[k] != p[i] {
            k = fail[k-1]
        }
        if p[k] == p[i] {
            k++
        }
        fail[i] = k
    }
    return fail
}

The construction is amortised O(m). The inner for loop looks scary but the total decrement of k across all iterations is bounded by the total increment — and k increments at most once per outer step. Aggregate analysis gives O(m).

The roll-back chain. When p[k] != p[i], we jump k = fail[k-1]. This says "the previous best match ended at position k-1; the best shorter match is whatever fail[k-1] records". Following the chain fail[fail[k-1] - 1] etc. enumerates all proper prefix-suffix lengths in decreasing order. Useful for problems like LC 459 (repeated pattern) where you want the second-longest, not the longest.

4 · KMP — matching

Once you have fail for the pattern, matching against text t is the same loop with two strings instead of one. Maintain a counter j for how much of the pattern matches the current suffix of text seen so far. On a character match, advance j; on a mismatch, roll back j using fail until either you find a partial match or hit zero.

// Find all starting indices in t where p occurs.
func kmpSearch(t, p string) []int {
    if len(p) == 0 {
        return []int{0}
    }
    fail := kmpFail(p)
    var out []int
    j := 0
    for i := 0; i < len(t); i++ {
        for j > 0 && t[i] != p[j] {
            j = fail[j-1]           // fall back along the fail-chain
        }
        if t[i] == p[j] {
            j++
        }
        if j == len(p) {
            out = append(out, i-len(p)+1)
            j = fail[j-1]            // continue searching for overlapping matches
        }
    }
    return out
}

Total cost O(n + m): O(m) to build fail, O(n) to scan t. The text pointer i never decreases — that's the whole point. Naive search restarts at i + 1 on every mismatch; KMP just adjusts j and keeps reading.

Why the final j = fail[j-1]. After reporting a match at i − m + 1, the algorithm needs to keep searching for overlapping matches. Setting j = fail[m−1] restarts the matcher in a state equivalent to "we've already matched the suffix of the pattern that equals its longest prefix". For pattern "aaa" in text "aaaaa", this finds three overlapping matches at indices 0, 1, 2.

5 · Z-algorithm

The Z-array gives, for every position i in a string, z[i] = length of the longest substring starting at i that matches a prefix of the string. By convention z[0] = n (or 0; the spec varies). The Z-array gives you the same matching power as KMP, often in fewer lines.

The classic Z computation maintains a "Z-box" [l, r] — the rightmost prefix-match-window seen so far. For each new i, if i ≤ r, copy the answer from z[i − l] as a head-start; then extend by comparing characters past r. Amortised O(n).

// Build the Z-array. z[i] = longest prefix of s matching s[i..].
func zArray(s string) []int {
    n := len(s)
    z := make([]int, n)
    if n == 0 {
        return z
    }
    z[0] = n
    l, r := 0, 0
    for i := 1; i < n; i++ {
        if i < r {
            z[i] = min(r-i, z[i-l])  // head-start from the Z-box
        }
        for i+z[i] < n && s[z[i]] == s[i+z[i]] {
            z[i]++                   // extend
        }
        if i+z[i] > r {
            l, r = i, i+z[i]         // new rightmost Z-box
        }
    }
    return z
}

func min(a, b int) int { if a < b { return a }; return b }

To find all matches of pattern p in text t, build z on the string p + '#' + t where '#' is any character not in either. Then for every i > len(p) with z[i] == len(p), you've found a match. The separator prevents the Z-box from running past the pattern into the text region during construction.

6 · KMP vs Z — when to reach for which

Both are O(n + m) and either solves the basic "find pattern in text" problem. The differences are about what the auxiliary array also tells you.

Use caseReach forWhy
Match-anywhere searchZ (on p + '#' + t)Z directly tells you match positions; one pass, no separate matcher.
"Longest proper prefix = suffix"KMP failure functionThe failure function IS the answer. fail[n−1] is it.
Stream of text, fixed patternKMPState j is updated incrementally as text arrives; no need to buffer.
Period of a stringKMP failure functionn − fail[n−1] is the minimal period iff it divides n.
"Every position's longest prefix match"ZZ gives this directly; KMP would require reversing roles.
Multi-pattern matchingAho-Corasick (KMP generalisation)Build a trie of patterns + failure links; one O(n + Σm) pass over text.
"Is s a rotation of t?"Either: KMP-search for s in t + tIf |s| == |t| and KMP finds a match, yes. O(n).
Default to Z for one-off matching. The Z-algorithm is shorter to write and easier to reason about for "find all occurrences" problems. Default to KMP for "longest prefix-suffix" questions and for any problem where you need to interpret the failure function itself (period detection, repeated pattern, shortest palindrome).

7 · Rolling hash (polynomial)

Treat a string as a base-B number modulo a large prime P. Precompute prefix hashes h[0..n] where h[i] is the hash of s[0..i−1]. Then the hash of any substring s[l..r] is (h[r+1] − h[l] · B^(r−l+1)) mod P, computable in O(1).

This buys you O(1) substring equality (subject to collisions). Two substrings are equal if their hashes match — almost certainly. To make "almost" close enough to "certainly" on adversarial inputs, use two independent hash functions (different bases and primes) and consider substrings equal only if both hashes match. Collision probability drops to about 1/P² — astronomically small for P ≈ 10⁹.

// Polynomial rolling hash with double-hash to dodge adversarial collisions.
const (
    B1 = 131
    M1 = 1_000_000_007
    B2 = 137
    M2 = 998_244_353
)

type RollingHash struct {
    h1, h2  []int64       // prefix hashes
    p1, p2  []int64       // base powers
}

func NewRollingHash(s string) *RollingHash {
    n := len(s)
    rh := &RollingHash{
        h1: make([]int64, n+1),
        h2: make([]int64, n+1),
        p1: make([]int64, n+1),
        p2: make([]int64, n+1),
    }
    rh.p1[0], rh.p2[0] = 1, 1
    for i := 0; i < n; i++ {
        rh.h1[i+1] = (rh.h1[i]*B1 + int64(s[i])) % M1
        rh.h2[i+1] = (rh.h2[i]*B2 + int64(s[i])) % M2
        rh.p1[i+1] = (rh.p1[i] * B1) % M1
        rh.p2[i+1] = (rh.p2[i] * B2) % M2
    }
    return rh
}

// Hash of s[l..r] inclusive, returned as a (h1, h2) pair.
func (rh *RollingHash) Hash(l, r int) (int64, int64) {
    a := (rh.h1[r+1] - rh.h1[l]*rh.p1[r-l+1]) % M1
    if a < 0 {
        a += M1
    }
    b := (rh.h2[r+1] - rh.h2[l]*rh.p2[r-l+1]) % M2
    if b < 0 {
        b += M2
    }
    return a, b
}

// Equal returns true iff s[l1..r1] == s[l2..r2] by double-hash comparison.
func (rh *RollingHash) Equal(l1, r1, l2, r2 int) bool {
    if r1-l1 != r2-l2 {
        return false
    }
    a1, b1 := rh.Hash(l1, r1)
    a2, b2 := rh.Hash(l2, r2)
    return a1 == a2 && b1 == b2
}
Why double-hash. Single-hash with a known base/mod is breakable — there exist adversarial inputs that force collisions (Codeforces has had problems specifically designed to break single 10⁹+7 hash). Double-hash with two coprime large primes and randomised bases is effectively uncrackable; cumulative collision probability over 10⁶ comparisons stays below 10⁻⁹. Pick bases at runtime if you really care.

The Rabin-Karp algorithm is rolling hash plus a sliding window over the text — at each position, compute the hash of text[i..i+m−1] and compare to the pattern hash. With prefix hashes this is the Hash(i, i+m−1) call; the "rolling" part (subtract leaving char, add entering char) is an implementation detail you only need if you can't precompute.

8 · Worked example 1 — LC 28 strStr (easy)

Problem. Implement strStr: return the index of the first occurrence of needle in haystack, or −1.

The textbook substring search. Two clean answers — KMP and rolling hash. Both O(n + m). Pick whichever you remember the template for.

KMP solution

func strStr(haystack, needle string) int {
    if len(needle) == 0 {
        return 0
    }
    fail := kmpFail(needle)
    j := 0
    for i := 0; i < len(haystack); i++ {
        for j > 0 && haystack[i] != needle[j] {
            j = fail[j-1]
        }
        if haystack[i] == needle[j] {
            j++
        }
        if j == len(needle) {
            return i - len(needle) + 1
        }
    }
    return -1
}

Rolling-hash solution

func strStr(haystack, needle string) int {
    n, m := len(haystack), len(needle)
    if m == 0 {
        return 0
    }
    if m > n {
        return -1
    }
    rh := NewRollingHash(haystack)
    rhN := NewRollingHash(needle)
    tgtA, tgtB := rhN.Hash(0, m-1)
    for i := 0; i+m <= n; i++ {
        a, b := rh.Hash(i, i+m-1)
        if a == tgtA && b == tgtB {
            return i
        }
    }
    return -1
}

Both run in O(n + m). The rolling-hash version is shorter and easier to debug; the KMP version is unaffected by adversarial inputs. For an interview, write KMP if asked about worst-case complexity; write rolling hash if asked for "simplest correct" with average-case good enough.

9 · Worked example 2 — LC 1392 Longest Happy Prefix (medium)

Problem. Given a string s, return the longest string that is both a proper prefix and a proper suffix of s. Return empty string if none exists.

"Proper prefix = suffix" is literally the definition of the KMP failure function at the last position. fail[n−1] gives the length; slice s[:fail[n−1]] gives the substring. Two lines after the build.

Steps

  1. Build fail for s. O(n).
  2. Read off fail[n − 1]: the length of the longest proper prefix of s that is also a suffix.
  3. Return s[:fail[n − 1]].
func longestPrefix(s string) string {
    n := len(s)
    if n == 0 {
        return ""
    }
    fail := kmpFail(s)
    return s[:fail[n-1]]
}

Total O(n). The whole problem is a one-line readout once you've internalised what the failure function means. This is why the KMP construction is worth memorising — half a dozen LC problems collapse to a single index access on fail.

The rolling-hash alternative. You could also iterate k from n − 1 down to 1 and check whether hash(s[:k]) == hash(s[n−k:]). Stop at the first match. Same answer, average O(n) (you usually hit a match early), but worst-case O(n²) and risks collisions. KMP is the right tool.

10 · Worked example 3 — LC 214 Shortest Palindrome (hard)

Problem. Given a string s, prepend the fewest characters in front to make it a palindrome. Return the resulting palindrome.

The trick: find the longest palindromic prefix of s. Everything after that prefix is what needs to be mirrored and prepended. So the question reduces to "what's the longest prefix of s that equals its own reverse?"

Build the string s + '#' + reverse(s) and run KMP on it. The failure function's last value gives exactly the longest prefix of s that matches some suffix of reverse(s) — which is the longest palindromic prefix of s. The separator '#' prevents the match from running past the boundary.

Steps

  1. Construct combined = s + "#" + reverse(s). The separator must be a character not in s.
  2. Compute fail for combined. Read k = fail[len(combined) − 1] — the length of the longest palindromic prefix of s.
  3. The tail of s not covered by this prefix is s[k:]. Prepend its reverse to s: reverse(s[k:]) + s.
func shortestPalindrome(s string) string {
    if len(s) == 0 {
        return s
    }
    rev := reverse(s)
    combined := s + "#" + rev
    fail := kmpFail(combined)
    k := fail[len(combined)-1]      // longest palindromic prefix length
    return reverse(s[k:]) + s
}

func reverse(s string) string {
    b := []byte(s)
    for i, j := 0, len(b)-1; i < j; i, j = i+1, j-1 {
        b[i], b[j] = b[j], b[i]
    }
    return string(b)
}

Total O(n). The clever bit is reframing "longest palindromic prefix" as "longest prefix of s that is also a suffix of reverse(s)" — which is exactly the KMP failure function on the concatenation.

Why the separator matters. Without '#', the KMP failure function might find matches that span the boundary between s and reverse(s). For s = "aa", s + reverse(s) = "aaaa" would give fail[3] = 3 — the whole string — which is wrong. The separator forces the match to terminate at the boundary, giving fail[len − 1] ≤ len(s) always.

11 · Sister algorithms

KMP, Z, and hash are the workhorses. A few neighbours come up in harder problems — worth knowing by name and shape.

Manacher's algorithm — all palindromes in O(n).
Finds the longest palindromic substring centred at every position. Insert separators ('#') between characters to unify odd-length and even-length palindromes; maintain a "right boundary" similar to Z's Z-box. The whole string's palindromic structure in linear time. Used in LC 5, LC 647, LC 1960. Cleaner than Z + DP, faster than DP, but the constant factor and the index gymnastics make it the algorithm most people skip in interviews.
Aho-Corasick — multi-pattern matching.
KMP generalised to a set of patterns. Build a trie of the patterns; add KMP-style failure links pointing to the longest proper suffix that's also a prefix of some pattern. One O(n + Σm + occurrences) pass over the text finds all matches of any pattern. Used in grep -F, intrusion detection (Snort), and DNA pattern banks. Standard answer to "given a dictionary of forbidden words, find them all in this document".
Suffix array + LCP.
Sort all suffixes of a string. The suffix array gives you a sorted list of starting indices; the LCP (longest common prefix) array gives the prefix length between adjacent suffixes. Together they answer "longest repeated substring", "number of distinct substrings", "longest common substring of two strings". Build in O(n log n) via DC3 or O(n) via SA-IS. Heavyweight, but many CP problems collapse to "compute SA, then do a sweep".
Suffix automaton.
The minimal DFA recognising all substrings of a string. Build in O(n), supports "is substring", "count occurrences of substring", "k-th distinct substring", and "longest common substring of multiple strings". The hardest of these to implement from memory, but its operations are O(|query|), the asymptotic optimum.
Suffix tree.
The trie of all suffixes of a string, with edge labels compressed. Same problems as suffix array, often the same complexity, more pointer-heavy. Mostly historical now — suffix array + LCP wins on cache behaviour and implementation simplicity.

12 · How to solve hard problems step-by-step

A meta-process for any string-matching problem. Four steps.

  1. Frame the question. Is it "find pattern in text", "longest prefix that is also a suffix", "all palindromes", or "compare many substrings"? The frame chooses the algorithm.
  2. Pick the tool. Pattern-vs-text → KMP or Z (or hash if you trust collisions). Prefix-suffix → KMP failure function. Palindromes → Manacher. Multiple substring comparisons in the same string → rolling hash with prefix-hash preprocessing.
  3. Construct the auxiliary array. KMP needs fail on the pattern (or on a pattern + separator + text). Z needs z on the same combined string. Rolling hash needs h[] and pow[] on the source string.
  4. Defend against the failure mode. KMP: roll-back edge cases (zero-length pattern, all-equal characters). Z: index-zero convention, Z-box boundary. Hash: collisions — use double-hash unless the problem is small. Manacher: index conversion between original and separator-padded string.
The reframing skill. Many string problems don't immediately look like substring search but become trivial after reframing. "Shortest palindrome by prepending" → "longest palindromic prefix" → "KMP on s + '#' + reverse(s)". "Is t a rotation of s?" → "is t a substring of s + s?". "Find a period of length k" → "is the suffix of length n − k a prefix?". Recognising the reframe is the senior skill; the implementation is rote.

13 · Complexity & when not to use

AlgorithmPreprocessMatchSpaceWhen to skip
NaiveO(nm)O(1)Anything past n, m ≈ 10³.
KMPO(m)O(n)O(m)Multi-pattern: use Aho-Corasick.
ZO(n + m)O(n + m)O(n + m)Streaming text: use KMP (stateful).
Rolling hash (single)O(n)O(n) avg, O(nm) worstO(n)Adversarial inputs: use double-hash or KMP.
Rolling hash (double)O(n)O(n) avg, collisions ~10⁻¹⁸O(n)Hostile inputs: still possible, use 3 hashes or KMP.
ManacherO(n)O(n)O(n)Need only one palindrome check: simpler DP/two-pointer wins.
Aho-CorasickO(Σm)O(n + occurrences)O(Σm · σ)One pattern: use KMP. Σ ≪ patterns: trie blows up.
Suffix array + LCPO(n log n) (or O(n))O(log n) per substring queryO(n)Online updates: not supported.

Notes. Rolling-hash worst case is O(nm) per query only if every hash compare collides — astronomically unlikely with double-hash. The realistic concern is adversarial inputs constructed to force collisions; competitive-programming judges occasionally have these. For ML/parsing/grep-like workloads, single-hash is fine.

14 · What breaks

  • Single-hash collisions on adversarial inputs. Known-base, known-mod single hash can be cracked offline; CF and judge writers do it. Symptom: passes 90% of tests, fails the last 2 with "wrong answer". Fix: double-hash, randomised base, or switch to KMP.
  • KMP failure-function off-by-one. Mixing fail[i] with fail[i − 1] in the recursion. The roll-back is k = fail[k − 1]; the value at fail[i] is set after the inner loop. Easy to flip; symptom is the matcher reporting one-off positions.
  • Z-array boundary mishandling. Convention varies: some references set z[0] = n, others set z[0] = 0. Match whichever your downstream code expects. Symptom: pattern matching includes the trivial "the string starts with itself" match.
  • Very large alphabets. Aho-Corasick stores a transition array per node, sized to the alphabet. For Unicode (σ = 10⁶+), this blows up to gigabytes. Use a map per node, or fall back to suffix-automaton.
  • Hash modulus overflow. If B · h + s[i] overflows int64 before the mod, you get garbage. Either pick smaller bases or use Go's int (64-bit) with mod taken between multiplications. Worse in Java/C++ without unsigned types.
  • Separator clash. Concatenations like p + '#' + t assume '#' isn't in either string. If the input alphabet includes '#', pick a character outside the alphabet (e.g. byte 255) or a multi-character sentinel.
  • Rolling-hash sign-flip. h[r+1] − h[l] · pow[r − l + 1] can go negative. Always add mod and re-mod before comparing.
  • Manacher index conversion. The transformed string with separators is 2n + 1 long. Converting a palindrome radius back to original-string coordinates is the part everyone gets wrong on the first attempt.

15 · Where this gets asked

CompanyCommon framingWhy it fits
GooglestrStr (LC 28), Shortest Palindrome (LC 214), Repeated Substring Pattern (LC 459)KMP is a recurring Google L5 signal — they want to see the failure-function construction live.
MetaLongest Happy Prefix (LC 1392), Distinct Echo Substrings (LC 1316), Find All Anagrams (LC 438)Search-index and recommendation infra are heavy on pattern-matching; KMP + rolling hash come up in onsite rounds.
AmazonImplement strStr (LC 28), Repeated DNA Sequences (LC 187), Longest Duplicate Substring (LC 1044)Substring search on log files / DNA-style sequences. LC 1044 is the canonical "binary search on length + rolling hash" combo.
MicrosoftWord Search (LC 79), Stream of Characters (LC 1032), Palindrome Pairs (LC 336)Trie + Aho-Corasick style problems frequent in editor / search rounds.
Bloomberg / Two SigmaSubstring with concatenation, fuzzy matching, log-line classificationFinancial-news ingestion uses high-throughput substring search; rolling hash and Aho-Corasick are both in the production toolbox.
Pattern of patterns. Three sub-shapes — (1) "find pattern in text" → KMP or Z or hash; (2) "longest prefix that is also a suffix" → KMP failure function; (3) "compare many substrings of the same string" → rolling hash with prefix-hash table. The fourth, if the problem demands it: (4) "all palindromes" → Manacher. Recognise which, write the auxiliary array from memory, and the rest is indexing.

16 · Further reading

  • CP Algorithms — String Algorithms. The reference. Prefix-function (KMP), Z-function, polynomial hashing, Manacher, suffix array, suffix automaton — all with code. cp-algorithms.com/string/.
  • Steven Skiena. The Algorithm Design Manual, Chapter 18. The most readable textbook treatment of string matching, with implementation tips. The "war stories" sections are gold.
  • Udi Manber. "A text compression scheme that allows fast searching directly in the compressed file" (1993). The original paper for many of the ideas later refined into modern suffix-array work.
  • Aho & Corasick. "Efficient string matching: an aid to bibliographic search" (1975). The original Aho-Corasick paper. Short and surprisingly readable.
  • Knuth, Morris & Pratt. "Fast pattern matching in strings" (1977). The KMP paper. Worth reading once to see the failure-function insight derived from first principles.
  • Adjacent: trie pattern. The data structure behind Aho-Corasick and most multi-pattern matching.
  • Adjacent: back to patterns index.
Found this useful?