String matching
Three tools, one job. Naive substring search is O(nm); for any non-trivial input that's a TLE. KMP and Z give you O(n + m) with a precomputed failure function. Rolling hash gives you O(1) substring comparison — fast on average, sometimes catastrophic on adversarial inputs. Knowing when to reach for which is the senior-string-problem skill.
1 · Why this pattern matters
The naive "for every position in text, compare against
pattern" approach is O(nm). Fine for n, m ≤ 100. Anything
larger and you're paying for repeated comparisons of overlapping
prefixes. The whole point of KMP, Z, and rolling hash is that those
repetitions can be precomputed.
Common LC shapes where these tools earn their keep:
- LC 28 strStr. The textbook "find pattern in text" — both KMP and rolling hash are clean answers.
- LC 459 Repeated Substring Pattern. "Is the string a repetition of a smaller substring?" The KMP failure function decides this in one expression.
- LC 214 Shortest Palindrome. Build the shortest palindrome by prepending characters. Solved by KMP on
s + '#' + reverse(s). - LC 1392 Longest Happy Prefix. "Longest proper prefix that is also a suffix." This is literally what the KMP failure function computes.
- LC 1316 Distinct Echo Substrings. Find all substrings that are concatenations of a string with itself. Rolling hash for O(1) substring comparison.
2 · How to recognise it
- Naive substring search is too slow. Constraints say
|s|, |t| ≤ 10⁵. O(nm) is 10¹⁰ ops — TLE. - You need ALL occurrences of pattern in text. Not just "does it appear" — KMP / Z find every match in O(n + m).
- "Longest proper prefix that is also a suffix". Anything phrased this way is a direct KMP failure-function readout.
- Compare many substrings of the same string. Rolling hash gives O(1) per comparison after O(n) preprocessing.
- Palindrome anywhere in linear time. Manacher's algorithm — a string-matching cousin — finds all palindromic substrings in O(n).
- Multiple patterns vs one text. Aho-Corasick generalises KMP to a trie of patterns; one pass through the text finds them all.
3 · KMP — the failure function
Knuth–Morris–Pratt's insight: when a mismatch happens at position
j in the pattern, you don't need to back up in the text.
Some suffix of pattern[0..j−1] is already a prefix of
pattern — slide the pattern forward by enough to align that
prefix, and continue from there. The failure function
fail[i] stores the length of the longest proper prefix of
pattern[0..i] that is also a suffix.
"Proper" means strictly shorter than the substring itself. For the
pattern "abab": fail = [0, 0, 1, 2]. The "ab"
prefix is also a suffix of the whole thing — length 2.
// Build KMP failure function for pattern p.
// fail[i] = length of longest proper prefix of p[0..i] that is also a suffix.
func kmpFail(p string) []int {
n := len(p)
fail := make([]int, n)
k := 0 // length of current matched prefix
for i := 1; i < n; i++ {
// Roll back via fail-chain until either p[k] == p[i] or k == 0
for k > 0 && p[k] != p[i] {
k = fail[k-1]
}
if p[k] == p[i] {
k++
}
fail[i] = k
}
return fail
}The construction is amortised O(m). The inner for loop
looks scary but the total decrement of k across all
iterations is bounded by the total increment — and k
increments at most once per outer step. Aggregate analysis gives O(m).
p[k] != p[i],
we jump k = fail[k-1]. This says "the previous best match
ended at position k-1; the best shorter match is
whatever fail[k-1] records". Following the chain
fail[fail[k-1] - 1] etc. enumerates all proper
prefix-suffix lengths in decreasing order. Useful for problems like
LC 459 (repeated pattern) where you want the second-longest, not the
longest.4 · KMP — matching
Once you have fail for the pattern, matching against text
t is the same loop with two strings instead of one. Maintain
a counter j for how much of the pattern matches the current
suffix of text seen so far. On a character match, advance j;
on a mismatch, roll back j using fail until
either you find a partial match or hit zero.
// Find all starting indices in t where p occurs.
func kmpSearch(t, p string) []int {
if len(p) == 0 {
return []int{0}
}
fail := kmpFail(p)
var out []int
j := 0
for i := 0; i < len(t); i++ {
for j > 0 && t[i] != p[j] {
j = fail[j-1] // fall back along the fail-chain
}
if t[i] == p[j] {
j++
}
if j == len(p) {
out = append(out, i-len(p)+1)
j = fail[j-1] // continue searching for overlapping matches
}
}
return out
}Total cost O(n + m): O(m) to build fail, O(n) to scan
t. The text pointer i never decreases —
that's the whole point. Naive search restarts at i + 1 on
every mismatch; KMP just adjusts j and keeps reading.
j = fail[j-1]. After
reporting a match at i − m + 1, the algorithm needs to keep
searching for overlapping matches. Setting j = fail[m−1]
restarts the matcher in a state equivalent to "we've already matched
the suffix of the pattern that equals its longest prefix". For
pattern "aaa" in text "aaaaa", this finds
three overlapping matches at indices 0, 1, 2.5 · Z-algorithm
The Z-array gives, for every position i in a string,
z[i] = length of the longest substring starting at
i that matches a prefix of the string. By convention
z[0] = n (or 0; the spec varies). The Z-array gives you
the same matching power as KMP, often in fewer lines.
The classic Z computation maintains a "Z-box" [l, r] — the
rightmost prefix-match-window seen so far. For each new i,
if i ≤ r, copy the answer from z[i − l] as a
head-start; then extend by comparing characters past r.
Amortised O(n).
// Build the Z-array. z[i] = longest prefix of s matching s[i..].
func zArray(s string) []int {
n := len(s)
z := make([]int, n)
if n == 0 {
return z
}
z[0] = n
l, r := 0, 0
for i := 1; i < n; i++ {
if i < r {
z[i] = min(r-i, z[i-l]) // head-start from the Z-box
}
for i+z[i] < n && s[z[i]] == s[i+z[i]] {
z[i]++ // extend
}
if i+z[i] > r {
l, r = i, i+z[i] // new rightmost Z-box
}
}
return z
}
func min(a, b int) int { if a < b { return a }; return b }To find all matches of pattern p in text t,
build z on the string p + '#' + t where
'#' is any character not in either. Then for every
i > len(p) with z[i] == len(p), you've
found a match. The separator prevents the Z-box from running past the
pattern into the text region during construction.
6 · KMP vs Z — when to reach for which
Both are O(n + m) and either solves the basic "find pattern in text" problem. The differences are about what the auxiliary array also tells you.
| Use case | Reach for | Why |
|---|---|---|
| Match-anywhere search | Z (on p + '#' + t) | Z directly tells you match positions; one pass, no separate matcher. |
| "Longest proper prefix = suffix" | KMP failure function | The failure function IS the answer. fail[n−1] is it. |
| Stream of text, fixed pattern | KMP | State j is updated incrementally as text arrives; no need to buffer. |
| Period of a string | KMP failure function | n − fail[n−1] is the minimal period iff it divides n. |
| "Every position's longest prefix match" | Z | Z gives this directly; KMP would require reversing roles. |
| Multi-pattern matching | Aho-Corasick (KMP generalisation) | Build a trie of patterns + failure links; one O(n + Σm) pass over text. |
| "Is s a rotation of t?" | Either: KMP-search for s in t + t | If |s| == |t| and KMP finds a match, yes. O(n). |
7 · Rolling hash (polynomial)
Treat a string as a base-B number modulo a large prime
P. Precompute prefix hashes h[0..n] where
h[i] is the hash of s[0..i−1]. Then the hash
of any substring s[l..r] is
(h[r+1] − h[l] · B^(r−l+1)) mod P, computable in O(1).
This buys you O(1) substring equality (subject to collisions). Two substrings are equal if their hashes match — almost certainly. To make "almost" close enough to "certainly" on adversarial inputs, use two independent hash functions (different bases and primes) and consider substrings equal only if both hashes match. Collision probability drops to about 1/P² — astronomically small for P ≈ 10⁹.
// Polynomial rolling hash with double-hash to dodge adversarial collisions.
const (
B1 = 131
M1 = 1_000_000_007
B2 = 137
M2 = 998_244_353
)
type RollingHash struct {
h1, h2 []int64 // prefix hashes
p1, p2 []int64 // base powers
}
func NewRollingHash(s string) *RollingHash {
n := len(s)
rh := &RollingHash{
h1: make([]int64, n+1),
h2: make([]int64, n+1),
p1: make([]int64, n+1),
p2: make([]int64, n+1),
}
rh.p1[0], rh.p2[0] = 1, 1
for i := 0; i < n; i++ {
rh.h1[i+1] = (rh.h1[i]*B1 + int64(s[i])) % M1
rh.h2[i+1] = (rh.h2[i]*B2 + int64(s[i])) % M2
rh.p1[i+1] = (rh.p1[i] * B1) % M1
rh.p2[i+1] = (rh.p2[i] * B2) % M2
}
return rh
}
// Hash of s[l..r] inclusive, returned as a (h1, h2) pair.
func (rh *RollingHash) Hash(l, r int) (int64, int64) {
a := (rh.h1[r+1] - rh.h1[l]*rh.p1[r-l+1]) % M1
if a < 0 {
a += M1
}
b := (rh.h2[r+1] - rh.h2[l]*rh.p2[r-l+1]) % M2
if b < 0 {
b += M2
}
return a, b
}
// Equal returns true iff s[l1..r1] == s[l2..r2] by double-hash comparison.
func (rh *RollingHash) Equal(l1, r1, l2, r2 int) bool {
if r1-l1 != r2-l2 {
return false
}
a1, b1 := rh.Hash(l1, r1)
a2, b2 := rh.Hash(l2, r2)
return a1 == a2 && b1 == b2
}10⁹+7 hash). Double-hash with two coprime large primes
and randomised bases is effectively uncrackable; cumulative collision
probability over 10⁶ comparisons stays below 10⁻⁹. Pick bases at
runtime if you really care.The Rabin-Karp algorithm is rolling hash plus a sliding window over the
text — at each position, compute the hash of text[i..i+m−1]
and compare to the pattern hash. With prefix hashes this is the
Hash(i, i+m−1) call; the "rolling" part (subtract leaving
char, add entering char) is an implementation detail you only need if
you can't precompute.
8 · Worked example 1 — LC 28 strStr (easy)
Problem. Implement strStr: return the
index of the first occurrence of needle in
haystack, or −1.
The textbook substring search. Two clean answers — KMP and rolling hash. Both O(n + m). Pick whichever you remember the template for.
KMP solution
func strStr(haystack, needle string) int {
if len(needle) == 0 {
return 0
}
fail := kmpFail(needle)
j := 0
for i := 0; i < len(haystack); i++ {
for j > 0 && haystack[i] != needle[j] {
j = fail[j-1]
}
if haystack[i] == needle[j] {
j++
}
if j == len(needle) {
return i - len(needle) + 1
}
}
return -1
}Rolling-hash solution
func strStr(haystack, needle string) int {
n, m := len(haystack), len(needle)
if m == 0 {
return 0
}
if m > n {
return -1
}
rh := NewRollingHash(haystack)
rhN := NewRollingHash(needle)
tgtA, tgtB := rhN.Hash(0, m-1)
for i := 0; i+m <= n; i++ {
a, b := rh.Hash(i, i+m-1)
if a == tgtA && b == tgtB {
return i
}
}
return -1
}Both run in O(n + m). The rolling-hash version is shorter and easier to debug; the KMP version is unaffected by adversarial inputs. For an interview, write KMP if asked about worst-case complexity; write rolling hash if asked for "simplest correct" with average-case good enough.
9 · Worked example 2 — LC 1392 Longest Happy Prefix (medium)
Problem. Given a string s, return the
longest string that is both a proper prefix and a proper suffix of
s. Return empty string if none exists.
"Proper prefix = suffix" is literally the definition of the KMP
failure function at the last position. fail[n−1] gives
the length; slice s[:fail[n−1]] gives the substring.
Two lines after the build.
Steps
- Build
failfors. O(n). - Read off
fail[n − 1]: the length of the longest proper prefix ofsthat is also a suffix. - Return
s[:fail[n − 1]].
func longestPrefix(s string) string {
n := len(s)
if n == 0 {
return ""
}
fail := kmpFail(s)
return s[:fail[n-1]]
}Total O(n). The whole problem is a one-line readout once you've
internalised what the failure function means. This is why the KMP
construction is worth memorising — half a dozen LC problems collapse
to a single index access on fail.
k from n − 1 down to 1 and check
whether hash(s[:k]) == hash(s[n−k:]). Stop at the first
match. Same answer, average O(n) (you usually hit a match early), but
worst-case O(n²) and risks collisions. KMP is the right tool.10 · Worked example 3 — LC 214 Shortest Palindrome (hard)
Problem. Given a string s, prepend the
fewest characters in front to make it a palindrome. Return the
resulting palindrome.
The trick: find the longest palindromic prefix of
s. Everything after that prefix is what needs to be
mirrored and prepended. So the question reduces to "what's the longest
prefix of s that equals its own reverse?"
Build the string s + '#' + reverse(s) and run KMP on it.
The failure function's last value gives exactly the longest prefix of
s that matches some suffix of reverse(s) —
which is the longest palindromic prefix of s. The
separator '#' prevents the match from running past the
boundary.
Steps
- Construct
combined = s + "#" + reverse(s). The separator must be a character not ins. - Compute
failforcombined. Readk = fail[len(combined) − 1]— the length of the longest palindromic prefix ofs. - The tail of
snot covered by this prefix iss[k:]. Prepend its reverse tos:reverse(s[k:]) + s.
func shortestPalindrome(s string) string {
if len(s) == 0 {
return s
}
rev := reverse(s)
combined := s + "#" + rev
fail := kmpFail(combined)
k := fail[len(combined)-1] // longest palindromic prefix length
return reverse(s[k:]) + s
}
func reverse(s string) string {
b := []byte(s)
for i, j := 0, len(b)-1; i < j; i, j = i+1, j-1 {
b[i], b[j] = b[j], b[i]
}
return string(b)
}Total O(n). The clever bit is reframing "longest palindromic prefix"
as "longest prefix of s that is also a suffix of
reverse(s)" — which is exactly the KMP failure function on
the concatenation.
'#',
the KMP failure function might find matches that span the boundary
between s and reverse(s). For
s = "aa", s + reverse(s) = "aaaa" would give
fail[3] = 3 — the whole string — which is wrong. The
separator forces the match to terminate at the boundary, giving
fail[len − 1] ≤ len(s) always.11 · Sister algorithms
KMP, Z, and hash are the workhorses. A few neighbours come up in harder problems — worth knowing by name and shape.
- Manacher's algorithm — all palindromes in O(n).
- Finds the longest palindromic substring centred at every position. Insert separators (
'#') between characters to unify odd-length and even-length palindromes; maintain a "right boundary" similar to Z's Z-box. The whole string's palindromic structure in linear time. Used in LC 5, LC 647, LC 1960. Cleaner than Z + DP, faster than DP, but the constant factor and the index gymnastics make it the algorithm most people skip in interviews. - Aho-Corasick — multi-pattern matching.
- KMP generalised to a set of patterns. Build a trie of the patterns; add KMP-style failure links pointing to the longest proper suffix that's also a prefix of some pattern. One O(n + Σm + occurrences) pass over the text finds all matches of any pattern. Used in
grep -F, intrusion detection (Snort), and DNA pattern banks. Standard answer to "given a dictionary of forbidden words, find them all in this document". - Suffix array + LCP.
- Sort all suffixes of a string. The suffix array gives you a sorted list of starting indices; the LCP (longest common prefix) array gives the prefix length between adjacent suffixes. Together they answer "longest repeated substring", "number of distinct substrings", "longest common substring of two strings". Build in O(n log n) via DC3 or O(n) via SA-IS. Heavyweight, but many CP problems collapse to "compute SA, then do a sweep".
- Suffix automaton.
- The minimal DFA recognising all substrings of a string. Build in O(n), supports "is substring", "count occurrences of substring", "k-th distinct substring", and "longest common substring of multiple strings". The hardest of these to implement from memory, but its operations are O(|query|), the asymptotic optimum.
- Suffix tree.
- The trie of all suffixes of a string, with edge labels compressed. Same problems as suffix array, often the same complexity, more pointer-heavy. Mostly historical now — suffix array + LCP wins on cache behaviour and implementation simplicity.
12 · How to solve hard problems step-by-step
A meta-process for any string-matching problem. Four steps.
- Frame the question. Is it "find pattern in text", "longest prefix that is also a suffix", "all palindromes", or "compare many substrings"? The frame chooses the algorithm.
- Pick the tool. Pattern-vs-text → KMP or Z (or hash if you trust collisions). Prefix-suffix → KMP failure function. Palindromes → Manacher. Multiple substring comparisons in the same string → rolling hash with prefix-hash preprocessing.
- Construct the auxiliary array. KMP needs
failon the pattern (or on a pattern + separator + text). Z needszon the same combined string. Rolling hash needsh[]andpow[]on the source string. - Defend against the failure mode. KMP: roll-back edge cases (zero-length pattern, all-equal characters). Z: index-zero convention, Z-box boundary. Hash: collisions — use double-hash unless the problem is small. Manacher: index conversion between original and separator-padded string.
13 · Complexity & when not to use
| Algorithm | Preprocess | Match | Space | When to skip |
|---|---|---|---|---|
| Naive | — | O(nm) | O(1) | Anything past n, m ≈ 10³. |
| KMP | O(m) | O(n) | O(m) | Multi-pattern: use Aho-Corasick. |
| Z | O(n + m) | O(n + m) | O(n + m) | Streaming text: use KMP (stateful). |
| Rolling hash (single) | O(n) | O(n) avg, O(nm) worst | O(n) | Adversarial inputs: use double-hash or KMP. |
| Rolling hash (double) | O(n) | O(n) avg, collisions ~10⁻¹⁸ | O(n) | Hostile inputs: still possible, use 3 hashes or KMP. |
| Manacher | O(n) | O(n) | O(n) | Need only one palindrome check: simpler DP/two-pointer wins. |
| Aho-Corasick | O(Σm) | O(n + occurrences) | O(Σm · σ) | One pattern: use KMP. Σ ≪ patterns: trie blows up. |
| Suffix array + LCP | O(n log n) (or O(n)) | O(log n) per substring query | O(n) | Online updates: not supported. |
Notes. Rolling-hash worst case is O(nm) per query only if every hash compare collides — astronomically unlikely with double-hash. The realistic concern is adversarial inputs constructed to force collisions; competitive-programming judges occasionally have these. For ML/parsing/grep-like workloads, single-hash is fine.
14 · What breaks
- Single-hash collisions on adversarial inputs. Known-base, known-mod single hash can be cracked offline; CF and judge writers do it. Symptom: passes 90% of tests, fails the last 2 with "wrong answer". Fix: double-hash, randomised base, or switch to KMP.
- KMP failure-function off-by-one. Mixing
fail[i]withfail[i − 1]in the recursion. The roll-back isk = fail[k − 1]; the value atfail[i]is set after the inner loop. Easy to flip; symptom is the matcher reporting one-off positions. - Z-array boundary mishandling. Convention varies: some references set
z[0] = n, others setz[0] = 0. Match whichever your downstream code expects. Symptom: pattern matching includes the trivial "the string starts with itself" match. - Very large alphabets. Aho-Corasick stores a transition array per node, sized to the alphabet. For Unicode (σ = 10⁶+), this blows up to gigabytes. Use a map per node, or fall back to suffix-automaton.
- Hash modulus overflow. If
B · h + s[i]overflows int64 before the mod, you get garbage. Either pick smaller bases or use Go'sint(64-bit) with mod taken between multiplications. Worse in Java/C++ without unsigned types. - Separator clash. Concatenations like
p + '#' + tassume'#'isn't in either string. If the input alphabet includes'#', pick a character outside the alphabet (e.g. byte 255) or a multi-character sentinel. - Rolling-hash sign-flip.
h[r+1] − h[l] · pow[r − l + 1]can go negative. Always addmodand re-mod before comparing. - Manacher index conversion. The transformed string with separators is 2n + 1 long. Converting a palindrome radius back to original-string coordinates is the part everyone gets wrong on the first attempt.
15 · Where this gets asked
| Company | Common framing | Why it fits |
|---|---|---|
| strStr (LC 28), Shortest Palindrome (LC 214), Repeated Substring Pattern (LC 459) | KMP is a recurring Google L5 signal — they want to see the failure-function construction live. | |
| Meta | Longest Happy Prefix (LC 1392), Distinct Echo Substrings (LC 1316), Find All Anagrams (LC 438) | Search-index and recommendation infra are heavy on pattern-matching; KMP + rolling hash come up in onsite rounds. |
| Amazon | Implement strStr (LC 28), Repeated DNA Sequences (LC 187), Longest Duplicate Substring (LC 1044) | Substring search on log files / DNA-style sequences. LC 1044 is the canonical "binary search on length + rolling hash" combo. |
| Microsoft | Word Search (LC 79), Stream of Characters (LC 1032), Palindrome Pairs (LC 336) | Trie + Aho-Corasick style problems frequent in editor / search rounds. |
| Bloomberg / Two Sigma | Substring with concatenation, fuzzy matching, log-line classification | Financial-news ingestion uses high-throughput substring search; rolling hash and Aho-Corasick are both in the production toolbox. |
16 · Further reading
- CP Algorithms — String Algorithms. The reference. Prefix-function (KMP), Z-function, polynomial hashing, Manacher, suffix array, suffix automaton — all with code.
cp-algorithms.com/string/. - Steven Skiena. The Algorithm Design Manual, Chapter 18. The most readable textbook treatment of string matching, with implementation tips. The "war stories" sections are gold.
- Udi Manber. "A text compression scheme that allows fast searching directly in the compressed file" (1993). The original paper for many of the ideas later refined into modern suffix-array work.
- Aho & Corasick. "Efficient string matching: an aid to bibliographic search" (1975). The original Aho-Corasick paper. Short and surprisingly readable.
- Knuth, Morris & Pratt. "Fast pattern matching in strings" (1977). The KMP paper. Worth reading once to see the failure-function insight derived from first principles.
- Adjacent: trie pattern. The data structure behind Aho-Corasick and most multi-pattern matching.
- Adjacent: back to patterns index.