Suffix Tree Simulator

Every suffix of a string, packed into a single compressed trie. Every substring is a path from the root. Substring search collapses to O(m) — independent of the text length. The structure that powers grep, BLAST, the BWT, and every serious bioinformatics pipeline.

Text length
7
Nodes
11
Leaves
7
Last search

Text · ends with $
Search a substring
Demos
Tree : each path from root spells a substring; each leaf is a suffix start position
banana$ a na $ na $ na$ $ na$ $ 0 1 3 5 2 4 6
internal node (a branching point) leaf (suffix start index) walked by last search
All suffixes of "banana$"
0 banana$
1 anana$
2 nana$
3 ana$
4 na$
5 a$
6 $
Build (Ukkonen)O(n) — linear, online
Build (naive)O(n²) — used by this simulator
Substring searchO(m) — proportional to query, not text
All occurrencesO(m + k) — k = number of matches
Longest common substring (k strings)O(total length)
SpaceO(n) words, but the constant is large (~20n bytes)

What you're looking at

The diagram is the suffix tree of whatever string sits in the Text box — it has to end in the sentinel $. Hollow circles are internal nodes, the branch points, each one a substring that repeats; filled dots are leaves, labelled with the position where that suffix begins. Every edge carries a chunk of the text. Type a substring in the search box and hit find: the path the search walks lights up in the accent colour, and the result line reports the start positions where it occurs. Below sit the full list of suffixes and a complexity table.

Load the mississippi demo and search issi — two leaves light up at positions 1 and 4, because the substring repeats. Search ssipp for a single hit, or xyz to watch the walk die at the root with no path to follow. The thing worth noticing is that the search reads only as many characters as the query is long, never the whole text — that work was paid up front when the tree was built. Edit the text and the whole tree rebuilds live, so you can see how repeated letters create the branch points.


What is a suffix tree?

Searching for any substring, a million times.

A suffix tree is a compressed trie of every suffix of a string, supporting substring search in O(m) time (where m is the pattern length), independent of the text length. Peter Weiner introduced it in 1973; Edward McCreight simplified the construction in 1976; Esko Ukkonen gave the now-standard online construction in 1995. Today: the index in Bowtie/BLAST sequence search, the deduplication core in bzip2, the substring engine in many text editors.

Imagine you have a single very large text (a genome of three billion characters, a multi-gigabyte web crawl, the entire history of your application's logs), and you want to answer queries of the form does the substring “ATGCAT” appear anywhere in here, and if so, at which positions? The naive answer is to scan the text from start to finish for each query, checking the pattern against every starting position. This is O(n) per query, where n is the text length. For a billion-character text and a million queries, that is 10¹⁵ operations — weeks of CPU time. Smarter scans (Knuth–Morris–Pratt, Boyer–Moore) shave constant factors but stay O(n).

A first attempt to do better is to flip the problem inside out. Instead of looking at the text once per query, build an index of the text once up front, and then answer queries against that index. The index has to be searchable by any substring, not just by words. There are no spaces in DNA, and even in English text the user might search for a substring that crosses word boundaries. So a word-level inverted index (the kind Lucene and Elasticsearch use) is not enough. You need an index of every possible substring of the text.

That sounds prohibitively expensive: a string of length n has on the order of n²/2 distinct substrings. Storing them all is out of the question. The clever observation that makes the problem tractable is that every substring of T is a prefix of some suffix of T. There are only n suffixes. If you index those, you have implicitly indexed every substring. Build a tree (a trie) over all the suffixes; compress every chain of single-child edges into one labelled edge; the result has at most 2n − 1 nodes regardless of text content. This is the suffix tree.

Once the tree exists, substring search is a walk from the root following the pattern's characters. The cost is O(m) where m is the pattern length, independent of the text length. The same trillion-operation workload above drops to a few seconds, because each query reads at most a few dozen characters of the index. The simulator above shows the tree on small strings like banana$ and mississippi$; production implementations on a human genome take roughly twenty bytes per character of text (about 60 GB) and answer queries in microseconds.

There is one catch worth meeting head-on. Suffix trees are notoriously memory-hungry. The 20-byte-per-character cost of a real implementation is the price of all the pointers. The structure has many close relatives (suffix arrays at four to eight bytes per character, FM-indexes at well under one byte per character) that solve the same problem more frugally. Modern bioinformatics tools (BWA, Bowtie2, BLAST) all use FM-indexes for exactly this reason. The suffix tree itself remains the best teaching tool because every key concept (repeated substrings, longest common substring, the Burrows–Wheeler transform) has a clean visual interpretation in the tree. Once you see it, the array and FM-index variants make immediate sense.

SUFFIXES OF "banana$" · COMPRESSED INTO A TREE0 banana$1 anana$2 nana$3 ana$4 na$5 a$6 $rootabanana$na$na…$[3,5][5][0]na$$[2][4][6]SHARED PREFIXES MERGE · INTERNAL NODES = REPEATED SUBSTRINGS

Suffix tree structure: a trie of every suffix, compressed

A trie of every suffix — compressed.

Take a string, say banana$, and list every suffix:

0  banana$
1  anana$
2  nana$
3  ana$
4  na$
5  a$
6  $

Now insert each one into a trie (a prefix tree). The trie has n + 1 distinct paths from the root — one per suffix. Walking from root to any leaf spells a complete suffix; walking to any internal node spells a substring that appears multiple times (because the path branches there to reach distinct suffixes that share a prefix).

A naive trie would have O(n²) edges in the worst case (each suffix has length up to n; n suffixes; n edges per suffix). The suffix tree compresses every chain of single-child edges into one edge labelled with the substring. The result has at most 2n − 1 nodes and exactly n leaves. Linear space, regardless of text content.

Why does this matter? Two reasons. First, every substring of the text is a prefix of some suffix, so every substring corresponds to a unique walk-from-root path. Substring search becomes "follow the path letter by letter": O(m) where m is the query length, with no dependence on text length n. Second, internal nodes correspond exactly to repeated substrings, which means the tree itself is a compact encoding of all repetitions in the text. Many string algorithms (longest repeated substring, longest common substring of k strings, longest palindrome, distinct substring count) fall out as O(n) tree traversals.

The terminator $ at the end of the text is a small but important detail. It's a sentinel symbol that doesn't appear in the alphabet. Without it, some suffixes would be prefixes of others and the algorithm would have to special-case "this suffix ended in the middle of someone else's edge." With the sentinel, every suffix gets its own leaf.

How a suffix tree finds substrings

Searching is walking down a path.

To find every occurrence of pattern P in text T, walk from the root following the characters of P one at a time. At each step you're either on an edge (matching the next character) or at a branching node (picking which edge to follow). If at any point the next character doesn't appear in the right place, P isn't in T. If you successfully walk all m characters of P, you've reached a node (or a position on an edge) — and every leaf in that subtree corresponds to a position where P starts in T.

function search(root, P) {
  let cur = root;
  let i = 0;                            // index into P
  while (i < P.length) {
    const child = cur.children.get(P[i]);
    if (!child) return [];              // first char missing → no match
    const edge = child.edgeFromParent;
    let k = 0;
    while (k < edge.length && i < P.length) {
      if (edge[k] !== P[i]) return [];  // diverged → no match
      k++; i++;
    }
    cur = child;
  }
  // Collect all leaves in cur's subtree
  return collectLeaves(cur);             // each leaf has a suffixIdx
}

The cost is O(m) for the walk plus O(k) to enumerate the k matches. The text length n doesn't appear. Compare to the naive scan, which is O(n·m), or even Boyer-Moore which is O(n) average and O(n·m) worst case. Suffix trees move work to build time so query time is bounded.

Try the simulator. Set the text to mississippi$ and search for issi. Two leaves get illuminated, at suffix positions 1 and 4. Search for ssipp: one leaf, position 5. Search for missis (a substring): one leaf, position 0. Search for xyz: no path exists from the root, the search aborts immediately.

Origins: Ukkonen, McCreight, Weiner, three paths to O(n)

Ukkonen, McCreight, Weiner — three paths to O(n).

The naive O(n²) build (used by this simulator for clarity) inserts each suffix one at a time. For each suffix, walk from root, following edges and splitting them when necessary. Each insert costs O(n) so total is O(n²). Adequate up to maybe a million characters; beyond that, you need a linear-time algorithm.

Three landmark constructions exist:

  • Weiner (1973): first ever linear-time construction. Builds the tree right-to-left. Algorithmically clever, hard to implement.
  • McCreight (1976): left-to-right but offline (needs the full string up front). Cleaner code than Weiner; still elaborate.
  • Ukkonen (1995): left-to-right, online (can build incrementally as text streams in), most elegant code, and the one taught in modern courses. Uses suffix links and "implicit" trees that are made explicit on demand.

All three achieve O(n) total time but the constants differ. Ukkonen's algorithm uses two key tricks: suffix links (a pointer from each internal node u to the node whose path label is "u's path label minus the first character") and active-point bookkeeping (don't redundantly retraverse the tree from the root for each new suffix; jump via the suffix link). The result is amortised O(1) work per character, so building a tree on a 1GB genome takes seconds, not hours.

For everyday production use you usually don't write Ukkonen yourself; you reach for a library (sdsl, libdivsufsort, or simpler: just build a suffix array, which is faster and uses less space; see Part 5). The conceptual model in this simulator is what matters: every internal node = repeated substring, every leaf = suffix start, every edge label = a substring of the text.

What suffix trees solve: a surprising number of classical problems

A surprising number of classical problems.

Once you have the tree, a small toolkit of traversals solves a long list of textbook string problems. Each is O(n) or O(n log n) in tree operations:

  • Substring search: walk root-to-end of pattern. O(m + k) for k matches. The base case.
  • Number of distinct substrings: sum (edge length) over all edges. O(n).
  • Longest repeated substring: deepest internal node (by string-depth). O(n).
  • Longest common substring of two strings: build a generalised suffix tree on S₁ # S₂ $, find the deepest internal node whose subtree contains leaves from both strings. O(n + m).
  • Longest palindromic substring: build the suffix tree of S # reverse(S); longest common substring is the answer (with care). O(n).
  • Lempel-Ziv parsing: the LZ77 / LZ78 compression family parses a string into earlier-occurrence references; doable in O(n) with a suffix tree.
  • Burrows-Wheeler Transform: BWT (which underpins bzip2, BWA-MEM, FM-index search) is the leaf order of the suffix tree (or sorted suffix array) under a particular numbering. Linear time once you have the structure.
  • Pattern occurrence counting: subtree-leaf count at the end of the search walk. O(m + log n) with appropriate node augmentation.

The reason all of these reduce to suffix-tree traversals: every interesting shared substructure of the input text shows up as an internal node in the tree. Repeats, palindromes, common substrings: they're all "subtrees with leaves from particular suffix-position sets," which the tree happens to make visible.

Suffix tree, suffix array, FM-index: one essence

Three structures, one essence.

In production today, the suffix tree is rarely the structure of choice. Two relatives, the suffix array and the FM-index, preserve most of the algorithmic power with dramatically lower memory:

Suffix treeSuffix arrayFM-index (BWT)
Memory (bytes/char)15-254-80.4-2
Build timeO(n)O(n)O(n)
Substring searchO(m)O(m log n)O(m)
Easy to implementnoyesmoderate
Online (streaming) buildyes (Ukkonen)nono
Compressiblenosomewhatexcellent

Suffix arrays store the suffixes in sorted order, just an array of integers (suffix start positions). Substring search is binary search over the suffixes plus an O(m) comparison per probe. The LCP array (longest common prefix between adjacent suffixes) restores most of the suffix-tree functionality. Together they fit in 4-8 bytes per character.

FM-indexes compress the suffix array further by storing the Burrows-Wheeler transform of the text plus a few small auxiliary tables. The BWT is the last column of the sorted-rotations matrix — and it groups similar characters together, making the compressed form astonishingly small. Modern bioinformatics tools (BWA, Bowtie2) index human-genome-scale data into ~2GB FM-indexes that fit easily in memory.

When to use a suffix tree: education, online construction (where the text streams in), problems whose recursion naturally maps to "subtree of internal node X". When to use a suffix array: most production string-search problems below a few GB of text. When to use an FM-index: any genome-scale or bigger workload where memory matters more than constant factors.

Real systems built on this family of structures

Real systems built on this family of structures.

  • BLAST: the most-cited algorithm in biology, used billions of times a year for protein and DNA homology search. Its index is a hash of small k-mers (a simplified suffix-tree approximation), but the underlying intuition is purely suffix-based.
  • BWA-MEM and Bowtie2: modern short-read aligners for DNA sequencing. Both use FM-index over the BWT of a reference genome. A sequencing run produces hundreds of millions of short reads; matching them to a 3.2-billion-base reference would be impossible without these structures.
  • bzip2: compression. Its first stage is the BWT applied to blocks of input; the transformed data has long runs of identical characters that subsequent stages encode efficiently.
  • grep -F (fixed-string search): historically uses Aho-Corasick (a generalisation), but the underlying object (a "trie of patterns") is the same family.
  • Spell-checkers and autocomplete use simpler tries, but the techniques generalise. See the trie / autocomplete simulator.
  • Plagiarism detection: checking if any substring of length ≥ k appears in both documents. Build the generalised suffix tree, find common deep paths.
  • de novo genome assembly: reconstruct a genome from short reads by finding overlaps. Many assemblers are built on suffix-array overlap graphs.

For most small-to-medium text-search problems on the web, you'd reach for a inverted index (Lucene, Elasticsearch) instead — those are word-tokenised, not character-by-character, and serve a different question ("which documents contain word X" vs "where in this document does substring X occur"). When you need character-level search at scale, the suffix family is what shows up.

Aho-Corasick: when you have many patterns, not one

When you have many patterns, not one.

A suffix tree solves "find all occurrences of pattern P in text T" by indexing the text. The dual problem ("given many patterns, find every occurrence of any of them in a single pass over the text") is what Aho-Corasick (1975) was designed for. Build a trie of the patterns, then add failure links: from every node u, a pointer to the longest proper suffix of u's string that is also a prefix of some pattern. Then a single linear scan of the text walks the trie, jumping along failure links whenever the next text character doesn't extend the current match. Every pattern occurrence is reported. Total time: O(|text| + |patterns| + #occurrences).

This is the algorithm under fgrep (when given multiple patterns), under intrusion-detection systems matching against thousands of attack signatures, under spam filters scanning bodies for blocklisted phrases. It's also a building block in many bioinformatics pipelines: fast initial filtering before expensive alignment.

The key insight is that the failure-link structure is a kind of "trie with backtracking baked in." Without failure links, a brute-force pattern check on text of length n with m patterns of total length M takes O(n·M); with Aho-Corasick, O(n + M) once the structure is built. The build itself is O(M) using a BFS pass.

Aho-Corasick and suffix trees solve dual problems. Suffix tree: many patterns one at a time against a fixed text. Aho-Corasick: one text against a fixed pattern set. The structures look superficially similar (both are tries-with-extra-links) but the indexing direction is opposite. If your workload streams text against a fixed pattern dictionary, build Aho-Corasick once and stream against it. If your workload runs many ad-hoc pattern queries against fixed text, build the suffix tree (or array) once and query against it.

A second close relative is the suffix automaton (Blumer et al., 1985) — the smallest deterministic finite automaton (DFA) recognising all suffixes of a string. It has at most 2n − 1 states and a beautiful linear-time construction. Suffix automata are favoured over suffix trees in some applications because they're slightly easier to implement online and the state-based representation maps naturally to regular-expression engines.

FM-index: the memory miracle

FM-index and the memory miracle.

For huge texts (genomes, log archives, web crawls), the 20-bytes-per-character cost of a suffix tree is prohibitive. The 4-8 bytes per character of a suffix array is better. The 0.4-2 bytes per character of an FM-index is what makes modern bioinformatics possible.

An FM-index combines the BWT of the text with a few small auxiliary tables (rank arrays, sampled suffix-array values). Pattern search is "backward search": at each step, restrict the range of possible suffix-array positions using the BWT and the rank tables. Each step is O(1) given the right precomputed tables, so total search time is O(m). The clever observation is that the BWT preserves enough information about the original text to answer search questions, even though the BWT itself is a permutation that looks like noise to a casual inspector.

The FM-index of the human genome (~3.2 GB of DNA) is roughly 1.5 GB — small enough to fit in RAM on any modern laptop. This is what powers tools like BWA, Bowtie2, BLAST's index, and most production read aligners. Without compressed indexes, those tools would either need expensive cluster-scale RAM or would have to use slow disk-based access patterns. With them, alignment runs at gigabytes per minute on commodity hardware.

Building an FM-index requires the suffix array first (use SA-IS), then computing the BWT (one pass), then building wavelet trees or similar rank-supporting structures over the BWT. Total build time is O(n log n) practical, O(n) theoretical. The reference implementations are sdsl-lite (C++) and FM-index in scikit-bio (Python). Modern implementations include block-compressed FM-indexes that achieve sub-byte-per-character storage with negligible search overhead.

If you're working with text data above a gigabyte, the FM-index is almost always the right tool. Below that threshold, a suffix array (with optional LCP for tree-equivalent queries) is simpler and faster. The suffix tree itself remains the right pedagogical tool, and the right choice when you specifically need online (streaming) construction. But in production, its variants almost always win.

A curriculum that earns the speedup

A curriculum that earns the speedup.

Suffix trees are notoriously hard to internalise from a textbook. Here's a sequence that builds intuition before code:

  1. Build a regular trie. Insert ten short strings, draw the tree on paper, confirm that prefix sharing collapses correctly.
  2. Build a suffix trie (uncompressed). Take banana$, list all 7 suffixes, insert each one into a trie, count nodes. Notice how many internal chains have only one child.
  3. Compress the suffix trie by hand. Collapse single-child chains into edges with multi-character labels. You've just made a suffix tree the slow way. Count the nodes — at most 2n − 1.
  4. Implement the naive O(n²) build. Insert each suffix into a trie that supports edge splitting on demand. The simulator's source is one such implementation. Test it on mississippi$; verify the leaf count is n + 1.
  5. Implement substring search. Walk root-to-pattern-end, then DFS to enumerate leaves. Profile against naive substring search on a 100k-character text and 1000 random patterns. The speedup should be 100×+.
  6. Build a suffix array (naive). Sort all suffixes lexicographically. O(n² log n) but trivial to write.
  7. Implement DC3 or SA-IS suffix-array construction. Linear-time. The DC3 paper is dense but rewarding; SA-IS is the industry standard.
  8. Compute the LCP array using Kasai's algorithm. Now you have suffix-array + LCP, which gives you most of the suffix-tree functionality at 1/4 the memory.
  9. Apply: longest repeated substring. With suffix tree: find the deepest internal node by string-depth. With suffix array + LCP: max(LCP).
  10. Apply: longest common substring of two strings. Build a generalised suffix tree on S₁ # S₂ $; find the deepest internal node whose subtree contains leaves from both strings.
  11. Build the BWT. With the suffix array, this is a one-line transformation. Run-length-encode the BWT and compare its size against gzip on natural text.
A useful identity

For any two suffixes i and j in the suffix tree, the depth of their lowest common ancestor equals the length of the longest common prefix between them. This is the connection between suffix trees and the LCP array — and it's why range-min queries on the LCP array answer pattern-counting questions in O(log n).

What goes silently wrong with suffix structures

What goes silently wrong.

Forgetting the terminator. Without a unique end-of-string symbol like $, a suffix that's a prefix of another suffix won't end at a leaf — it'll end mid-edge. Some implementations work around this; most don't. Always append a sentinel.

Edge labels stored as strings vs (start, end) pairs. Storing actual substrings on every edge wastes O(n²) bytes in the worst case. Real implementations store (start, end) indices into the original text, lookup characters on demand. This drops memory from quadratic to linear.

The constant matters more than the asymptotic. A suffix tree has ~20 bytes per character of text in production implementations. A suffix array has 4-8 bytes. For a 1GB text, that's 20GB vs 4GB — frequently the difference between "fits in RAM" and "doesn't." Use the array unless you specifically need streaming construction.

Generalised suffix trees on multiple strings need a unique separator per string ($₁, $₂, ...). Reusing one separator across strings introduces phantom matches at the boundaries.

Implementation complexity. A correct Ukkonen takes most experienced engineers 1-2 weeks to get right and pass a test suite. A correct suffix array (using SA-IS) takes a few days. Pick your battles.

Further reading on suffix trees

Primary sources, in order.

Found this useful?