09 / 19
Playbook / 09

Typeahead / search autocomplete

The search box that suggests "facebook" after you type "fac". Latency budget is brutal — under 100 ms P99 from keystroke to render — and every dropdown impression costs server work. The system splits cleanly into an online read path (find top-k completions for the current prefix, fast) and an offline ranking pipeline (turn query logs into a ranked completion graph, refreshed every few hours).


1 · Clarifying questions

"Personal or global completions?"Start with global. Personalisation is a re-ranking on top of the global list, not a different system.
"Number of users / QPS?"500M DAU. Average user types 5–10 characters per query, fires a request per keystroke. ~1M autocomplete QPS sustained; 5M at peak.
"Latency budget?"P99 under 100 ms from request to first byte. P50 under 30 ms.
"How fresh do trending queries need to be?"15-minute lag is fine for global. Real-time only matters for breaking events (NBA Finals scoreline) — that's a separate "trending" pipeline.
"Multi-language?"Yes. Per-language index. Detection by user locale; trigram fallback when ambiguous.
"What's in scope?"Just suggestions for the search box. Not the actual search results. Not the click-through to a page.
"Safety / abuse?"Yes — exclude profanity, names of real people that don't pass a threshold, and "do not appear in suggestions" hand-blocklisted strings (legal, child safety).

2 · Capacity math

NumberDerivationValue
Autocomplete QPS500M DAU × 10 queries/day × 7 keystrokes / 86 400 s~400k QPS sustained, ~2M peak
Vocabulary sizeDistinct query strings worth indexing (frequency ≥ threshold)~10M global, ~5M per major language
Index size on disk10M strings × avg 30 bytes + structure overhead~500 MB per language — fits in RAM per server
Query-log volumeSearches × scaling factor~10B query events / day → ~500 GB / day raw
Cache hit rate (single-character prefix)"a", "th", "th" represent > 30% of all prefixes~85% with hot 1K prefixes cached at edge

Two numbers dominate. 400k QPS — every dropdown is a server hit (can't cache user-specific results easily). And 85% edge cacheability — short prefixes are repeated billions of times across users.

3 · The index — what actually stores the completions

Three approaches, in increasing sophistication:

Option A — Naive trie with per-node top-k

Build a prefix trie of all queries. At each node store the top 10 completions that pass through this prefix, pre-computed. A lookup is O(prefix_length) — descend the trie, return the top-k stored at the deepest node reached.

Trie node (conceptual, Go):
    type node struct {
        children map[byte]*node    // or [26]*node for ASCII letters
        topK     []completion       // (text, score) tuples, sorted
    }

Lookup("fac"):
    walk: root -> 'f' -> 'a' -> 'c'
    return node['f']['a']['c'].topK
Time: O(prefix_len)

Pro: dead simple, blindingly fast reads. Con: writes are expensive — when a completion's score changes, you have to update every node on its path. Bulk rebuild (every hour or day) is the only sane way to maintain it.

Option B — DAWG / FST (compressed trie)

A directed acyclic word graph compresses common suffixes, cutting memory ~10×. Lucene's Finite State Transducer is the canonical implementation; suggesters like Elasticsearch's "completion suggester" use this. Read latency is the same; build time is slower; memory is dramatically smaller — important when you want to fit a 500M-string index in RAM.

Option C — Key-value lookup (the "boring" choice)

Store completions in a sorted KV (e.g. RocksDB) keyed by the prefix. Per prefix, the value is the precomputed top-10 list. GET "fac" returns the list. No trie traversal — the storage layer does it for you. This is what most production systems actually run, because operationally it's just another KV cluster.

The interview signal. Naming all three (trie / FST / KV) and defending one is the answer. Naive trie is fine for the on-whiteboard answer; real-world it's almost always (b) or (c). The decisive factor isn't latency — all three are sub-millisecond — it's operability: rolling out a new index, rolling it back, A/B-testing two versions side by side. KV-with-versioned-keys is much easier to operate than rebuilding a trie.

4 · The offline pipeline — where the rankings come from

The interesting part of typeahead is the offline pipeline that turns "raw query logs" into "top-10 completions per prefix". Three stages:

Stage 1 — Aggregate

Stream query events from a Kafka topic into a windowed aggregator (Flink / Spark Streaming / Beam). For each unique normalised query string, accumulate counts by window: last hour, last day, last week. Output: (normalised_query, count_h, count_d, count_w).

Normalisation matters: lowercase, strip punctuation, collapse whitespace, expand emoji to text. Two queries differing only in case are the same query.

Stage 2 — Score

Combine counts with other signals to produce a single score per query:

SignalWeightWhy
Recent volume (last hour)0.4Captures trending; the "Olympic gold medal" prefix spike
Sustained volume (last week)0.3Stable queries that always perform well
Click-through rate0.2Penalises typeahead suggestions that don't lead to clicks
Diversity bonus0.05Slight nudge so all 10 results aren't the same brand/topic
Safety filter×0 if blockedBlocklists, child-safety, profanity, doxxing — multiplicative gate

Stage 3 — Build the per-prefix top-K

For every query in the scored list and every prefix of that query, push into a min-heap of size K (10) keyed on the score. After processing all queries, the heap at each prefix is the top-K completions for that prefix.

For ~10M queries × avg length 20 prefixes each = 200M heap operations. Embarrassingly parallel by first-character (shard the work into 26 buckets); a few hundred mappers finish it in 30–60 minutes.

Stage 4 — Serve

Atomically swap the new index. Two common patterns:

  • Versioned key prefix. Index v={timestamp}/fac coexists with v={old}/fac. A pointer key indicates which version is "active". Atomically flip the pointer; the new index serves all reads from the next request.
  • Side-load + restart. Each typeahead server holds the index in RAM. Build the new index in a parallel directory; signal servers to re-mmap; old index drops out. Coordinated rollout to avoid all servers reloading at once.

5 · High-level architecture

ONLINE — read pathSearch boxEdge cacheprefix → top-10Typeahead svctrie / KV lookupPersonalisere-rank top-30UserOFFLINE — index pipelineKafka — searchFlink aggwindowed countsScore & filterCTR, safety, decayBuild top-Kper-prefix heapIndexPERSONALISATION SIDECARUser history svcrecent queriesUser embeddingsinterest vectorsRe-rank modelML, ~5 msSAFETYBlocklist + classifier sidecar

The online path is short and synchronous: edge cache → typeahead service → personalise → user. The offline pipeline is the heavy lifting and runs every ~15 minutes. The personalisation sidecar is an optional re-ranking layer that reads from a user-history service.

6 · Deep dive — edge caching

85% of all requests are for the top 1000 prefixes. Caching them at edge (CDN-style POPs) drops origin load by 10× and shaves 30–50 ms off the P99.

ConcernApproach
Cache key(language, prefix, region) — never the full user identity
TTLShort: 60 seconds. Trending content needs fresh, even though most prefixes don't.
Cache sizeTop 10k prefixes per language × ~2 KB each ≈ 20 MB — easily fits at every POP
Cache miss pathPass-through to typeahead service. SWR (stale-while-revalidate) so the user never waits for a refresh.
PersonalisationEdge can't help with personalised re-ranks. Return the global top-30 from edge, do personalisation client-side (small JS) — or skip personalisation on edge-cached responses.
InvalidationDon't invalidate proactively. Short TTL is enough; safety blocklists update via a side-channel that the typeahead service consults on every uncached miss.

7 · Personalisation — the optional ML layer

Global top-10 is fine for most users. Personalisation matters for two cases: recent searches (boost queries the user typed in the last week) and interest signal (a user who clicks football news a lot gets "fifa world cup" before "fall season").

The cleanest design: typeahead service returns the global top-30 for the prefix; a re-ranking service applies the personalisation signals and trims to top-10. Re-ranking runs on every uncached miss with a budget of ~5 ms.

The latency trap. Personalisation can't be in the cache key — every user is different — so personalised responses bypass edge. Decide upfront: which users get personalisation, which don't? Most production systems personalise only the logged-in users on the heaviest devices, where the ~30 ms latency hit is acceptable.

8 · Safety & abuse

  • Blocklists. A hand-curated set of strings that must never appear (legal threats, doxxing, child-safety). Multiplicative filter at score time — if the query matches a blocklist regex, score becomes zero.
  • Classifier. A small ML model that scores each candidate for safety (profanity, harassment patterns). Threshold-gated; below threshold, drop.
  • Trending abuse. Real-time pipeline that monitors sudden spikes in low-quality queries (coordinated trolling). Auto-block via a fast feedback loop into the safety service.
  • Hand removal. Trust & safety has a button: "remove this suggestion globally". Logged, audited, reversible.
  • Personal-name filter. Don't suggest "{person's name} address" or "{person's name} salary" unless the person is a public figure (notability threshold from external knowledge graph).

9 · Failure modes

FailureSymptomRunbook
Index build failsYesterday's index serves; no new trending capturedAlert at 4 h beyond schedule. Investigate Flink job. Stale index is graceful degradation, not an outage.
Edge cache cluster downP99 climbs from 50 ms to 200 msOrigin takes the brunt; auto-scale typeahead service. SLO breached but no outage.
Origin overload5xx rate climbingReturn cached responses past their TTL; graceful degradation to "no suggestions" rather than error.
Bad index pushedQuality regression in A/BAtomic rollback by flipping the version pointer. New index keeps building offline.
Sudden trending spike15-minute index lag misses fresh contentTrending-overlay pipeline: real-time top-100 trending queries, served from a side-cache, merged into responses.
Personalisation service slowP99 climbs only for logged-in usersCircuit-break personalisation; fall back to global top-10.

10 · Cost & operability

ComponentDriver~Order
Edge cache (CDN)~85% hit rate; ~5M peak QPS × ~$0.01 / GB egress~$30k / month
Typeahead service~50 cores × $200/mo across regions~$10k / month
Offline pipeline (Flink)~200 cores, runs ~25% of the day~$5k / month
Index storage~5 GB × redundancy~$200 / month
PersonalisationRe-rank service + user-embeddings store~$15k / month
Safety pipelineClassifier + blocklist + audit~$5k / month

The CDN egress bill dominates. Reduce it with shorter responses (just the top-5, not top-30); compress with Brotli; deduplicate the JSON envelope.

11 · Trade-offs & what would change at 10×

If…Then…
10× QPSMulti-region edge cache (already there); add per-language pools so trending in Japan doesn't pollute the US cache. Tune typeahead service to less per-request work; pre-compute personalisation features.
Sub-25-ms P99Move the index closer to the user. Each edge POP runs a typeahead-service replica with the language-relevant index in RAM. The origin is just for cache misses on long-tail prefixes.
Real-time trending integratedThe offline pipeline goes from 15 min to 1 min. Or split: 15-min index for the long tail, 1-min "trending overlay" for the top 100 prefixes.
Multi-modal (image / voice queries)Voice typeahead is a different system (acoustic model + text completion fusion). Image is a different problem entirely. Keep them on separate paths.
What would a senior+ answer add?The A/B-test infrastructure. Typeahead has hundreds of running experiments. Each request is bucketed; results logged; CTR / engagement metrics fed back into the scoring function. Acknowledging this — even briefly — is what separates a competent answer from a memorable one.

Further reading

  • Elasticsearch — "Completion Suggester". Production reference for FST-based typeahead. The official guide is small and worth reading once.
  • Apache Lucene — Robert Muir, "FSTs in Lucene". The data-structure background for FST/DAWG-based suggesters.
  • Google — "How Google Search builds suggestions". A surprisingly candid 2017 SRE blog post on the offline pipeline architecture.
  • Bing autosuggest team — "Building autosuggest at scale". Discusses the rank-by-CTR feedback loop and safety filtering.
  • LinkedIn — "Building typeahead with three sources". Engineering-blog post on combining global + connection-graph + personal sources.
  • Adjacent: LSM trees. The storage engine under most KV-style suggester indexes.
  • Adjacent: Trie data structure. The mechanism explainer.
  • Adjacent: Top-k / trending. The pipeline shape reuses many ideas.
Found this useful?