09 / 19

Playbook / 09

Typeahead / search autocomplete

The search box that suggests "facebook" after you type "fac". Latency budget is brutal — under 100 ms P99 from keystroke to render — and every dropdown impression costs server work. The system splits cleanly into an online read path (find top-k completions for the current prefix, fast) and an offline ranking pipeline (turn query logs into a ranked completion graph, refreshed every few hours).

1 · Clarifying questions

"Personal or global completions?"	Start with global. Personalisation is a re-ranking on top of the global list, not a different system.
"Number of users / QPS?"	500M DAU. Average user types 5–10 characters per query, fires a request per keystroke. ~1M autocomplete QPS sustained; 5M at peak.
"Latency budget?"	P99 under 100 ms from request to first byte. P50 under 30 ms.
"How fresh do trending queries need to be?"	15-minute lag is fine for global. Real-time only matters for breaking events (NBA Finals scoreline) — that's a separate "trending" pipeline.
"Multi-language?"	Yes. Per-language index. Detection by user locale; trigram fallback when ambiguous.
"What's in scope?"	Just suggestions for the search box. Not the actual search results. Not the click-through to a page.
"Safety / abuse?"	Yes — exclude profanity, names of real people that don't pass a threshold, and "do not appear in suggestions" hand-blocklisted strings (legal, child safety).

2 · Capacity math

Number	Derivation	Value
Autocomplete QPS	500M DAU × 10 queries/day × 7 keystrokes / 86 400 s	~400k QPS sustained, ~2M peak
Vocabulary size	Distinct query strings worth indexing (frequency ≥ threshold)	~10M global, ~5M per major language
Index size on disk	10M strings × avg 30 bytes + structure overhead	~500 MB per language — fits in RAM per server
Query-log volume	Searches × scaling factor	~10B query events / day → ~500 GB / day raw
Cache hit rate (single-character prefix)	"a", "th", "th" represent > 30% of all prefixes	~85% with hot 1K prefixes cached at edge

Two numbers dominate. 400k QPS — every dropdown is a server hit (can't cache user-specific results easily). And 85% edge cacheability — short prefixes are repeated billions of times across users.

3 · The index — what actually stores the completions

Three approaches, in increasing sophistication:

Option A — Naive trie with per-node top-k

Build a prefix trie of all queries. At each node store the top 10 completions that pass through this prefix, pre-computed. A lookup is O(prefix_length) — descend the trie, return the top-k stored at the deepest node reached.

Trie node (conceptual, Go):
    type node struct {
        children map[byte]*node    // or [26]*node for ASCII letters
        topK     []completion       // (text, score) tuples, sorted
    }

Lookup("fac"):
    walk: root -> 'f' -> 'a' -> 'c'
    return node['f']['a']['c'].topK
Time: O(prefix_len)

Pro: dead simple, blindingly fast reads. Con: writes are expensive — when a completion's score changes, you have to update every node on its path. Bulk rebuild (every hour or day) is the only sane way to maintain it.

Option B — DAWG / FST (compressed trie)

A directed acyclic word graph compresses common suffixes, cutting memory ~10×. Lucene's Finite State Transducer is the canonical implementation; suggesters like Elasticsearch's "completion suggester" use this. Read latency is the same; build time is slower; memory is dramatically smaller — important when you want to fit a 500M-string index in RAM.

Option C — Key-value lookup (the "boring" choice)

Store completions in a sorted KV (e.g. RocksDB) keyed by the prefix. Per prefix, the value is the precomputed top-10 list. GET "fac" returns the list. No trie traversal — the storage layer does it for you. This is what most production systems actually run, because operationally it's just another KV cluster.

The interview signal. Naming all three (trie / FST / KV) and defending one is the answer. Naive trie is fine for the on-whiteboard answer; real-world it's almost always (b) or (c). The decisive factor isn't latency — all three are sub-millisecond — it's operability: rolling out a new index, rolling it back, A/B-testing two versions side by side. KV-with-versioned-keys is much easier to operate than rebuilding a trie.

4 · The offline pipeline — where the rankings come from

The interesting part of typeahead is the offline pipeline that turns "raw query logs" into "top-10 completions per prefix". Three stages:

Stage 1 — Aggregate

Stream query events from a Kafka topic into a windowed aggregator (Flink / Spark Streaming / Beam). For each unique normalised query string, accumulate counts by window: last hour, last day, last week. Output: (normalised_query, count_h, count_d, count_w).

Normalisation matters: lowercase, strip punctuation, collapse whitespace, expand emoji to text. Two queries differing only in case are the same query.

Stage 2 — Score

Combine counts with other signals to produce a single score per query:

Signal	Weight	Why
Recent volume (last hour)	0.4	Captures trending; the "Olympic gold medal" prefix spike
Sustained volume (last week)	0.3	Stable queries that always perform well
Click-through rate	0.2	Penalises typeahead suggestions that don't lead to clicks
Diversity bonus	0.05	Slight nudge so all 10 results aren't the same brand/topic
Safety filter	×0 if blocked	Blocklists, child-safety, profanity, doxxing — multiplicative gate

Stage 3 — Build the per-prefix top-K

For every query in the scored list and every prefix of that query, push into a min-heap of size K (10) keyed on the score. After processing all queries, the heap at each prefix is the top-K completions for that prefix.

For ~10M queries × avg length 20 prefixes each = 200M heap operations. Embarrassingly parallel by first-character (shard the work into 26 buckets); a few hundred mappers finish it in 30–60 minutes.

Stage 4 — Serve

Atomically swap the new index. Two common patterns:

Versioned key prefix. Index v={timestamp}/fac coexists with v={old}/fac. A pointer key indicates which version is "active". Atomically flip the pointer; the new index serves all reads from the next request.
Side-load + restart. Each typeahead server holds the index in RAM. Build the new index in a parallel directory; signal servers to re-mmap; old index drops out. Coordinated rollout to avoid all servers reloading at once.

5 · High-level architecture

The online path is short and synchronous: edge cache → typeahead service → personalise → user. The offline pipeline is the heavy lifting and runs every ~15 minutes. The personalisation sidecar is an optional re-ranking layer that reads from a user-history service.

6 · Deep dive — edge caching

85% of all requests are for the top 1000 prefixes. Caching them at edge (CDN-style POPs) drops origin load by 10× and shaves 30–50 ms off the P99.

Concern	Approach
Cache key	(language, prefix, region) — never the full user identity
TTL	Short: 60 seconds. Trending content needs fresh, even though most prefixes don't.
Cache size	Top 10k prefixes per language × ~2 KB each ≈ 20 MB — easily fits at every POP
Cache miss path	Pass-through to typeahead service. SWR (stale-while-revalidate) so the user never waits for a refresh.
Personalisation	Edge can't help with personalised re-ranks. Return the global top-30 from edge, do personalisation client-side (small JS) — or skip personalisation on edge-cached responses.
Invalidation	Don't invalidate proactively. Short TTL is enough; safety blocklists update via a side-channel that the typeahead service consults on every uncached miss.

7 · Personalisation — the optional ML layer

Global top-10 is fine for most users. Personalisation matters for two cases: recent searches (boost queries the user typed in the last week) and interest signal (a user who clicks football news a lot gets "fifa world cup" before "fall season").

The cleanest design: typeahead service returns the global top-30 for the prefix; a re-ranking service applies the personalisation signals and trims to top-10. Re-ranking runs on every uncached miss with a budget of ~5 ms.

The latency trap. Personalisation can't be in the cache key — every user is different — so personalised responses bypass edge. Decide upfront: which users get personalisation, which don't? Most production systems personalise only the logged-in users on the heaviest devices, where the ~30 ms latency hit is acceptable.

8 · Safety & abuse

Blocklists. A hand-curated set of strings that must never appear (legal threats, doxxing, child-safety). Multiplicative filter at score time — if the query matches a blocklist regex, score becomes zero.
Classifier. A small ML model that scores each candidate for safety (profanity, harassment patterns). Threshold-gated; below threshold, drop.
Trending abuse. Real-time pipeline that monitors sudden spikes in low-quality queries (coordinated trolling). Auto-block via a fast feedback loop into the safety service.
Hand removal. Trust & safety has a button: "remove this suggestion globally". Logged, audited, reversible.
Personal-name filter. Don't suggest "{person's name} address" or "{person's name} salary" unless the person is a public figure (notability threshold from external knowledge graph).

9 · Failure modes

Failure	Symptom	Runbook
Index build fails	Yesterday's index serves; no new trending captured	Alert at 4 h beyond schedule. Investigate Flink job. Stale index is graceful degradation, not an outage.
Edge cache cluster down	P99 climbs from 50 ms to 200 ms	Origin takes the brunt; auto-scale typeahead service. SLO breached but no outage.
Origin overload	5xx rate climbing	Return cached responses past their TTL; graceful degradation to "no suggestions" rather than error.
Bad index pushed	Quality regression in A/B	Atomic rollback by flipping the version pointer. New index keeps building offline.
Sudden trending spike	15-minute index lag misses fresh content	Trending-overlay pipeline: real-time top-100 trending queries, served from a side-cache, merged into responses.
Personalisation service slow	P99 climbs only for logged-in users	Circuit-break personalisation; fall back to global top-10.

10 · Cost & operability

Component	Driver	~Order
Edge cache (CDN)	~85% hit rate; ~5M peak QPS × ~$0.01 / GB egress	~$30k / month
Typeahead service	~50 cores × $200/mo across regions	~$10k / month
Offline pipeline (Flink)	~200 cores, runs ~25% of the day	~$5k / month
Index storage	~5 GB × redundancy	~$200 / month
Personalisation	Re-rank service + user-embeddings store	~$15k / month
Safety pipeline	Classifier + blocklist + audit	~$5k / month

The CDN egress bill dominates. Reduce it with shorter responses (just the top-5, not top-30); compress with Brotli; deduplicate the JSON envelope.

11 · Trade-offs & what would change at 10×

If…	Then…
10× QPS	Multi-region edge cache (already there); add per-language pools so trending in Japan doesn't pollute the US cache. Tune typeahead service to less per-request work; pre-compute personalisation features.
Sub-25-ms P99	Move the index closer to the user. Each edge POP runs a typeahead-service replica with the language-relevant index in RAM. The origin is just for cache misses on long-tail prefixes.
Real-time trending integrated	The offline pipeline goes from 15 min to 1 min. Or split: 15-min index for the long tail, 1-min "trending overlay" for the top 100 prefixes.
Multi-modal (image / voice queries)	Voice typeahead is a different system (acoustic model + text completion fusion). Image is a different problem entirely. Keep them on separate paths.
What would a senior+ answer add?	The A/B-test infrastructure. Typeahead has hundreds of running experiments. Each request is bucketed; results logged; CTR / engagement metrics fed back into the scoring function. Acknowledging this — even briefly — is what separates a competent answer from a memorable one.

Typeahead / search autocomplete

1 · Clarifying questions

2 · Capacity math

3 · The index — what actually stores the completions

Option A — Naive trie with per-node top-k

Option B — DAWG / FST (compressed trie)

Option C — Key-value lookup (the "boring" choice)

4 · The offline pipeline — where the rankings come from

Stage 1 — Aggregate

Stage 2 — Score

Stage 3 — Build the per-prefix top-K

Stage 4 — Serve

5 · High-level architecture

6 · Deep dive — edge caching

7 · Personalisation — the optional ML layer

8 · Safety & abuse

9 · Failure modes

10 · Cost & operability

11 · Trade-offs & what would change at 10×

Further reading

Next — Web crawler