Typeahead / search autocomplete
The search box that suggests "facebook" after you type "fac". Latency budget is brutal — under 100 ms P99 from keystroke to render — and every dropdown impression costs server work. The system splits cleanly into an online read path (find top-k completions for the current prefix, fast) and an offline ranking pipeline (turn query logs into a ranked completion graph, refreshed every few hours).
1 · Clarifying questions
| "Personal or global completions?" | Start with global. Personalisation is a re-ranking on top of the global list, not a different system. |
| "Number of users / QPS?" | 500M DAU. Average user types 5–10 characters per query, fires a request per keystroke. ~1M autocomplete QPS sustained; 5M at peak. |
| "Latency budget?" | P99 under 100 ms from request to first byte. P50 under 30 ms. |
| "How fresh do trending queries need to be?" | 15-minute lag is fine for global. Real-time only matters for breaking events (NBA Finals scoreline) — that's a separate "trending" pipeline. |
| "Multi-language?" | Yes. Per-language index. Detection by user locale; trigram fallback when ambiguous. |
| "What's in scope?" | Just suggestions for the search box. Not the actual search results. Not the click-through to a page. |
| "Safety / abuse?" | Yes — exclude profanity, names of real people that don't pass a threshold, and "do not appear in suggestions" hand-blocklisted strings (legal, child safety). |
2 · Capacity math
| Number | Derivation | Value |
|---|---|---|
| Autocomplete QPS | 500M DAU × 10 queries/day × 7 keystrokes / 86 400 s | ~400k QPS sustained, ~2M peak |
| Vocabulary size | Distinct query strings worth indexing (frequency ≥ threshold) | ~10M global, ~5M per major language |
| Index size on disk | 10M strings × avg 30 bytes + structure overhead | ~500 MB per language — fits in RAM per server |
| Query-log volume | Searches × scaling factor | ~10B query events / day → ~500 GB / day raw |
| Cache hit rate (single-character prefix) | "a", "th", "th" represent > 30% of all prefixes | ~85% with hot 1K prefixes cached at edge |
Two numbers dominate. 400k QPS — every dropdown is a server hit (can't cache user-specific results easily). And 85% edge cacheability — short prefixes are repeated billions of times across users.
3 · The index — what actually stores the completions
Three approaches, in increasing sophistication:
Option A — Naive trie with per-node top-k
Build a prefix trie of all queries. At each node store the top 10 completions that pass through this prefix, pre-computed. A lookup is O(prefix_length) — descend the trie, return the top-k stored at the deepest node reached.
Trie node (conceptual, Go):
type node struct {
children map[byte]*node // or [26]*node for ASCII letters
topK []completion // (text, score) tuples, sorted
}
Lookup("fac"):
walk: root -> 'f' -> 'a' -> 'c'
return node['f']['a']['c'].topK
Time: O(prefix_len)Pro: dead simple, blindingly fast reads. Con: writes are expensive — when a completion's score changes, you have to update every node on its path. Bulk rebuild (every hour or day) is the only sane way to maintain it.
Option B — DAWG / FST (compressed trie)
A directed acyclic word graph compresses common suffixes, cutting memory ~10×. Lucene's Finite State Transducer is the canonical implementation; suggesters like Elasticsearch's "completion suggester" use this. Read latency is the same; build time is slower; memory is dramatically smaller — important when you want to fit a 500M-string index in RAM.
Option C — Key-value lookup (the "boring" choice)
Store completions in a sorted KV (e.g. RocksDB) keyed by the prefix. Per prefix, the
value is the precomputed top-10 list. GET "fac" returns the list. No
trie traversal — the storage layer does it for you. This is what most production
systems actually run, because operationally it's just another KV cluster.
4 · The offline pipeline — where the rankings come from
The interesting part of typeahead is the offline pipeline that turns "raw query logs" into "top-10 completions per prefix". Three stages:
Stage 1 — Aggregate
Stream query events from a Kafka topic into a windowed aggregator (Flink / Spark
Streaming / Beam). For each unique normalised query string, accumulate counts by
window: last hour, last day, last week. Output: (normalised_query, count_h,
count_d, count_w).
Normalisation matters: lowercase, strip punctuation, collapse whitespace, expand emoji to text. Two queries differing only in case are the same query.
Stage 2 — Score
Combine counts with other signals to produce a single score per query:
| Signal | Weight | Why |
|---|---|---|
| Recent volume (last hour) | 0.4 | Captures trending; the "Olympic gold medal" prefix spike |
| Sustained volume (last week) | 0.3 | Stable queries that always perform well |
| Click-through rate | 0.2 | Penalises typeahead suggestions that don't lead to clicks |
| Diversity bonus | 0.05 | Slight nudge so all 10 results aren't the same brand/topic |
| Safety filter | ×0 if blocked | Blocklists, child-safety, profanity, doxxing — multiplicative gate |
Stage 3 — Build the per-prefix top-K
For every query in the scored list and every prefix of that query, push into a min-heap of size K (10) keyed on the score. After processing all queries, the heap at each prefix is the top-K completions for that prefix.
For ~10M queries × avg length 20 prefixes each = 200M heap operations. Embarrassingly parallel by first-character (shard the work into 26 buckets); a few hundred mappers finish it in 30–60 minutes.
Stage 4 — Serve
Atomically swap the new index. Two common patterns:
- Versioned key prefix. Index
v={timestamp}/faccoexists withv={old}/fac. A pointer key indicates which version is "active". Atomically flip the pointer; the new index serves all reads from the next request. - Side-load + restart. Each typeahead server holds the index in RAM. Build the new index in a parallel directory; signal servers to re-mmap; old index drops out. Coordinated rollout to avoid all servers reloading at once.
5 · High-level architecture
The online path is short and synchronous: edge cache → typeahead service → personalise → user. The offline pipeline is the heavy lifting and runs every ~15 minutes. The personalisation sidecar is an optional re-ranking layer that reads from a user-history service.
6 · Deep dive — edge caching
85% of all requests are for the top 1000 prefixes. Caching them at edge (CDN-style POPs) drops origin load by 10× and shaves 30–50 ms off the P99.
| Concern | Approach |
|---|---|
| Cache key | (language, prefix, region) — never the full user identity |
| TTL | Short: 60 seconds. Trending content needs fresh, even though most prefixes don't. |
| Cache size | Top 10k prefixes per language × ~2 KB each ≈ 20 MB — easily fits at every POP |
| Cache miss path | Pass-through to typeahead service. SWR (stale-while-revalidate) so the user never waits for a refresh. |
| Personalisation | Edge can't help with personalised re-ranks. Return the global top-30 from edge, do personalisation client-side (small JS) — or skip personalisation on edge-cached responses. |
| Invalidation | Don't invalidate proactively. Short TTL is enough; safety blocklists update via a side-channel that the typeahead service consults on every uncached miss. |
7 · Personalisation — the optional ML layer
Global top-10 is fine for most users. Personalisation matters for two cases: recent searches (boost queries the user typed in the last week) and interest signal (a user who clicks football news a lot gets "fifa world cup" before "fall season").
The cleanest design: typeahead service returns the global top-30 for the prefix; a re-ranking service applies the personalisation signals and trims to top-10. Re-ranking runs on every uncached miss with a budget of ~5 ms.
8 · Safety & abuse
- Blocklists. A hand-curated set of strings that must never appear (legal threats, doxxing, child-safety). Multiplicative filter at score time — if the query matches a blocklist regex, score becomes zero.
- Classifier. A small ML model that scores each candidate for safety (profanity, harassment patterns). Threshold-gated; below threshold, drop.
- Trending abuse. Real-time pipeline that monitors sudden spikes in low-quality queries (coordinated trolling). Auto-block via a fast feedback loop into the safety service.
- Hand removal. Trust & safety has a button: "remove this suggestion globally". Logged, audited, reversible.
- Personal-name filter. Don't suggest "{person's name} address" or "{person's name} salary" unless the person is a public figure (notability threshold from external knowledge graph).
9 · Failure modes
| Failure | Symptom | Runbook |
|---|---|---|
| Index build fails | Yesterday's index serves; no new trending captured | Alert at 4 h beyond schedule. Investigate Flink job. Stale index is graceful degradation, not an outage. |
| Edge cache cluster down | P99 climbs from 50 ms to 200 ms | Origin takes the brunt; auto-scale typeahead service. SLO breached but no outage. |
| Origin overload | 5xx rate climbing | Return cached responses past their TTL; graceful degradation to "no suggestions" rather than error. |
| Bad index pushed | Quality regression in A/B | Atomic rollback by flipping the version pointer. New index keeps building offline. |
| Sudden trending spike | 15-minute index lag misses fresh content | Trending-overlay pipeline: real-time top-100 trending queries, served from a side-cache, merged into responses. |
| Personalisation service slow | P99 climbs only for logged-in users | Circuit-break personalisation; fall back to global top-10. |
10 · Cost & operability
| Component | Driver | ~Order |
|---|---|---|
| Edge cache (CDN) | ~85% hit rate; ~5M peak QPS × ~$0.01 / GB egress | ~$30k / month |
| Typeahead service | ~50 cores × $200/mo across regions | ~$10k / month |
| Offline pipeline (Flink) | ~200 cores, runs ~25% of the day | ~$5k / month |
| Index storage | ~5 GB × redundancy | ~$200 / month |
| Personalisation | Re-rank service + user-embeddings store | ~$15k / month |
| Safety pipeline | Classifier + blocklist + audit | ~$5k / month |
The CDN egress bill dominates. Reduce it with shorter responses (just the top-5, not top-30); compress with Brotli; deduplicate the JSON envelope.
11 · Trade-offs & what would change at 10×
| If… | Then… |
|---|---|
| 10× QPS | Multi-region edge cache (already there); add per-language pools so trending in Japan doesn't pollute the US cache. Tune typeahead service to less per-request work; pre-compute personalisation features. |
| Sub-25-ms P99 | Move the index closer to the user. Each edge POP runs a typeahead-service replica with the language-relevant index in RAM. The origin is just for cache misses on long-tail prefixes. |
| Real-time trending integrated | The offline pipeline goes from 15 min to 1 min. Or split: 15-min index for the long tail, 1-min "trending overlay" for the top 100 prefixes. |
| Multi-modal (image / voice queries) | Voice typeahead is a different system (acoustic model + text completion fusion). Image is a different problem entirely. Keep them on separate paths. |
| What would a senior+ answer add? | The A/B-test infrastructure. Typeahead has hundreds of running experiments. Each request is bucketed; results logged; CTR / engagement metrics fed back into the scoring function. Acknowledging this — even briefly — is what separates a competent answer from a memorable one. |
Further reading
- Elasticsearch — "Completion Suggester". Production reference for FST-based typeahead. The official guide is small and worth reading once.
- Apache Lucene — Robert Muir, "FSTs in Lucene". The data-structure background for FST/DAWG-based suggesters.
- Google — "How Google Search builds suggestions". A surprisingly candid 2017 SRE blog post on the offline pipeline architecture.
- Bing autosuggest team — "Building autosuggest at scale". Discusses the rank-by-CTR feedback loop and safety filtering.
- LinkedIn — "Building typeahead with three sources". Engineering-blog post on combining global + connection-graph + personal sources.
- Adjacent: LSM trees. The storage engine under most KV-style suggester indexes.
- Adjacent: Trie data structure. The mechanism explainer.
- Adjacent: Top-k / trending. The pipeline shape reuses many ideas.