Top-k / trending
The problem the interviewer phrases as "trending tweets" or "the 100 most-viewed videos in the last hour". Underneath, it is a question about counting at a scale where exact counts stop being practical — and about reconciling a fast, approximate real-time path with a slower, exact batch path. The core ideas are count-min sketches, a small heap of heavy hitters, and a decay window that defines what "now" means.
1 · Clarifying questions
The word "top" is doing a lot of work in the prompt. Five minutes of questions usually settles the design before any boxes go on the board.
| What does "top" mean? | Raw count of events, or a weighted engagement score (like + 3× retweet + 5× reply)? Different aggregation primitive; sometimes a different sketch. |
| What's the time window? | Last hour, last 24 hours, since midnight? Sliding (continuous) or tumbling (per fixed bucket)? Sliding is harder; tumbling composes from per-minute counts. |
| How accurate does it need to be? | Exact ordering, or "95% of the time the top-100 is correct"? Approximate counters trade memory for a known error bound. |
| How many distinct items? | Cardinality M drives the sketch size. 1M items behaves very differently from 1B. |
| How many top results? | k = 10 fits trivially in memory. k = 100K is a different problem — usually backed by a sorted index, not a heap. |
| Freshness budget? | Real-time (sub-second), near-real-time (a minute late), or batch (hourly)? Each tier removes a layer of the architecture. |
| Personalised or global? | Global is one query. Per-user trending is a different shape — usually a candidate generator plus a ranking model, out of scope here. |
| Read traffic on the answer? | If the top-k page itself is hot, cache the materialised answer per window rather than recomputing on every request. |
2 · Capacity math, on a napkin
Pick numbers that justify the sketch. If the answer fits in a hash map, you don't need any of this — say so out loud, then build the version that doesn't.
| Number | Calculation | Result |
|---|---|---|
| Events / sec (peak) | given — likes, views, clicks | ~500M |
| Distinct items in window | active tweets / videos | ~100M |
| k (top results) | given | 100 |
| Bytes / event | id + type + ts | ~50 B |
| Ingest bandwidth | 500M × 50 B | ~25 GB/sec |
| Exact counter table | 100M items × ~256 B row | ~25 GB per window |
| Count-min sketch | 5 hashes × ~2M cells × 8 B | ~80 MB per window |
| Heavy-hitters heap | 1000 candidates × ~64 B | ~64 KB per window |
| Storage horizon, hot | per-minute rollups, 7 days | ~10,000 buckets |
| Storage horizon, warm | per-hour rollups, 90 days | ~2,160 buckets |
The two numbers that matter are 25 GB/sec of ingest and 100M distinct items per window. The first sets the scale of the ingest tier; the second is why we reach for a sketch rather than a hash map of exact counters.
3 · API and data model
The wire surface is small. The state behind it is where the design lives.
Endpoints
POST /v1/event # fire-and-forget, batched by client
{"item_id":"t_8AQ9", "type":"view", "ts":1715990400}
→ 202
GET /v1/top?window=1h&k=100 # the read path
→ 200 {"window":"1h","computed_at":"2026-05-18T12:03:00Z",
"results":[{"item_id":"t_8AQ9","score":4129312}, ...]}
GET /v1/top?window=24h&k=10&segment=us-en
→ 200 {...} # optional segment dimensionState
per-window state (real-time path)
cms count-min sketch # fixed-size matrix of counters
candidates min-heap of size k+δ # heavy-hitters set
decay per-bucket timestamps # for sliding-window roll-off
materialised answer
top_k:{window} list[item_id, score] # cached, regenerated every N seconds
historical store (batch path)
events_raw # all events, partitioned by minute
rollup_1m, rollup_1h # exact counts, computed offlineThe read path never touches the sketch directly. It reads the materialised top-k list, which the real-time worker regenerates from the sketch and the heavy-hitter heap on a fixed cadence (every 5–15 seconds is typical).
4 · High-level architecture
Four tiers. Ingest absorbs the firehose. Real-time workers maintain the sketch and the heap. A historical store keeps exact counts for backfill and audit. A small query API serves the cached answer.
Real-time and batch run side by side. The real-time worker maintains the sketch and emits the materialised top-k into Redis. The batch path reads the raw event log from Kafka into ClickHouse, computes exact rollups, and reconciles the real-time numbers within a few minutes.
5 · The hard part — count-min sketch and heavy hitters
Exact counting at this scale is impractical: 100M distinct items per window times a counter cell is ~25 GB of state, replicated, per shard, per window. The count-min sketch trades that for a fixed-size matrix and a known overestimate.
A count-min sketch is a matrix of d rows by w columns of
integer counters, plus d independent hash functions. To record an event for
item x, increment cell[i][h_i(x) mod w] for every row. To
estimate the count of x, take the minimum across the d cells
it hashed into. The minimum is the tightest bound because every collision is an
overestimate, never an underestimate.
# count-min sketch — increment and query
class CMS:
def __init__(self, d=5, w=2_000_000):
self.d, self.w = d, w
self.cells = [[0] * w for _ in range(d)]
self.seeds = [random_seed() for _ in range(d)]
def add(self, x, weight=1):
for i in range(self.d):
j = hash(self.seeds[i], x) % self.w
self.cells[i][j] += weight
def estimate(self, x):
return min(self.cells[i][hash(self.seeds[i], x) % self.w]
for i in range(self.d))With w = 2M and d = 5, the error bound is roughly
2N / w with probability 1 - (1/2)^d. For 500M events/sec
into a one-hour window, that's an overestimate of a few thousand per item with
~97% confidence — small relative to the millions of counts that the top-100 items
actually carry.
The sketch alone doesn't tell you who is in the top-k; it only answers point queries.
The heavy-hitters trick is to keep a small min-heap of size k + δ
alongside the sketch. On every increment, look up the item's current estimate; if it
exceeds the heap's minimum, promote it into the heap and evict the current minimum.
The heap converges on the true top-k as long as δ is generous enough
to absorb the sketch's overestimate.
Decay and windowing
For tumbling windows, keep a separate sketch per bucket; drop the bucket when it
rolls off. For sliding windows, the cleanest pattern is a ring of small sketches
(per-minute) that the worker sums when serving a query — at the cost of a small
additional error from summing independent over-estimates. Exponential decay (multiply
every counter by 0.99 per tick) is tempting but tends to bias toward
recent items in a way that's hard to reason about under a sketch's error bound.
6 · Failure modes & runbook
What goes wrong, what oncall sees, what the runbook says. The interview-room version of the operational story.
| Failure | Symptom | Mitigation |
|---|---|---|
| Sketch poisoning | An attacker engineers item IDs that all collide on the same cells, inflating their estimate | Seed the hash functions per shard at startup; rotate seeds on suspicion; rate-limit per source ID upstream of the sketch. |
| Counter overflow | A viral item drives a cell beyond 232; further increments wrap to zero | Use 64-bit counters by default; alert on any cell crossing 80% of the cap; auto-promote the window to wider counters. |
| Lost events | Worker crashes mid-batch; Kafka redelivers, counts double | Dedupe by event ID in a per-window Bloom filter; idempotent increments keyed on (item_id, event_id). |
| Backfill drift | The batch-computed top-100 disagrees with the real-time top-100 | Expected within the sketch's error bound. Alert only when the symmetric difference exceeds a threshold (say, 5 items). Investigate as a correctness bug, not a glitch. |
| Heap saturation | Heap candidates churn faster than the sketch can rank them; top-k oscillates | Increase δ (oversample the heap); add hysteresis so a candidate must beat the threshold for N consecutive ticks before promotion. |
| Kafka partition skew | One item drives most traffic; one worker pegs at 100% CPU | Partition by hash(item_id) by default; for known hot items, fall back to round-robin and reconcile via a secondary sketch. |
| Cache stale on query | The materialised top-k is older than the freshness SLO | Workers heartbeat the cache write; query API serves a "stale" flag past 30 s and falls back to a previous window. |
7 · Cost & operability
The pleasant surprise of the sketch is that its memory is fixed regardless of how many distinct items show up. A few hundred MB per window covers the entire workload. The bill is dominated by the ingest tier and the historical store.
| Line | Estimate | Note |
|---|---|---|
| Kafka (500M ev/s, 7-day retention) | ~$40K / month | Largest line. Compression and protobuf encoding push it down ~3×. |
| Real-time workers | ~$8K / month | ~200 pods; sketch + heap fits in < 1 GB each. |
| Sketch memory | negligible | ~80 MB × ~24 windows × replicas — fits in worker RAM. |
| Batch tier (Spark / Flink) | ~$6K / month | Per-minute rollups; idle most of the time. |
| ClickHouse (rollups, 90 days) | ~$5K / month | Columnar; compresses ~10× on event data. |
| Top-k cache (Redis) | ~$200 / month | A few MB per window; trivial. |
| Total | ~$60K / month | Per million MAU, similar to a heavy timeline product. |
Tuning the sketch dimensions is the main operability lever. Wider sketches (larger
w) reduce the error proportionally but cost RAM linearly; more rows
(larger d) tighten the confidence bound but cost extra hash work per
event. Most production systems land near d = 5 and pick w
to deliver the accuracy the product actually asks for.
8 · Trade-offs & what's next
The interesting follow-ups split into architectural choices and product extensions.
| If… | Then… |
|---|---|
| You want one pipeline instead of two | The kappa architecture: keep only the streaming pipeline, replay history through it when the schema or aggregation changes. Lower operational surface; harder backfills. |
| You want reconciliation guarantees | The lambda architecture: serve the real-time top-k now, overwrite it with the batch result within a few minutes. Two paths, two codepaths, two opportunities to drift. |
| The product wants personalised top-k | The global sketch becomes a candidate generator; a per-user ranking model reorders. That's a different system — a feature store and a model-serving tier replace the heap. |
| The product wants per-segment top-k | Maintain one sketch per segment (country, language, topic). Memory scales linearly with the segment count; cap segments aggressively. |
| Exact counts are required | Drop the sketch. Use a sharded counter table with a hash map per shard. Capacity math changes by orders of magnitude; the product asks for it less often than you'd think. |
| "What would a more senior answer add?" | The anti-abuse layer: bot filtering, source-IP rate limits, signal de-weighting. Top-k rankings are an attractive target; the sketch's correctness depends on the inputs being honest. |
Further reading
- Cormode & Muthukrishnan — "An Improved Data Stream Summary: The Count-Min Sketch and its Applications" (2005). The original paper. Short and readable; the bound proofs are worth the hour.
- Manku & Motwani — "Approximate Frequency Counts over Data Streams" (2002). The lossy-counting algorithm; an alternative to CMS + heap with a different error model.
- Metwally, Agrawal, El Abbadi — "Efficient Computation of Frequent and Top-k Elements in Data Streams" (2005). The Space-Saving algorithm, often used in place of the CMS-plus-heap combination.
- Twitter Engineering — "Trending Topics". The closest public writeup of a real top-k system at scale.
- Jay Kreps — "Questioning the Lambda Architecture" (2014). The kappa-vs-lambda essay; required reading before defending either in a design round.
- Adjacent: Streaming aggregation. Windowing, watermarks, and exactly-once semantics.
- Adjacent: Bloom filter simulator. The probabilistic-structure intuition transfers directly.
- Adjacent: Caching strategies. How to think about the materialised top-k cache.