11 / 19
Playbook / 11

Top-k / trending

The problem the interviewer phrases as "trending tweets" or "the 100 most-viewed videos in the last hour". Underneath, it is a question about counting at a scale where exact counts stop being practical — and about reconciling a fast, approximate real-time path with a slower, exact batch path. The core ideas are count-min sketches, a small heap of heavy hitters, and a decay window that defines what "now" means.


1 · Clarifying questions

The word "top" is doing a lot of work in the prompt. Five minutes of questions usually settles the design before any boxes go on the board.

What does "top" mean?Raw count of events, or a weighted engagement score (like + 3× retweet + 5× reply)? Different aggregation primitive; sometimes a different sketch.
What's the time window?Last hour, last 24 hours, since midnight? Sliding (continuous) or tumbling (per fixed bucket)? Sliding is harder; tumbling composes from per-minute counts.
How accurate does it need to be?Exact ordering, or "95% of the time the top-100 is correct"? Approximate counters trade memory for a known error bound.
How many distinct items?Cardinality M drives the sketch size. 1M items behaves very differently from 1B.
How many top results?k = 10 fits trivially in memory. k = 100K is a different problem — usually backed by a sorted index, not a heap.
Freshness budget?Real-time (sub-second), near-real-time (a minute late), or batch (hourly)? Each tier removes a layer of the architecture.
Personalised or global?Global is one query. Per-user trending is a different shape — usually a candidate generator plus a ranking model, out of scope here.
Read traffic on the answer?If the top-k page itself is hot, cache the materialised answer per window rather than recomputing on every request.

2 · Capacity math, on a napkin

Pick numbers that justify the sketch. If the answer fits in a hash map, you don't need any of this — say so out loud, then build the version that doesn't.

NumberCalculationResult
Events / sec (peak)given — likes, views, clicks~500M
Distinct items in windowactive tweets / videos~100M
k (top results)given100
Bytes / eventid + type + ts~50 B
Ingest bandwidth500M × 50 B~25 GB/sec
Exact counter table100M items × ~256 B row~25 GB per window
Count-min sketch5 hashes × ~2M cells × 8 B~80 MB per window
Heavy-hitters heap1000 candidates × ~64 B~64 KB per window
Storage horizon, hotper-minute rollups, 7 days~10,000 buckets
Storage horizon, warmper-hour rollups, 90 days~2,160 buckets

The two numbers that matter are 25 GB/sec of ingest and 100M distinct items per window. The first sets the scale of the ingest tier; the second is why we reach for a sketch rather than a hash map of exact counters.

3 · API and data model

The wire surface is small. The state behind it is where the design lives.

Endpoints

POST /v1/event              # fire-and-forget, batched by client
{"item_id":"t_8AQ9", "type":"view", "ts":1715990400}
→ 202

GET  /v1/top?window=1h&k=100   # the read path
→ 200 {"window":"1h","computed_at":"2026-05-18T12:03:00Z",
       "results":[{"item_id":"t_8AQ9","score":4129312}, ...]}

GET  /v1/top?window=24h&k=10&segment=us-en
→ 200 {...}                    # optional segment dimension

State

per-window state (real-time path)
  cms              count-min sketch       # fixed-size matrix of counters
  candidates       min-heap of size k+δ   # heavy-hitters set
  decay            per-bucket timestamps  # for sliding-window roll-off

materialised answer
  top_k:{window}   list[item_id, score]   # cached, regenerated every N seconds

historical store (batch path)
  events_raw                              # all events, partitioned by minute
  rollup_1m, rollup_1h                    # exact counts, computed offline

The read path never touches the sketch directly. It reads the materialised top-k list, which the real-time worker regenerates from the sketch and the heavy-hitter heap on a fixed cadence (every 5–15 seconds is typical).

4 · High-level architecture

Four tiers. Ingest absorbs the firehose. Real-time workers maintain the sketch and the heap. A historical store keeps exact counts for backfill and audit. A small query API serves the cached answer.

Real-time and batch run side by side. The real-time worker maintains the sketch and emits the materialised top-k into Redis. The batch path reads the raw event log from Kafka into ClickHouse, computes exact rollups, and reconciles the real-time numbers within a few minutes.

5 · The hard part — count-min sketch and heavy hitters

Exact counting at this scale is impractical: 100M distinct items per window times a counter cell is ~25 GB of state, replicated, per shard, per window. The count-min sketch trades that for a fixed-size matrix and a known overestimate.

A count-min sketch is a matrix of d rows by w columns of integer counters, plus d independent hash functions. To record an event for item x, increment cell[i][h_i(x) mod w] for every row. To estimate the count of x, take the minimum across the d cells it hashed into. The minimum is the tightest bound because every collision is an overestimate, never an underestimate.

# count-min sketch — increment and query
class CMS:
    def __init__(self, d=5, w=2_000_000):
        self.d, self.w = d, w
        self.cells = [[0] * w for _ in range(d)]
        self.seeds = [random_seed() for _ in range(d)]

    def add(self, x, weight=1):
        for i in range(self.d):
            j = hash(self.seeds[i], x) % self.w
            self.cells[i][j] += weight

    def estimate(self, x):
        return min(self.cells[i][hash(self.seeds[i], x) % self.w]
                   for i in range(self.d))

With w = 2M and d = 5, the error bound is roughly 2N / w with probability 1 - (1/2)^d. For 500M events/sec into a one-hour window, that's an overestimate of a few thousand per item with ~97% confidence — small relative to the millions of counts that the top-100 items actually carry.

The sketch alone doesn't tell you who is in the top-k; it only answers point queries. The heavy-hitters trick is to keep a small min-heap of size k + δ alongside the sketch. On every increment, look up the item's current estimate; if it exceeds the heap's minimum, promote it into the heap and evict the current minimum. The heap converges on the true top-k as long as δ is generous enough to absorb the sketch's overestimate.

Why two structures and not one. The sketch is excellent at "how many times did I see this item" and useless at "give me the items I've seen the most". The heap is the opposite. Together they cover both queries with bounded memory; either alone falls short.

Decay and windowing

For tumbling windows, keep a separate sketch per bucket; drop the bucket when it rolls off. For sliding windows, the cleanest pattern is a ring of small sketches (per-minute) that the worker sums when serving a query — at the cost of a small additional error from summing independent over-estimates. Exponential decay (multiply every counter by 0.99 per tick) is tempting but tends to bias toward recent items in a way that's hard to reason about under a sketch's error bound.

6 · Failure modes & runbook

What goes wrong, what oncall sees, what the runbook says. The interview-room version of the operational story.

FailureSymptomMitigation
Sketch poisoningAn attacker engineers item IDs that all collide on the same cells, inflating their estimateSeed the hash functions per shard at startup; rotate seeds on suspicion; rate-limit per source ID upstream of the sketch.
Counter overflowA viral item drives a cell beyond 232; further increments wrap to zeroUse 64-bit counters by default; alert on any cell crossing 80% of the cap; auto-promote the window to wider counters.
Lost eventsWorker crashes mid-batch; Kafka redelivers, counts doubleDedupe by event ID in a per-window Bloom filter; idempotent increments keyed on (item_id, event_id).
Backfill driftThe batch-computed top-100 disagrees with the real-time top-100Expected within the sketch's error bound. Alert only when the symmetric difference exceeds a threshold (say, 5 items). Investigate as a correctness bug, not a glitch.
Heap saturationHeap candidates churn faster than the sketch can rank them; top-k oscillatesIncrease δ (oversample the heap); add hysteresis so a candidate must beat the threshold for N consecutive ticks before promotion.
Kafka partition skewOne item drives most traffic; one worker pegs at 100% CPUPartition by hash(item_id) by default; for known hot items, fall back to round-robin and reconcile via a secondary sketch.
Cache stale on queryThe materialised top-k is older than the freshness SLOWorkers heartbeat the cache write; query API serves a "stale" flag past 30 s and falls back to a previous window.

7 · Cost & operability

The pleasant surprise of the sketch is that its memory is fixed regardless of how many distinct items show up. A few hundred MB per window covers the entire workload. The bill is dominated by the ingest tier and the historical store.

LineEstimateNote
Kafka (500M ev/s, 7-day retention)~$40K / monthLargest line. Compression and protobuf encoding push it down ~3×.
Real-time workers~$8K / month~200 pods; sketch + heap fits in < 1 GB each.
Sketch memorynegligible~80 MB × ~24 windows × replicas — fits in worker RAM.
Batch tier (Spark / Flink)~$6K / monthPer-minute rollups; idle most of the time.
ClickHouse (rollups, 90 days)~$5K / monthColumnar; compresses ~10× on event data.
Top-k cache (Redis)~$200 / monthA few MB per window; trivial.
Total~$60K / monthPer million MAU, similar to a heavy timeline product.

Tuning the sketch dimensions is the main operability lever. Wider sketches (larger w) reduce the error proportionally but cost RAM linearly; more rows (larger d) tighten the confidence bound but cost extra hash work per event. Most production systems land near d = 5 and pick w to deliver the accuracy the product actually asks for.

8 · Trade-offs & what's next

The interesting follow-ups split into architectural choices and product extensions.

If…Then…
You want one pipeline instead of twoThe kappa architecture: keep only the streaming pipeline, replay history through it when the schema or aggregation changes. Lower operational surface; harder backfills.
You want reconciliation guaranteesThe lambda architecture: serve the real-time top-k now, overwrite it with the batch result within a few minutes. Two paths, two codepaths, two opportunities to drift.
The product wants personalised top-kThe global sketch becomes a candidate generator; a per-user ranking model reorders. That's a different system — a feature store and a model-serving tier replace the heap.
The product wants per-segment top-kMaintain one sketch per segment (country, language, topic). Memory scales linearly with the segment count; cap segments aggressively.
Exact counts are requiredDrop the sketch. Use a sharded counter table with a hash map per shard. Capacity math changes by orders of magnitude; the product asks for it less often than you'd think.
"What would a more senior answer add?"The anti-abuse layer: bot filtering, source-IP rate limits, signal de-weighting. Top-k rankings are an attractive target; the sketch's correctness depends on the inputs being honest.

Further reading

  • Cormode & Muthukrishnan — "An Improved Data Stream Summary: The Count-Min Sketch and its Applications" (2005). The original paper. Short and readable; the bound proofs are worth the hour.
  • Manku & Motwani — "Approximate Frequency Counts over Data Streams" (2002). The lossy-counting algorithm; an alternative to CMS + heap with a different error model.
  • Metwally, Agrawal, El Abbadi — "Efficient Computation of Frequent and Top-k Elements in Data Streams" (2005). The Space-Saving algorithm, often used in place of the CMS-plus-heap combination.
  • Twitter Engineering — "Trending Topics". The closest public writeup of a real top-k system at scale.
  • Jay Kreps — "Questioning the Lambda Architecture" (2014). The kappa-vs-lambda essay; required reading before defending either in a design round.
  • Adjacent: Streaming aggregation. Windowing, watermarks, and exactly-once semantics.
  • Adjacent: Bloom filter simulator. The probabilistic-structure intuition transfers directly.
  • Adjacent: Caching strategies. How to think about the materialised top-k cache.
Next · 12 / 19

Distributed scheduler

Running millions of jobs across a fleet — leader election, work stealing, fairness, and the failure modes you only see at scale.

Open the next walkthrough
Found this useful?