11 / 19

Playbook / 11

Top-k / trending

The problem the interviewer phrases as "trending tweets" or "the 100 most-viewed videos in the last hour". Underneath, it is a question about counting at a scale where exact counts stop being practical — and about reconciling a fast, approximate real-time path with a slower, exact batch path. The core ideas are count-min sketches, a small heap of heavy hitters, and a decay window that defines what "now" means.

1 · Clarifying questions

The word "top" is doing a lot of work in the prompt. Five minutes of questions usually settles the design before any boxes go on the board.

What does "top" mean?	Raw count of events, or a weighted engagement score (like + 3× retweet + 5× reply)? Different aggregation primitive; sometimes a different sketch.
What's the time window?	Last hour, last 24 hours, since midnight? Sliding (continuous) or tumbling (per fixed bucket)? Sliding is harder; tumbling composes from per-minute counts.
How accurate does it need to be?	Exact ordering, or "95% of the time the top-100 is correct"? Approximate counters trade memory for a known error bound.
How many distinct items?	Cardinality M drives the sketch size. 1M items behaves very differently from 1B.
How many top results?	k = 10 fits trivially in memory. k = 100K is a different problem — usually backed by a sorted index, not a heap.
Freshness budget?	Real-time (sub-second), near-real-time (a minute late), or batch (hourly)? Each tier removes a layer of the architecture.
Personalised or global?	Global is one query. Per-user trending is a different shape — usually a candidate generator plus a ranking model, out of scope here.
Read traffic on the answer?	If the top-k page itself is hot, cache the materialised answer per window rather than recomputing on every request.

2 · Capacity math, on a napkin

Pick numbers that justify the sketch. If the answer fits in a hash map, you don't need any of this — say so out loud, then build the version that doesn't.

Number	Calculation	Result
Events / sec (peak)	given — likes, views, clicks	~500M
Distinct items in window	active tweets / videos	~100M
k (top results)	given	100
Bytes / event	id + type + ts	~50 B
Ingest bandwidth	500M × 50 B	~25 GB/sec
Exact counter table	100M items × ~256 B row	~25 GB per window
Count-min sketch	5 hashes × ~2M cells × 8 B	~80 MB per window
Heavy-hitters heap	1000 candidates × ~64 B	~64 KB per window
Storage horizon, hot	per-minute rollups, 7 days	~10,000 buckets
Storage horizon, warm	per-hour rollups, 90 days	~2,160 buckets

The two numbers that matter are 25 GB/sec of ingest and 100M distinct items per window. The first sets the scale of the ingest tier; the second is why we reach for a sketch rather than a hash map of exact counters.

3 · API and data model

The wire surface is small. The state behind it is where the design lives.

Endpoints

POST /v1/event              # fire-and-forget, batched by client
{"item_id":"t_8AQ9", "type":"view", "ts":1715990400}
→ 202

GET  /v1/top?window=1h&k=100   # the read path
→ 200 {"window":"1h","computed_at":"2026-05-18T12:03:00Z",
       "results":[{"item_id":"t_8AQ9","score":4129312}, ...]}

GET  /v1/top?window=24h&k=10&segment=us-en
→ 200 {...}                    # optional segment dimension

State

per-window state (real-time path)
  cms              count-min sketch       # fixed-size matrix of counters
  candidates       min-heap of size k+δ   # heavy-hitters set
  decay            per-bucket timestamps  # for sliding-window roll-off

materialised answer
  top_k:{window}   list[item_id, score]   # cached, regenerated every N seconds

historical store (batch path)
  events_raw                              # all events, partitioned by minute
  rollup_1m, rollup_1h                    # exact counts, computed offline

The read path never touches the sketch directly. It reads the materialised top-k list, which the real-time worker regenerates from the sketch and the heavy-hitter heap on a fixed cadence (every 5–15 seconds is typical).

4 · High-level architecture

Four tiers. Ingest absorbs the firehose. Real-time workers maintain the sketch and the heap. A historical store keeps exact counts for backfill and audit. A small query API serves the cached answer.

Real-time and batch run side by side. The real-time worker maintains the sketch and emits the materialised top-k into Redis. The batch path reads the raw event log from Kafka into ClickHouse, computes exact rollups, and reconciles the real-time numbers within a few minutes.

5 · The hard part — count-min sketch and heavy hitters

Exact counting at this scale is impractical: 100M distinct items per window times a counter cell is ~25 GB of state, replicated, per shard, per window. The count-min sketch trades that for a fixed-size matrix and a known overestimate.

A count-min sketch is a matrix of d rows by w columns of integer counters, plus d independent hash functions. To record an event for item x, increment cell[i][h_i(x) mod w] for every row. To estimate the count of x, take the minimum across the d cells it hashed into. The minimum is the tightest bound because every collision is an overestimate, never an underestimate.

# count-min sketch — increment and query
class CMS:
    def __init__(self, d=5, w=2_000_000):
        self.d, self.w = d, w
        self.cells = [[0] * w for _ in range(d)]
        self.seeds = [random_seed() for _ in range(d)]

    def add(self, x, weight=1):
        for i in range(self.d):
            j = hash(self.seeds[i], x) % self.w
            self.cells[i][j] += weight

    def estimate(self, x):
        return min(self.cells[i][hash(self.seeds[i], x) % self.w]
                   for i in range(self.d))

With w = 2M and d = 5, the error bound is roughly 2N / w with probability 1 - (1/2)^d. For 500M events/sec into a one-hour window, that's an overestimate of a few thousand per item with ~97% confidence — small relative to the millions of counts that the top-100 items actually carry.

The sketch alone doesn't tell you who is in the top-k; it only answers point queries. The heavy-hitters trick is to keep a small min-heap of size k + δ alongside the sketch. On every increment, look up the item's current estimate; if it exceeds the heap's minimum, promote it into the heap and evict the current minimum. The heap converges on the true top-k as long as δ is generous enough to absorb the sketch's overestimate.

Why two structures and not one. The sketch is excellent at "how many times did I see this item" and useless at "give me the items I've seen the most". The heap is the opposite. Together they cover both queries with bounded memory; either alone falls short.

Decay and windowing

For tumbling windows, keep a separate sketch per bucket; drop the bucket when it rolls off. For sliding windows, the cleanest pattern is a ring of small sketches (per-minute) that the worker sums when serving a query — at the cost of a small additional error from summing independent over-estimates. Exponential decay (multiply every counter by 0.99 per tick) is tempting but tends to bias toward recent items in a way that's hard to reason about under a sketch's error bound.

6 · Failure modes & runbook

What goes wrong, what oncall sees, what the runbook says. The interview-room version of the operational story.

Failure	Symptom	Mitigation
Sketch poisoning	An attacker engineers item IDs that all collide on the same cells, inflating their estimate	Seed the hash functions per shard at startup; rotate seeds on suspicion; rate-limit per source ID upstream of the sketch.
Counter overflow	A viral item drives a cell beyond 2³²; further increments wrap to zero	Use 64-bit counters by default; alert on any cell crossing 80% of the cap; auto-promote the window to wider counters.
Lost events	Worker crashes mid-batch; Kafka redelivers, counts double	Dedupe by event ID in a per-window Bloom filter; idempotent increments keyed on (item_id, event_id).
Backfill drift	The batch-computed top-100 disagrees with the real-time top-100	Expected within the sketch's error bound. Alert only when the symmetric difference exceeds a threshold (say, 5 items). Investigate as a correctness bug, not a glitch.
Heap saturation	Heap candidates churn faster than the sketch can rank them; top-k oscillates	Increase `δ` (oversample the heap); add hysteresis so a candidate must beat the threshold for N consecutive ticks before promotion.
Kafka partition skew	One item drives most traffic; one worker pegs at 100% CPU	Partition by hash(item_id) by default; for known hot items, fall back to round-robin and reconcile via a secondary sketch.
Cache stale on query	The materialised top-k is older than the freshness SLO	Workers heartbeat the cache write; query API serves a "stale" flag past 30 s and falls back to a previous window.

7 · Cost & operability

The pleasant surprise of the sketch is that its memory is fixed regardless of how many distinct items show up. A few hundred MB per window covers the entire workload. The bill is dominated by the ingest tier and the historical store.

Line	Estimate	Note
Kafka (500M ev/s, 7-day retention)	~$40K / month	Largest line. Compression and protobuf encoding push it down ~3×.
Real-time workers	~$8K / month	~200 pods; sketch + heap fits in < 1 GB each.
Sketch memory	negligible	~80 MB × ~24 windows × replicas — fits in worker RAM.
Batch tier (Spark / Flink)	~$6K / month	Per-minute rollups; idle most of the time.
ClickHouse (rollups, 90 days)	~$5K / month	Columnar; compresses ~10× on event data.
Top-k cache (Redis)	~$200 / month	A few MB per window; trivial.
Total	~$60K / month	Per million MAU, similar to a heavy timeline product.

Tuning the sketch dimensions is the main operability lever. Wider sketches (larger w) reduce the error proportionally but cost RAM linearly; more rows (larger d) tighten the confidence bound but cost extra hash work per event. Most production systems land near d = 5 and pick w to deliver the accuracy the product actually asks for.

8 · Trade-offs & what's next

The interesting follow-ups split into architectural choices and product extensions.

If…	Then…
You want one pipeline instead of two	The kappa architecture: keep only the streaming pipeline, replay history through it when the schema or aggregation changes. Lower operational surface; harder backfills.
You want reconciliation guarantees	The lambda architecture: serve the real-time top-k now, overwrite it with the batch result within a few minutes. Two paths, two codepaths, two opportunities to drift.
The product wants personalised top-k	The global sketch becomes a candidate generator; a per-user ranking model reorders. That's a different system — a feature store and a model-serving tier replace the heap.
The product wants per-segment top-k	Maintain one sketch per segment (country, language, topic). Memory scales linearly with the segment count; cap segments aggressively.
Exact counts are required	Drop the sketch. Use a sharded counter table with a hash map per shard. Capacity math changes by orders of magnitude; the product asks for it less often than you'd think.
"What would a more senior answer add?"	The anti-abuse layer: bot filtering, source-IP rate limits, signal de-weighting. Top-k rankings are an attractive target; the sketch's correctness depends on the inputs being honest.

Top-k / trending

1 · Clarifying questions

2 · Capacity math, on a napkin

3 · API and data model

Endpoints

State

4 · High-level architecture

5 · The hard part — count-min sketch and heavy hitters

Decay and windowing

6 · Failure modes & runbook

7 · Cost & operability

8 · Trade-offs & what's next

Further reading

Distributed scheduler