Event ingestion

The canonical "ad-click counter" question, asked at billing-grade correctness. Every event has to land, every event has to be counted once, and the customer-facing dashboard has to be fresh enough to be useful. The shape of the answer is familiar — a Kafka tier, idempotency keys, a columnar store for rollups, a fraud filter — but the details are where the design earns its keep. Most candidates get the pipeline right and the accounting wrong.

1 · Clarifying questions

Five minutes. The correctness bar is the question that changes the architecture more than any other — pin it down first.

Events per second at peak?	Assume 1M events/sec sustained, 5× burst at peak hours. That puts the ingest tier in "Kafka, not RabbitMQ" territory and forces sharded consumers.
What's the correctness bar?	Three tiers: estimate (analytics dashboards), audited (internal reporting), billing-grade (advertisers get invoiced from these numbers). Assume billing-grade — it dominates the design.
Fraud filtering required?	Yes. Click fraud is ~10–20% of raw traffic in ads; uncaught fraud directly inflates customer bills. Filter at ingest, refine in batch.
Freshness budget for the customer dashboard?	~60 seconds for real-time numbers; final billing numbers can be hours late. Two tiers of freshness with different correctness bars.
Historical query horizon?	90 days hot (sub-second queries), 5 years warm (minute-class queries), beyond that archived. Tiering drives most of the storage cost.
Customer query shape?	Mostly rollups by hour or day, scoped to a campaign. Top-K queries (top spending advertisers, top-converting creatives). Ad-hoc drill-down is rare but must work.
Schema stability?	Events evolve — new fraud signals, new device fields. Schema registry with backward-compatible evolution required.
Multi-tenant isolation?	Yes, soft. One advertiser's runaway campaign should not block another's ingest. Per-tenant quotas at the edge.

2 · Capacity math, on a napkin

Storage is the dominant cost line; the math here decides whether the design is affordable. Run it before drawing boxes.

Number	Calculation	Result
Events / sec (sustained)	given	1M
Events / sec (peak)	5× sustained	5M
Bytes / event (avg)	raw JSON, post-validate	~200 B
Ingest bandwidth	1M × 200 B	~200 MB/sec
Raw / day	200 MB × 86,400	~17 TB
Raw / year	×365	~6 PB
Columnar compressed	~10× (typical ClickHouse)	~600 TB / year
Kafka retention (7 days)	17 TB × 7 × 3 (RF)	~360 TB
Hot rollup cache (Redis)	~1M campaigns × 24 hourly bins × 64 B	~1.5 GB / day; ~50 GB at 30 days
Consumer parallelism	200 MB/sec / 10 MB/sec per partition	~64–128 partitions per topic

6 PB/year of raw bytes is what shapes the architecture. Without 10× columnar compression the design isn't affordable. Without tiering (hot → warm → cold) the design isn't affordable at 5 years.

3 · API and data model

Batch ingest at the edge. Three storage shapes, each serving a different read pattern.

Endpoints

POST /v1/events                # batch ingest from advertiser SDKs / pixel
{"events":[
  {"event_id":"01HXY...", "advertiser_id":"A123", "campaign_id":"C9",
   "click_ts":1715990400123, "user_hash":"sha256:...",
   "ip":"203.0.113.7", "ua":"Mozilla/...", "referrer":"...",
   "fraud_signals":{"bot_score":0.04, "vpn":false}}
]}
→ 202 {"accepted":N, "rejected":[{"event_id":"...", "reason":"schema"}]}

GET /v1/reports/clicks         # customer-facing rollups
?campaign_id=C9&from=2026-05-17T00&to=2026-05-18T00&granularity=hour
→ 200 {"buckets":[{"ts":"2026-05-17T00", "clicks":12044, "billable":11890}, ...]}

GET /v1/reports/top            # top-K within window
?advertiser_id=A123&metric=spend&k=10&window=24h
→ 200 {"top":[{"campaign_id":"C9", "spend":1240.55}, ...]}

Schemas

-- raw events (immutable, append-only, columnar)
events_raw
  event_id      ULID PRIMARY KEY        -- client-supplied; idempotency key
  advertiser_id VARCHAR
  campaign_id   VARCHAR
  click_ts      TIMESTAMP(3)            -- ms precision
  ingest_ts     TIMESTAMP(3)            -- server time, for watermarking
  user_hash     CHAR(64)
  ip            INET
  ua            TEXT
  referrer      TEXT
  fraud_signals JSONB                   -- bot_score, vpn, ip_reputation, ...
  partition_key UINT32                  -- hash(campaign_id) % 128
  PARTITION BY toDate(click_ts), partition_key   -- ClickHouse-style

-- aggregated rollups (the customer query target)
clicks_by_hour
  campaign_id   VARCHAR
  hour          TIMESTAMP               -- truncated to hour, in UTC
  clicks_total  UINT64                  -- including suspected fraud
  clicks_billable UINT64                -- post fraud filter
  spend_micros  UINT64                  -- billable × CPC, in micro-units
  PRIMARY KEY (campaign_id, hour)

-- hot cache (Redis), TTL ~ 1 hour
rollup:<campaign_id>:<hour>  →  {clicks_total, clicks_billable, spend_micros}

The event_id is the idempotency key — client-generated, ULID for time-ordering. Consumers dedupe on it. The partition_key spreads a single hot campaign across multiple Kafka partitions; without it, one viral campaign starves every other consumer on that partition.

4 · High-level architecture

Edge in front, Kafka in the middle, two-tier storage at the back. The real-time path gets the customer their numbers within a minute; the batch path corrects them.

Edge validates and timestamps. Kafka is the durable spine — every event lands here before any storage. The realtime consumer maintains Redis rollups for the dashboard; the batch job is the source of truth for billing. Fraud signals flow in asynchronously and update the batch rollups.

5 · Deep dive — exactly-once accounting

This is the part most candidates fumble. "Exactly-once" as a delivery guarantee is famously contentious; the production pattern is more honest about what's actually achievable.

Kafka gives at-least-once delivery by default — a consumer can crash after processing a message but before committing its offset, and the message will be redelivered. The accepted answer is at-least-once delivery plus idempotency at the consumer. The client generates an event_id (a ULID); the consumer's write is keyed on it; a duplicate write is a no-op. The result is exactly-once effects, regardless of how many times the message is delivered.

For billing-grade correctness, idempotency at one tier isn't enough. The pattern is two-tier reconciliation: the real-time path is approximate and optimised for freshness; the batch path is authoritative and optimised for correctness. The customer dashboard shows the real-time number, usually within a few percent of the truth. A scheduled batch job (hourly or daily) recomputes the rollups from the raw event log and overwrites the rollup table. Customer-visible numbers shift slightly when the batch runs; that shift is what gets invoiced.

Why the real-time path can be approximate. A click that arrives 60 seconds late is not in the real-time rollup yet, but it is in the raw log. The real-time number undercounts. A duplicate click that slipped past dedupe overcounts. Both are corrected when the batch job recomputes from the raw log. The invariant is: the raw log is the ledger; everything else is a cache of it.

Concretely: the consumer does a conditional write into ClickHouse keyed on event_id (or maintains a Bloom filter + lookup table of recently-seen IDs). The batch job runs SELECT COUNT(DISTINCT event_id) ... GROUP BY campaign_id, hour from the raw table and overwrites clicks_by_hour. The dashboard reads from clicks_by_hour with a fallback to Redis for the current hour.

6 · Failure modes & runbook

Each of these has bitten a real ad network. The fixes are not exotic; the interesting part is recognising the failure before it hits production.

Failure	Symptom	Mitigation
Duplicate events from client retries	Inflated click counts; advertiser over-billed	Client-generated `event_id` (ULID); consumer dedupes on it. Keep a rolling 7-day dedupe window in the consumer.
Out-of-order events (late arrivals)	Real-time rollup undercounts; batch path corrects	Windowed processing with a watermark — e.g. close the 14:00 hour window at 14:15 in real-time, but the batch job re-reads everything within 24 hours.
Kafka partition skew	One campaign with 1000× more clicks saturates its partition; everyone behind it lags	Partition by `hash(campaign_id, sub_key)` with a sub-key that spreads hot campaigns across 8–16 partitions. Or use a separate "hot tenants" topic.
Click fraud (bots, click farms)	Inflated billable counts; advertiser disputes invoice	Rate-limit per IP and per user_hash at the edge. Anomaly detector flags suspicious patterns. Mark in `fraud_signals`; exclude from `clicks_billable`.
Real-time / batch discrepancy	Dashboard number jumps when batch lands	Surface both numbers in the UI ("real-time: 12,044 · billable: 11,890 as of 14:00"). Cap the allowed discrepancy and alert oncall if > 5%.
ClickHouse node failure	One shard slow; query API P99 rises	Replication factor 2+ per shard; query router fails over. Hedge queries to replicas above 200 ms.
Schema-incompatible event	Consumer crashes on parse	Schema registry enforces backward compatibility on producer registration. Unparseable events go to a dead-letter topic, never block the live consumer.
S3 cold-tier query during business hours	Customer dashboard slow on 1-year drill-down	Pre-aggregate to day granularity for warm tier; sub-day queries against cold tier are async and async-emailed.

7 · Cost & operability

Storage dominates. The ingest tier is CPU- and network-bound but cheap relative to the petabytes of retained data. A ballpark in 2026 cloud prices:

Line	Estimate	Note
Ingest service (100 pods)	~$5K / month	4 vCPU × 8 GB; managed K8s
Kafka cluster (12 brokers, 7d retention)	~$25K / month	~360 TB on NVMe; cross-AZ replication
Realtime consumers + fraud	~$8K / month	~50 pods, 8 vCPU each
ClickHouse (90d hot, ~150 TB compressed)	~$30K / month	12-node cluster, NVMe-backed
S3 (warm, ~3 PB)	~$60K / month	Standard tier; drops to ~$15K on Infrequent Access
S3 Glacier (cold, 5y archive)	~$5K / month	Retrieval is async
Redis (hot rollups, 50 GB)	~$2K / month	3-replica cluster
Total	~$135K / month	Dominated by warm storage; aggressive tiering cuts 30–50%

Tiering policy

7 days hot. ClickHouse, sub-second queries, full row detail.
90 days warm. ClickHouse + Parquet on S3 Standard, queries via the same engine, second-class latency.
5 years cold. Parquet on S3 Glacier or equivalent. Pre-aggregated to day granularity. Retrieval is async; customer-facing queries beyond 90 days return "ready in N minutes" and email a link.

8 · Trade-offs & what's next

The last five minutes of the interview. The questions below are the ones a strong interviewer reaches for once the core design holds together.

Question	How to answer
Kappa vs Lambda?	Lambda (separate real-time + batch paths) is what was described; Kappa (one stream-processing job that does both) is tidier but harder to operate. The honest answer in 2026: Kappa with Flink works for greenfield; Lambda is what most production systems still run because the batch path is easier to reason about for billing.
Schema evolution?	Avro or Protobuf with a schema registry (Confluent or self-hosted). Enforce backward compatibility on producer registration. Forward compatibility for consumers means new fields are optional and old consumers ignore them.
Real-time fraud vs batch fraud?	Real-time catches the obvious cases (rate-limits, known bad IPs) and protects ingest. Batch is where the heavy ML scores run — they're too expensive per-event but cheap in aggregate. Real-time prevents 80% of fraud spend; batch reclaims the rest in the daily reconciliation.
Per-tenant isolation?	Soft isolation via per-tenant rate limits at the edge and consumer-side quotas. Hard isolation (separate Kafka topics or clusters per large tenant) is reserved for the few customers whose volume justifies the operational overhead.
GDPR right-to-delete in a columnar store?	The hard part. Columnar stores are append-optimised; per-row delete is expensive. Standard pattern: tombstone the `user_hash` in a separate deletion log, filter at query time, and compact (rewrite the affected partitions) on a scheduled job. Some teams use partition-level deletes for users whose data fits a single partition.
What would you change at 10×?	Move the ingest tier to a sharded write-ahead log (e.g. Pulsar with tiered storage) so retention doesn't compete with throughput for disk. Push fraud scoring into the ingest pod so the consumer doesn't have to. Pre-aggregate at the producer for the highest-volume tenants.

Event ingestion

1 · Clarifying questions

2 · Capacity math, on a napkin

3 · API and data model

Endpoints

Schemas

4 · High-level architecture

5 · Deep dive — exactly-once accounting

6 · Failure modes & runbook

7 · Cost & operability

Tiering policy

8 · Trade-offs & what's next

Further reading

Object storage