A notification system
Multi-channel — push, email, SMS — with the regulator who reads your privacy policy watching. The technical core is the fan-out queue and the per-channel adapter; what makes this an interesting question is the surrounding cluster of concerns: idempotency across retries, dedupe across rapid-fire events, channel selection when the user's preferences disagree with your defaults, regulated delivery (CAN-SPAM, GDPR, TCPA), and the unsubscribe flow that actually works.
1 · Clarifying questions
| Question | Answer |
|---|---|
| What channels? | Push (APNs / FCM / web push), email (SES / SendGrid), SMS (Twilio / Pinpoint). The pipeline is generic; adapters are channel-specific. |
| Volume? | 200M users, 10 notifications/user/day on average → 2B/day across all channels. Email dominates, push is in the middle, SMS is rare and expensive. |
| Latency? | Transactional (login OTP, payment confirmation) ≤ 5 s P99. Marketing/digest can be batched (minutes to hours). |
| Reliability? | At-least-once with idempotent dedupe. Lost notifications are bad; double notifications are worse. |
| User preferences? | Per-user, per-category. The user can opt out of any category and any channel; we honour it within minutes. |
| Regulatory? | CAN-SPAM (US email), TCPA (US SMS), GDPR (EU), CASL (Canada). Honour unsubscribes within 10 business days (the CAN-SPAM limit), maintain a suppression list, keep opt-in proofs. Real legal risk if we miss this. |
| Templates? | Yes — multi-language, variable substitution, A/B-able. Stored as versioned config; rendered at send-time. |
| Multi-region? | Yes. Region-pinned for data residency; provider selection by region (e.g., Twilio + local SMS aggregators). |
2 · Capacity math, on a napkin
| Number | Calculation | Result |
|---|---|---|
| Total notifications / day | given | 2B |
| Channel split | ~70% email · 25% push · 5% SMS | 1.4B / 500M / 100M |
| Notification QPS (avg / peak) | 2B / 86,400 s; peak = 5 × avg | ~23K / ~120K |
| Email QPS peak | ×0.7 | ~80K |
| Push QPS peak | ×0.25 | ~30K |
| SMS QPS peak | ×0.05 | ~6K |
| Per-notification storage | id + payload + audit | ~1.5 KB |
| Storage / 90 d retention | 2B × 1.5 KB × 90 | ~270 TB |
| SMS cost | 100M × $0.005 | ~$500K / day → cost-controlled aggressively |
| Email cost (SES rate) | 1.4B/day × $0.0001 | ~$140K / day |
| Push cost | ~free at APNs/FCM | — |
The cost math says SMS dominates. Most of the design's cost discipline goes into "do not send an SMS the user could have received as a push" — channel selection is half the architecture.
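The napkin numbers above can be reproduced in a few lines. This is only a sketch of the arithmetic: the channel split, unit prices, and the 5× peak factor are the table's own assumptions, and the variable names are illustrative.

```python
SECONDS_PER_DAY = 86_400
PEAK_FACTOR = 5  # assumed peak-to-average ratio, as in the table

total_per_day = 200_000_000 * 10  # users x notifications per user per day
split = {"email": 0.70, "push": 0.25, "sms": 0.05}
unit_cost = {"email": 0.0001, "push": 0.0, "sms": 0.005}  # USD per send

avg_qps = total_per_day / SECONDS_PER_DAY
peak_qps = avg_qps * PEAK_FACTOR

per_channel = {}
for channel, share in split.items():
    sends = total_per_day * share
    per_channel[channel] = {
        "sends_per_day": sends,
        "peak_qps": peak_qps * share,
        "cost_per_day": sends * unit_cost[channel],
    }

# 1.5 KB per notification, kept for 90 days, expressed in decimal terabytes
storage_tb = total_per_day * 1.5e3 * 90 / 1e12

print(f"avg {avg_qps:,.0f} QPS, peak {peak_qps:,.0f} QPS, storage {storage_tb:.0f} TB")
for channel, row in per_channel.items():
    print(f"{channel}: peak {row['peak_qps']:,.0f} QPS, ${row['cost_per_day']:,.0f}/day")
```

Changing the SMS share in `split` shows how sensitive the daily bill is to channel selection.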
3 · API and data model
Producer API (called by every other service)
POST /v1/notifications                        # send a notification
{
  "user_id": "u_abc",
  "category": "payment_confirmation",         # required; routes to template + prefs
  "params": {"amount": "$42.00", "merchant": "..."},
  "idempotency_key": "pmt_xyz_2026-05-09",    # client-provided; dedupe key
  "channels": ["push", "email"],              # optional override
  "priority": "transactional"                 # transactional | bulk
}
→ 202 {"notification_id": "n_aB3x9Q2"}

POST /v1/notifications/bulk                   # marketing / digest fan-out
{
  "audience_query": "active_users_who_purchased_last_7d",
  "category": "weekly_digest",
  "template_id": "digest_v3",
  "schedule_at": "2026-05-10T09:00:00Z"
}
→ 202 {"job_id": "job_..."}
User-facing endpoints (preferences)
GET /v1/users/:id/preferences # categories × channels
PUT /v1/users/:id/preferences/:category # opt in/out
GET /v1/users/:id/devices # registered tokens
POST /v1/users/:id/devices # register a new APNs/FCM token
DELETE /v1/users/:id/devices/:device_id
Storage
notifications                                 -- 90-day retention
  notification_id   BIGINT PRIMARY KEY
  user_id           BIGINT
  category          VARCHAR(64)
  channels_planned  JSONB                     -- e.g., ["push","email"]
  channels_sent     JSONB                     -- after delivery attempts
  idempotency_key   VARCHAR(64) UNIQUE        -- producer-provided
  payload           JSONB
  status            VARCHAR(16)               -- queued|sent|failed|suppressed
  created_at        TIMESTAMP
  INDEX (user_id, created_at)
  INDEX (idempotency_key)

user_preferences
  user_id           BIGINT
  category          VARCHAR(64)
  channel           VARCHAR(16)
  enabled           BOOLEAN
  PRIMARY KEY ((user_id), category, channel)

devices                                       -- push tokens
  device_id         UUID PRIMARY KEY
  user_id           BIGINT
  platform          VARCHAR(16)               -- ios|android|web
  token             TEXT
  last_seen         TIMESTAMP
  status            VARCHAR(16)               -- active|inactive|revoked

suppressions                                  -- regulatory bounces
  channel           VARCHAR(16)
  identifier        VARCHAR(255)              -- email | phone
  reason            VARCHAR(64)               -- hard_bounce | unsubscribed | complained
  added_at          TIMESTAMP
  PRIMARY KEY ((channel), identifier)
4 · High-level architecture
Producers post to the API. The API performs the idempotency check, fetches the user's preferences and the template version, then publishes one Kafka event per (notification, channel) tuple. The dispatcher consumes each event, calls render-svc, and hands the rendered message to the matching channel adapter. Adapters own per-provider retries, rate limiting, and cost tracking.
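The fan-out step can be sketched as a pure function that turns one accepted notification into one event per channel. The topic naming (`notifications.<channel>`), the choice of `user_id` as the partition key, and the payload fields are assumptions for illustration; the actual publish call to Kafka is left out.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ChannelEvent:
    topic: str    # hypothetical per-channel topic name
    key: str      # partition key; user_id keeps one user's events ordered
    value: bytes  # serialized payload handed to the producer client


def fan_out(notification_id, user_id, category, channels, template_version):
    """Build one event per (notification, channel) tuple."""
    events = []
    for channel in channels:
        payload = {
            "notification_id": notification_id,
            "user_id": user_id,
            "category": category,
            "channel": channel,
            "template_version": template_version,
        }
        events.append(
            ChannelEvent(
                topic=f"notifications.{channel}",
                key=str(user_id),
                value=json.dumps(payload, sort_keys=True).encode("utf-8"),
            )
        )
    return events


events = fan_out("n_aB3x9Q2", "u_abc", "payment_confirmation", ["push", "email"], "v3")
for event in events:
    print(event.topic, event.key)
```

Splitting per channel at publish time lets each adapter fleet scale and fail independently, which is what the retry section relies on.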
5 · The hard part — channel selection & deduplication
Two flavours of "don't send this twice" land here. They look similar and are not.
Idempotency at the API
Every producer call carries an idempotency_key. The notifications
table has a unique index on it. A second call with the same key returns the
original notification — no second send. This catches retry storms from upstream
services.
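A minimal sketch of that behaviour, using an in-memory SQLite table in place of the real store. The table is trimmed to the relevant columns, the `producer` column and the `submit` helper are hypothetical, and the uniqueness constraint is scoped per producer as an assumption (the schema above puts it on the key alone).

```python
import sqlite3
import uuid


def open_store():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE notifications (
               notification_id TEXT PRIMARY KEY,
               producer        TEXT NOT NULL,
               idempotency_key TEXT NOT NULL,
               status          TEXT NOT NULL,
               UNIQUE (producer, idempotency_key))"""
    )
    return conn


def submit(conn, producer, idempotency_key):
    """Return (notification_id, created). A replayed key returns the original id."""
    new_id = "n_" + uuid.uuid4().hex[:12]
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO notifications VALUES (?, ?, ?, 'queued')",
                (new_id, producer, idempotency_key),
            )
        return new_id, True
    except sqlite3.IntegrityError:
        # The unique constraint fired: this is a retry, so look up the original.
        row = conn.execute(
            "SELECT notification_id FROM notifications "
            "WHERE producer = ? AND idempotency_key = ?",
            (producer, idempotency_key),
        ).fetchone()
        return row[0], False


conn = open_store()
first_id, first_created = submit(conn, "payments", "pmt_xyz_2026-05-09")
retry_id, retry_created = submit(conn, "payments", "pmt_xyz_2026-05-09")
print(first_created, retry_created, first_id == retry_id)
```

Letting the database constraint decide, rather than a read-then-write check, keeps the path safe when two retries race.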
Cross-event dedupe (semantic)
A user gets 5 likes in 30 seconds. We don't want to send 5 push notifications. The pattern is "coalescing" — the producer sends each event, the dispatcher groups recent same-category events for a user, and the adapter sends one consolidated notification at the end of a short window (10–30 s for push, 5 minutes for email).
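Before the Redis-backed outline, here is an in-memory sketch of just the windowing logic, with time passed in explicitly so it is easy to reason about. The `Coalescer` class and its method names are hypothetical; a production version would keep this state in a shared store.

```python
from collections import defaultdict


class Coalescer:
    """Groups events per (user, category) and releases them when the window closes."""

    def __init__(self, window_s):
        self.window_s = window_s
        self._pending = defaultdict(list)  # (user, category) -> event ids
        self._due_at = {}                  # (user, category) -> flush time

    def add(self, user_id, category, event_id, now):
        key = (user_id, category)
        self._pending[key].append(event_id)
        # Only the first event of a window schedules the flush.
        self._due_at.setdefault(key, now + self.window_s)

    def flush_due(self, now):
        """Return {(user, category): [event ids]} for every window that has closed."""
        ready = [key for key, due in self._due_at.items() if due <= now]
        batches = {}
        for key in ready:
            batches[key] = self._pending.pop(key)
            del self._due_at[key]
        return batches


coalescer = Coalescer(window_s=20)
for second in range(5):
    coalescer.add("u_abc", "likes", f"evt_{second}", now=second)

early = coalescer.flush_due(now=10)  # window still open
batch = coalescer.flush_due(now=20)  # window closed: one batch of five events
print(len(early), {key: len(ids) for key, ids in batch.items()})
```

The window starts at the first event and is not extended by later ones, so a steady stream of likes still produces a notification every `window_s` seconds.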
# Coalescing window per (user, category)
key:  coalesce:u_abc:likes
zset: { event_id_1: ts_1, event_id_2: ts_2, ... }
TTL:  longer than the window (60 s covers a 30 s push window; email needs more than 5 min)

# Dispatcher logic
on receive event(user, category, event_id, ts):
    ZADD coalesce:user:category ts event_id      # Redis order is score, then member
    if ZCARD coalesce:user:category == 1:
        schedule send_now_or_after(window)
    else:
        no-op                                     # already scheduled

on timer fires:
    events = ZRANGE coalesce:user:category 0 -1
    DEL coalesce:user:category                    # Redis has no ZDEL; DEL removes the whole set
    n = count(events)
    template = pick("you have {n} new likes" if n > 1 else "alice liked your post")
    send
Channel selection
The decision tree at send time:
- Look up user_preferences[user, category]. Filter channels to those the user has opted in to.
- For each remaining channel, check suppressions: is this email/phone hard-bounced or unsubscribed? If so, drop it.
- For transactional priority, send to all remaining channels.
- For bulk, prefer cheap channels: push first, email if no push, SMS only as a deliberate product choice.
- If the user is currently online (active push session in the last 5 min), suppress email for this notification; they'll see the push.
- Mark the chosen channels in channels_planned; emit one Kafka event per channel.
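The decision tree above can be sketched as one function. The function and parameter names are hypothetical, preferences and suppressions are passed in as plain sets, and `allow_bulk_sms` stands in for the "deliberate product choice". The steps are applied in the order listed, so the online rule also trims transactional email; whether that is right for receipts is a product decision.

```python
def select_channels(requested, opted_in, suppressed, priority,
                    online_recently=False, allow_bulk_sms=False):
    """Return the channels to plan for one notification, in preference order."""
    # Steps 1-2: keep channels the user opted in to and that are not suppressed.
    remaining = [c for c in requested if c in opted_in and c not in suppressed]

    if priority == "transactional":
        # Step 3: transactional goes to every remaining channel.
        chosen = list(remaining)
    elif "push" in remaining:
        # Step 4: bulk prefers the cheapest channel that is available.
        chosen = ["push"]
    elif "email" in remaining:
        chosen = ["email"]
    elif "sms" in remaining and allow_bulk_sms:
        chosen = ["sms"]
    else:
        chosen = []

    # Step 5: an online user will see the push, so skip the email.
    if online_recently and "push" in chosen:
        chosen = [c for c in chosen if c != "email"]
    return chosen


everything = ["push", "email", "sms"]
print(select_channels(everything, {"push", "email"}, set(), "transactional"))
print(select_channels(everything, {"email", "sms"}, set(), "bulk"))
```

The returned list is what would be written to channels_planned before the per-channel events are emitted.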
6 · Retries, dead letters, and the bounce loop
Each adapter implements retry independently — the rates, error codes, and bounce semantics are all channel-specific.
| Channel | Retry policy | Bounce handling |
|---|---|---|
| Push | 1 immediate retry on transient APNs error. Drop after that — the message is stale. | "InvalidToken" → mark device inactive, retry on a different device. |
| Email | Exponential backoff up to 24 h for soft bounces. Hard bounces: never retry. | SES bounce webhook → suppression list immediately. Complaint webhook → unsubscribe automatically. |
| SMS | 1 immediate retry. SMS retry is expensive (real money) and the user usually doesn't need a duplicate. | Carrier reject → suppression. STOP keyword → unsubscribe + suppression. Bounce-rate alerts on the adapter. |
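A sketch of how the table's policies might be encoded. The outcome labels (`transient`, `hard_bounce`, and so on), the action names, and the 60 s base delay are hypothetical; only the limits (one retry for push and SMS, backoff capped at 24 h for email) come from the table. Jitter is omitted to keep the schedule deterministic.

```python
def email_backoff_schedule(base_s=60, factor=2.0, max_total_s=24 * 3600):
    """Delays between soft-bounce retries, stopping before 24 h have elapsed in total."""
    delays, elapsed, delay = [], 0.0, float(base_s)
    while elapsed + delay <= max_total_s:
        delays.append(delay)
        elapsed += delay
        delay *= factor
    return delays


EMAIL_DELAYS = email_backoff_schedule()
RETRY_LIMIT = {"push": 1, "sms": 1, "email": len(EMAIL_DELAYS)}
SUPPRESS_OUTCOMES = {"hard_bounce", "complaint", "carrier_reject", "stop_keyword"}


def next_action(channel, outcome, attempts_so_far):
    """Return 'retry', 'suppress', 'deactivate_device', or 'drop'."""
    if outcome in SUPPRESS_OUTCOMES:
        return "suppress"           # never retried; address goes on the suppression list
    if outcome == "invalid_token":
        return "deactivate_device"  # push only: mark the device inactive
    if outcome == "transient":
        return "retry" if attempts_so_far < RETRY_LIMIT[channel] else "drop"
    return "drop"


print(len(EMAIL_DELAYS), sum(EMAIL_DELAYS))
print(next_action("push", "transient", 0), next_action("push", "transient", 1))
```

Keeping the limits in data rather than in each adapter's control flow makes the per-channel differences easy to audit.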
7 · Templates, A/B, and rendering
- Templates as config. Versioned in Git, deployed to a key-value store. Each notification carries a template_id and the params; the renderer combines them.
- Multi-language. Locale-keyed template variants. Locale is on the user record; fallback to English.
- A/B. Two template variants under one logical id; dispatcher hashes (user_id, template_id) into a bucket. Outcomes (open, click, conversion) tracked back via a separate analytics pipeline.
- Rendering at send-time. Don't pre-render at queue time; the user's locale or preference may have changed. The renderer is stateless, so it scales out with request volume.
- Sanitisation. Strip HTML from any user-generated content interpolated into templates. The comment poster's name is the most common XSS vector for email bodies.
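Two of the points above, deterministic A/B bucketing and sanitising interpolated values, fit in a short sketch. The helper names and the `$name` placeholder syntax (Python's `string.Template`) are assumptions, and the sketch escapes HTML rather than stripping it, which is a swapped-in but related technique.

```python
import hashlib
import html
from string import Template


def ab_bucket(user_id, template_id, variants=("A", "B")):
    """Hash (user_id, template_id) so a user always lands in the same variant."""
    digest = hashlib.sha256(f"{user_id}:{template_id}".encode("utf-8")).digest()
    return variants[int.from_bytes(digest[:8], "big") % len(variants)]


def render_html(template_text, params):
    """Interpolate params into an HTML template, escaping every value first."""
    safe = {name: html.escape(str(value)) for name, value in params.items()}
    return Template(template_text).substitute(safe)


variant = ab_bucket("u_abc", "digest_v3")
body = render_html("$commenter commented on your post", {"commenter": "<b>Mallory</b>"})
print(variant)
print(body)
```

Hashing on the pair rather than on user_id alone means a user's bucket in one experiment does not determine their bucket in another.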
8 · Failure modes & runbook
| Failure | Symptom | Mitigation |
|---|---|---|
| Provider outage (e.g., SES) | Email queue grows; bounce/throttle errors spike | Failover to secondary provider (SendGrid). Producer-pushed circuit breaker; degraded-channel banner in admin UI. |
| APNs token-flood | FCM/APNs return rate-limit; throughput drops | Per-app rate-limit at the adapter; queue depth alarm; back off. |
| Producer flood (event storm) | Notification API saturating | Per-producer rate limit at the API; coalescing absorbs more; non-transactional category pushed to longer windows. |
| Idempotency-key collision | "This notification was already sent" — but actually a real second event | Idempotency keys must be globally unique per producer; namespace by producer. Audit when collisions detected. |
| Suppression list query slow | Dispatcher latency rises | Bloom filter cached at the adapter answers definite misses; the full lookup runs only on a filter hit. |
| Bad template deployed | 50% of recipients receive broken email | Template canary deployment; auto-rollback on bounce-rate spike. Feature-flag every template change. |
| Cost runaway | SMS bill spikes 10× | Per-category daily budget alarm; auto-disable channel above threshold; on-call paged. |
| Region partition | Cross-region notifications queue | Region-pinned producer/dispatcher; cross-region drains when partition heals; recipients see notifications in order on arrival. |
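The suppression-list mitigation in the table relies on a Bloom filter never giving a false negative. A minimal sketch, with hypothetical class and function names and a SHA-256-based double-hashing scheme chosen for illustration:

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits=1 << 20, num_hashes=7):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step for double hashing
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def might_contain(self, item):
        return all(self.bits[pos >> 3] & (1 << (pos & 7)) for pos in self._positions(item))


def is_suppressed(channel, identifier, bloom, authoritative_lookup):
    """Skip the slow lookup when the filter proves the identifier is absent."""
    if not bloom.might_contain(f"{channel}:{identifier}"):
        return False  # definite miss: no false negatives
    return authoritative_lookup(channel, identifier)  # possible hit: confirm


suppressed_rows = {("email", "bounced@example.com")}
bloom = BloomFilter()
for row_channel, row_identifier in suppressed_rows:
    bloom.add(f"{row_channel}:{row_identifier}")


def lookup(channel, identifier):
    return (channel, identifier) in suppressed_rows


print(is_suppressed("email", "bounced@example.com", bloom, lookup))
print(is_suppressed("email", "ok@example.com", bloom, lookup))
```

The filter has to be refreshed whenever a bounce or unsubscribe lands, otherwise a stale filter would report a newly suppressed address as a definite miss.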
9 · Cost & SLOs
| Line | Estimate | Note |
|---|---|---|
| Notification API + dispatcher (200 pods) | ~$8K / month | Stateless; auto-scaled |
| Render svc (50 pods) | ~$3K / month | Template fetch + interpolation |
| Channel adapters (3 fleets, 100 pods each) | ~$10K / month | One per channel |
| Kafka (3-region) | ~$10K / month | Per-channel partition keys |
| Postgres + Cassandra (notifications + prefs) | ~$15K / month | 270 TB notifications history; tier after 90 d |
| Email provider (SES) | ~$4.2M / month | 1.4B sends/day × 30 × $0.0001 |
| SMS provider (Twilio) | ~$15M / month | 100M sends/day × 30 × $0.005; the dominant line |
| Push (APNs/FCM) | ~$0 | Free at our volume |
| Total ex-SMS | ~$4.25M / month | Add SMS for $15M unless aggressively gated |
SLOs
- Transactional P99: 5 s end-to-end. API → Kafka → dispatcher → adapter → provider → user device. The provider hop takes the largest share of the latency budget.
- Bulk P99: 30 minutes. Coalescing + scheduled delivery; tighter for the digest jobs.
- Delivery rate: ≥ 95% per channel. Lower than 95% triggers reputation review.
- Bounce-to-suppression latency: P99 ≤ 60 s. Bounces must propagate before the next dispatch fires.
- Unsubscribe-to-stop: P99 ≤ 5 minutes. Regulatory; not an SLO we can negotiate.
10 · Trade-offs & "what would you change at 10×"
| If… | Then… |
|---|---|
| 10× volume (20B / day) | Region-pin everything; per-region provider pools; the SMS bill becomes the strategic question (push more transactional flows to push/email, gate marketing harder). |
| Real-time digest (sub-minute) | Move from cron-driven to event-driven coalescing windows; the Redis ZADD pattern above scales horizontally with consistent hashing on user_id. |
| Notification preferences by ML | Per-user model predicting "will they engage if we send this in this channel". Drops volume 30–50% at the cost of an inference dependency on the hot path. |
| End-to-end encrypted notifications | The push payload becomes a "fetch this when you open the app" tombstone. Render-svc on the device. Lose lock-screen previews; gain real privacy. |
| Self-hosted email/SMS | Reasonable above ~10B emails/year for cost; never reasonable for SMS (telco licensing). The hidden cost is sender reputation — buy it with a managed provider for the first decade. |
| "What would a more senior answer add?" | The strategy layer: cross-channel attribution, frequency capping across the entire user surface, the GDPR/CCPA data-deletion automation, the regulator audit trail. Plus the integration with the growth team's experimentation platform; most designs leave A/B as a TODO, while the next-level design treats the experiment platform as a first-class peer. |
Further reading
- Slack Engineering — "Reducing Slack's memory footprint". Tangential, but the section on push-fan-out is widely cited.
- Pinterest Engineering — "Building Pinterest's notification platform". The clearest public writeup of a multi-channel pipeline at scale; the abstraction layers map almost 1:1 onto this design.
- Lyft Engineering — "Asynchronous notifications at Lyft". The Kafka-fan-out architecture in production; useful operational details.
- Twilio — "Programmable Messaging API". Read the SMS API docs; the failure modes and cost model are foundational.
- Amazon SES — "Best practices for sending email". The bounce/complaint handling section is required reading for any email-sending system.
- FTC — "CAN-SPAM Act: A Compliance Guide for Business". Short. The mistake-cost is real — read it.
- Adjacent: Chat. Push-when-offline plumbing overlaps significantly with chat's offline path.
- Adjacent: Message queues. The Kafka layer.
- Adjacent: Idempotency. The producer-side pattern.