
A notification system

Multi-channel — push, email, SMS — with the regulator who reads your privacy policy watching. The technical core is the fan-out queue and the per-channel adapter; what makes this an interesting question is the surrounding cluster of concerns: idempotency across retries, dedupe across rapid-fire events, channel selection when the user's preferences disagree with your defaults, regulated delivery (CAN-SPAM, GDPR, TCPA), and the unsubscribe flow that actually works.


1 · Clarifying questions

What channels? Push (APNs / FCM / web push), email (SES / SendGrid), SMS (Twilio / Pinpoint). The pipeline is generic; adapters are channel-specific.
Volume? 200M users, 10 notifications/user/day on average → 2B/day across all channels. Email dominates, push is in the middle, SMS is rare and expensive.
Latency? Transactional (login OTP, payment confirmation) ≤ 5 s P99. Marketing/digest can be batched (minutes to hours).
Reliability? At-least-once with idempotent dedupe. Lost notifications are bad; double notifications are worse.
User preferences? Per-user, per-category. The user can opt out of any category and any channel; we honour it within minutes.
Regulatory? CAN-SPAM (US email), TCPA (US SMS), GDPR (EU), CASL (Canada). Unsubscribe within 10 days, suppression list, opt-in proofs. Real legal risk if we miss this.
Templates? Yes: multi-language, variable substitution, A/B-able. Stored as versioned config; rendered at send-time.
Multi-region? Yes. Region-pinned for data residency; provider selection by region (e.g., Twilio + local SMS aggregators).

2 · Capacity math, on a napkin

| Number | Calculation | Result |
|---|---|---|
| Total notifications / day | given | 2B |
| Channel split | ~70% email · 25% push · 5% SMS | 1.4B / 500M / 100M |
| Notification QPS (avg / peak) | 2B / 86,400; peak = avg × 5 | ~23K / ~120K |
| Email QPS peak | peak × 0.7 | ~80K |
| Push QPS peak | peak × 0.25 | ~30K |
| SMS QPS peak | peak × 0.05 | ~6K |
| Per-notification storage | id + payload + audit | ~1.5 KB |
| Storage / 90 d retention | 2B × 1.5 KB × 90 | ~270 TB |
| SMS cost | 100M × $0.005 | ~$500K / day → cost-controlled aggressively |
| Email cost (SES rate) | 1.4B/day × $0.0001 | ~$140K / day |
| Push cost | | ~free at APNs/FCM |

The cost math says SMS dominates. Most of the design's cost discipline goes into "do not send an SMS the user could have received as a push" — channel selection is half the architecture.
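
The napkin arithmetic can be reproduced in a few lines. This is only a sketch of the table's own inputs (2B/day, the 70/25/5 split, a 5× peak factor, 1.5 KB per record, and the quoted per-message prices); the function and variable names are made up for illustration.

```python
SECONDS_PER_DAY = 86_400


def napkin(total_per_day: float, split: dict, peak_factor: float = 5.0) -> dict:
    """Return average QPS, peak QPS, and peak QPS per channel."""
    avg_qps = total_per_day / SECONDS_PER_DAY
    peak_qps = avg_qps * peak_factor
    return {
        "avg_qps": avg_qps,
        "peak_qps": peak_qps,
        "peak_by_channel": {ch: peak_qps * share for ch, share in split.items()},
    }


TOTAL = 2e9                                   # notifications per day (given)
SPLIT = {"email": 0.70, "push": 0.25, "sms": 0.05}
qps = napkin(TOTAL, SPLIT)

storage_tb = TOTAL * 1.5e3 * 90 / 1e12        # 1.5 KB per record, 90 days, in TB
sms_cost_per_day = TOTAL * SPLIT["sms"] * 0.005
email_cost_per_day = TOTAL * SPLIT["email"] * 0.0001
```

The exact peak comes out near 116K QPS; the table rounds it up to ~120K, which is the right direction to round when sizing capacity.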

3 · API and data model

Producer API (called by every other service)

POST /v1/notifications # send a notification
{
 "user_id": "u_abc",
 "category": "payment_confirmation", # required; routes to template + prefs
 "params": {"amount": "$42.00", "merchant": "..."},
 "idempotency_key": "pmt_xyz_2026-05-09", # client-provided; dedupe key
 "channels": ["push", "email"], # optional override
 "priority": "transactional" # transactional | bulk
}
→ 202 {"notification_id": "n_aB3x9Q2"}

POST /v1/notifications/bulk # marketing / digest fan-out
{
 "audience_query": "active_users_who_purchased_last_7d",
 "category": "weekly_digest",
 "template_id": "digest_v3",
 "schedule_at": "2026-05-10T09:00:00Z"
}
→ 202 {"job_id": "job_..."}

User-facing endpoints (preferences)

GET /v1/users/:id/preferences # categories × channels
PUT /v1/users/:id/preferences/:category # opt in/out
GET /v1/users/:id/devices # registered tokens
POST /v1/users/:id/devices # register a new APNs/FCM token
DELETE /v1/users/:id/devices/:device_id

Storage

notifications -- 90-day retention
 notification_id BIGINT PRIMARY KEY
 user_id BIGINT
 category VARCHAR(64)
 channels_planned JSONB -- e.g., ["push","email"]
 channels_sent JSONB -- after delivery attempts
 idempotency_key VARCHAR(64) UNIQUE -- producer-provided
 payload JSONB
 status VARCHAR(16) -- queued|sent|failed|suppressed
 created_at TIMESTAMP
 INDEX (user_id, created_at)
 INDEX (idempotency_key)

user_preferences
 user_id BIGINT
 category VARCHAR(64)
 channel VARCHAR(16)
 enabled BOOLEAN
 PRIMARY KEY ((user_id), category, channel)

devices -- push tokens
 device_id UUID PRIMARY KEY
 user_id BIGINT
 platform VARCHAR(16) -- ios|android|web
 token TEXT
 last_seen TIMESTAMP
 status VARCHAR(16) -- active|inactive|revoked

suppressions -- regulatory bounces
 channel VARCHAR(16)
 identifier VARCHAR(255) -- email | phone
 reason VARCHAR(64) -- hard_bounce | unsubscribed | complained
 added_at TIMESTAMP
 PRIMARY KEY ((channel), identifier)

4 · High-level architecture

Producers post to the API. The API performs the idempotency check, fetches the user's preferences and the template version, then publishes one Kafka event per (notification, channel) tuple. The dispatcher consumes each event, calls render-svc, and hands the result to the matching channel adapter. Adapters own per-provider retry, rate limiting, and cost tracking.

5 · The hard part — channel selection & deduplication

Two flavours of "don't send this twice" land here. They look similar and are not.

Idempotency at the API

Every producer call carries an idempotency_key. The notifications table has a unique index on it. A second call with the same key returns the original notification — no second send. This catches retry storms from upstream services.
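
A minimal sketch of that behaviour, using an in-memory SQLite table as a stand-in for the real notifications store. The helper names (`open_store`, `submit`) and the generated id format are assumptions for illustration; the point is only that the unique constraint, not application logic, decides whether a call is a replay.

```python
import sqlite3
import uuid


def open_store() -> sqlite3.Connection:
    """Create a throwaway table with a unique constraint on the idempotency key."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE notifications ("
        " notification_id TEXT PRIMARY KEY,"
        " idempotency_key TEXT NOT NULL UNIQUE,"
        " category TEXT NOT NULL)"
    )
    return conn


def submit(conn: sqlite3.Connection, idempotency_key: str, category: str):
    """Return (notification_id, created); created is False when the key was seen before."""
    new_id = "n_" + uuid.uuid4().hex[:10]
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO notifications VALUES (?, ?, ?)",
            (new_id, idempotency_key, category),
        )
        created = cur.rowcount == 1  # 0 rows changed means the key already existed
    row = conn.execute(
        "SELECT notification_id FROM notifications WHERE idempotency_key = ?",
        (idempotency_key,),
    ).fetchone()
    return row[0], created
```

Only a call that returns `created == True` should go on to publish Kafka events; a replay returns the original id and stops there.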

Cross-event dedupe (semantic)

A user gets 5 likes in 30 seconds. We don't want to send 5 push notifications. The pattern is "coalescing" — the producer sends each event, the dispatcher groups recent same-category events for a user, and the adapter sends one consolidated notification at the end of a short window (10–30 s for push, 5 minutes for email).

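# Coalescing window per (user, category)
key:  coalesce:u_abc:likes
zset: { event_id_1: ts_1, event_id_2: ts_2, ... }
TTL:  2 × window   # must outlive the timer, or events expire before the send

# Dispatcher logic
on receive event(user, category, event_id, ts):
    key = coalesce:{user}:{category}
    # Run ZADD + EXPIRE + ZCARD as one MULTI/Lua unit so exactly one caller sees a count of 1
    ZADD key ts event_id
    EXPIRE key (2 × window)
    if ZCARD key == 1:
        schedule timer(key, after = window)   # first event opens the window
    # else: a timer is already pending, nothing to do

on timer fires(key):
    MULTI                                     # read and clear atomically
        events = ZRANGE key 0 -1
        DEL key
    EXEC
    if len(events) > 1:
        template = "you have {len(events)} new likes"
    else:
        template = "alice liked your post"
    send(template)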
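
The same windowing logic can be exercised without Redis. The sketch below is an in-memory stand-in, not the production path: the `Coalescer` class and its method names are hypothetical, and timestamps are passed in explicitly so the behaviour is deterministic.

```python
from collections import defaultdict


class Coalescer:
    """In-memory stand-in for the per-(user, category) sorted set."""

    def __init__(self, window_s: float):
        self.window_s = window_s
        self.pending = defaultdict(dict)   # key -> {event_id: ts}
        self.deadline = {}                 # key -> time at which the window closes

    def add(self, user: str, category: str, event_id: str, ts: float) -> None:
        key = (user, category)
        self.pending[key][event_id] = ts   # re-adding an event_id does not add a member
        if key not in self.deadline:       # the first event opens the window
            self.deadline[key] = ts + self.window_s

    def flush_due(self, now: float) -> list:
        """Return (user, category, event_count) for every window that has closed."""
        out = []
        for key in [k for k, d in self.deadline.items() if d <= now]:
            events = self.pending.pop(key)
            del self.deadline[key]
            out.append((key[0], key[1], len(events)))
        return out
```

Five likes arriving within a 20-second window produce a single tuple with a count of 5, which the adapter would render as one consolidated push.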

Channel selection

The decision tree at send time:

  1. Look up user_preferences[user, category]. Filter channels to those the user has opted in to.
  2. For each remaining channel, check suppressions — is this email/phone hard-bounced or unsubscribed? Drop.
  3. For transactional (priority), send to all remaining channels.
  4. For bulk, prefer cheap channels — push first, email if no push, SMS only as a deliberate product choice.
  5. If user is currently online (active push session in the last 5 min), suppress email for this notification — they'll see the push.
  6. Mark the chosen channels in channels_planned; emit one Kafka event per channel.
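
The decision tree above can be sketched as a pure function. Several details are assumptions rather than anything the design specifies: a missing preference row is treated as opted out, bulk picks exactly one channel, the online rule is applied to both priorities (as the list reads), and the parameter names are invented for illustration.

```python
CHEAPEST_FIRST = ["push", "email", "sms"]


def select_channels(candidates, prefs, suppressed, priority,
                    online_recently=False, allow_bulk_sms=False):
    """
    candidates: channels this category may use
    prefs:      {channel: enabled} for this (user, category)
    suppressed: set of channels whose address is on the suppression list
    priority:   "transactional" or "bulk"
    """
    # Steps 1-2: keep opted-in channels, then drop suppressed ones.
    allowed = [c for c in candidates if prefs.get(c, False) and c not in suppressed]

    if priority == "transactional":
        chosen = allowed                       # step 3: every remaining channel
    else:
        ranked = [c for c in CHEAPEST_FIRST if c in allowed]
        if not allow_bulk_sms:                 # SMS only as a deliberate product choice
            ranked = [c for c in ranked if c != "sms"]
        chosen = ranked[:1]                    # step 4: cheapest available channel

    # Step 5: a recently online user will see the push, so skip the email.
    if online_recently and "push" in chosen:
        chosen = [c for c in chosen if c != "email"]
    return chosen                              # step 6: becomes channels_planned
```

Keeping the function free of I/O means the preference and suppression lookups can be batched by the caller and the rules can be unit-tested in isolation.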

6 · Retries, dead letters, and the bounce loop

Each adapter implements retry independently — the rates, error codes, and bounce semantics are all channel-specific.

| Channel | Retry policy | Bounce handling |
|---|---|---|
| Push | 1 immediate retry on transient APNs error. Drop after that; the message is stale. | "InvalidToken" → mark device inactive, retry on a different device. |
| Email | Exponential backoff up to 24 h for soft bounces. Hard bounces: never retry. | SES bounce webhook → suppression list immediately. Complaint webhook → unsubscribe automatically. |
| SMS | 1 immediate retry. SMS retry is expensive (real money) and the user usually doesn't need a duplicate. | Carrier reject → suppression. STOP keyword → unsubscribe + suppression. Bounce-rate alerts on the adapter. |

The bounce loop is the operations story. Bounce webhook → suppression list → next time the dispatcher checks before sending. If you skip the loop, you keep emailing dead addresses, your sender reputation drops, ESP throttles you, delivery rate falls 30%. Run a synthetic monitoring suite that proves bounces actually reach the suppression table.
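
A rough sketch of the email side of that loop: a soft-bounce retry schedule capped at 24 hours, and a handler that writes hard bounces and complaints into a suppression set that the send path consults. The base delay, the doubling factor, and the event-type strings are assumptions; real provider callbacks use their own payload shapes.

```python
def backoff_schedule(base_s: float = 60.0, factor: float = 2.0,
                     cap_s: float = 24 * 3600.0) -> list:
    """Delays between soft-bounce retries; stops before total wait exceeds cap_s."""
    delays, elapsed, delay = [], 0.0, base_s
    while elapsed + delay <= cap_s:
        delays.append(delay)
        elapsed += delay
        delay *= factor
    return delays


suppressions = set()   # stand-in for the suppressions table: (channel, identifier)


def handle_email_event(event_type: str, address: str) -> str:
    """Map a provider callback to the next action for this address."""
    if event_type in ("hard_bounce", "complaint"):
        suppressions.add(("email", address.lower()))
        return "suppress"
    if event_type == "soft_bounce":
        return "retry"
    return "ignore"


def may_send(channel: str, identifier: str) -> bool:
    """The check the dispatcher runs before every send."""
    return (channel, identifier.lower()) not in suppressions
```

A synthetic monitor can drive `handle_email_event` with a known test address and then assert that `may_send` turns false, which is the end-to-end property the paragraph above asks for.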

7 · Templates, A/B, and rendering

  • Templates as config. Versioned in Git, deployed to a key-value store. Each notification carries a template_id and the params; the renderer combines them.
  • Multi-language. Locale-keyed template variants. Locale is on the user record; fallback to English.
  • A/B. Two template variants under one logical id; dispatcher hashes (user_id, template_id) into a bucket. Outcomes (open, click, conversion) tracked back via a separate analytics pipeline.
  • Rendering at send-time. Don't pre-render at queue time — the user's locale or preference may have changed. Renderer is stateless; it fans out per request.
  • Sanitisation. Strip HTML from any user-generated content interpolated into templates. The comment poster's name is the most common XSS vector for email bodies.
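
The bullets above can be sketched together: deterministic A/B bucketing, locale fallback, and render-time sanitisation. The template store, template ids, and function names are hypothetical. Note that this sketch escapes HTML with `html.escape` rather than stripping it, which is a swapped-in but common way to neutralise markup in interpolated values.

```python
import hashlib
import html
from string import Template

# Hypothetical template store: (template_id, variant, locale) -> body
TEMPLATES = {
    ("like_v1", "A", "en"): Template("<p>$actor liked your post</p>"),
    ("like_v1", "B", "en"): Template("<p>New like from $actor</p>"),
}


def ab_bucket(user_id: str, template_id: str, variants=("A", "B")) -> str:
    """Hash (user_id, template_id) into a stable variant."""
    digest = hashlib.sha256(f"{user_id}:{template_id}".encode()).digest()
    return variants[int.from_bytes(digest[:8], "big") % len(variants)]


def render(user_id: str, template_id: str, locale: str, params: dict) -> str:
    """Render at send time with locale fallback to English and escaped params."""
    variant = ab_bucket(user_id, template_id)
    tmpl = TEMPLATES.get((template_id, variant, locale))
    if tmpl is None:
        tmpl = TEMPLATES[(template_id, variant, "en")]
    safe = {k: html.escape(str(v)) for k, v in params.items()}
    return tmpl.substitute(safe)
```

Because the bucket depends only on the user and template ids, a retry or a second device render lands in the same variant, which keeps the experiment's exposure logs consistent.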

8 · Failure modes & runbook

| Failure | Symptom | Mitigation |
|---|---|---|
| Provider outage (e.g., SES) | Email queue grows; bounce/throttle errors spike | Failover to secondary provider (SendGrid). Producer-pushed circuit breaker; degraded-channel banner in admin UI. |
| APNs token-flood | FCM/APNs return rate-limit; throughput drops | Per-app rate-limit at the adapter; queue depth alarm; back off. |
| Producer flood (event storm) | Notification API saturating | Per-producer rate limit at the API; coalescing absorbs more; non-transactional category pushed to longer windows. |
| Idempotency-key collision | "This notification was already sent", but it was actually a real second event | Idempotency keys must be globally unique per producer; namespace by producer. Audit when collisions detected. |
| Suppression list query slow | Dispatcher latency rises | Bloom filter cached at the adapter; full lookup only when the filter reports a possible match. |
| Bad template deployed | 50% of recipients receive broken email | Template canary deployment; auto-rollback on bounce-rate spike. Feature-flag every template change. |
| Cost runaway | SMS bill spikes 10× | Per-category daily budget alarm; auto-disable channel above threshold; on-call paged. |
| Region partition | Cross-region notifications queue | Region-pinned producer/dispatcher; cross-region drains when partition heals; recipients see notifications in order on arrival. |
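
For the slow suppression lookup, a Bloom filter in front of the table might look like the sketch below. The filter size, hash count, and the `full_lookup` callback are assumptions for illustration. The property that matters is that a Bloom filter has no false negatives: a miss is definitive, and only a possible hit needs the authoritative table.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter; no false negatives, tunable false-positive rate."""

    def __init__(self, size_bits: int = 1 << 20, hashes: int = 7):
        self.size = size_bits
        self.hashes = hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        # Derive k bit positions by salting the item with the hash index.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


def is_suppressed(identifier: str, bloom: BloomFilter, full_lookup) -> bool:
    """Skip the table on a definite miss; confirm possible hits against it."""
    if not bloom.might_contain(identifier):
        return False
    return full_lookup(identifier)
```

The filter has to be refreshed as new suppressions arrive; a stale filter would miss recent bounces, so the rebuild interval should sit inside the bounce-to-suppression latency target.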

9 · Cost & SLOs

| Line | Estimate | Note |
|---|---|---|
| Notification API + dispatcher (200 pods) | ~$8K / month | Stateless; auto-scaled |
| Render svc (50 pods) | ~$3K / month | Template fetch + interpolation |
| Channel adapters (3 fleets, 100 pods each) | ~$10K / month | One per channel |
| Kafka (3-region) | ~$10K / month | Per-channel partition keys |
| Postgres + Cassandra (notifications + prefs) | ~$15K / month | 270 TB notifications history; tier after 90 d |
| Email provider (SES) | ~$4.2M / month | 1.4B sends/day × 30 × $0.0001 |
| SMS provider (Twilio) | ~$15M / month | 100M sends/day × 30 × $0.005; the dominant line |
| Push (APNs/FCM) | ~$0 | Free at our volume |
| Total ex-SMS | ~$4.25M / month | Add SMS for $15M unless aggressively gated |

SLOs

  • Transactional P99: 5 s end-to-end. API → Kafka → dispatcher → adapter → provider → user device. The provider hop takes the largest share of that latency budget.
  • Bulk P99: 30 minutes. Coalescing + scheduled delivery; tighter for the digest jobs.
  • Delivery rate: ≥ 95% per channel. Lower than 95% triggers reputation review.
  • Bounce-to-suppression latency: P99 ≤ 60 s. Bounces must propagate before the next dispatch fires.
  • Unsubscribe-to-stop: P99 ≤ 5 minutes. Regulatory; not an SLO we can negotiate.

10 · Trade-offs & "what would you change at 10×"

| If… | Then… |
|---|---|
| 10× volume (20B / day) | Region-pin everything; per-region provider pools; the SMS bill becomes the strategic question (push more transactional flows to push/email, gate marketing harder). |
| Real-time digest (sub-minute) | Move from cron-driven to event-driven coalescing windows; the Redis ZADD pattern above scales horizontally with consistent hashing on user_id. |
| Notification preferences by ML | Per-user model predicting "will they engage if we send this in this channel". Drops volume 30–50% at the cost of an inference dependency on the hot path. |
| End-to-end encrypted notifications | The push payload becomes a "fetch this when you open the app" tombstone. Render-svc on the device. Lose lock-screen previews; gain real privacy. |
| Self-hosted email/SMS | Reasonable above ~10B emails/year for cost; never reasonable for SMS (telco licensing). The hidden cost is sender reputation; buy it with a managed provider for the first decade. |
| "What would a more senior answer add?" | The strategy layer: cross-channel attribution, frequency capping across the entire user surface, the GDPR/CCPA data-deletion automation, the regulator audit trail. Plus the integration with the growth team's experimentation platform: most designs like this one leave A/B as a TODO; the next-level design treats the experiment platform as a first-class peer. |

Further reading

  • Slack Engineering — "Reducing Slack's memory footprint". Tangential, but the section on push-fan-out is widely cited.
  • Pinterest Engineering — "Building Pinterest's notification platform". The clearest public writeup of a multi-channel pipeline at scale; the abstraction layers map almost 1:1 onto this design.
  • Lyft Engineering — "Asynchronous notifications at Lyft". The Kafka-fan-out architecture in production; useful operational details.
  • Twilio — "Programmable Messaging API". Read the SMS API docs; the failure modes and cost model are foundational.
  • Amazon SES — "Best practices for sending email". The bounce/complaint handling section is required reading for any email-sending system.
  • FTC — "CAN-SPAM Act: A Compliance Guide for Business". Short. The mistake-cost is real — read it.
  • Adjacent: Chat. Push-when-offline plumbing overlaps significantly with chat's offline path.
  • Adjacent: Message queues. The Kafka layer.
  • Adjacent: Idempotency. The producer-side pattern.