05 / 19

Playbook / 05

Chat & direct messaging

WhatsApp, Slack DM, Messenger, Discord — same skeleton. The shape that makes chat distinctive: tens of millions of concurrent persistent connections that the rest of the playbook never has to deal with. The interesting choices are how those connections terminate, how a sender finds a recipient, and what "delivered" actually means in the presence of network failure. Walk this once and you'll have answered questions that recur in any real-time-event design.

1 · Clarifying questions

User scale?	1B MAU, 200M DAU, 10M concurrent. The concurrency number drives the connection-layer design more than anything else.
DM only or groups too?	Both. 1:1, small groups (≤ 100), large groups (≤ 5,000). Bigger than 5K becomes a broadcast/feed problem.
Message rate?	50B messages/day across the system → ~580K msgs/s peak.
Average message size?	200 B text-only, ~5 KB with metadata. Media (images, voice) handled by separate object-storage subsystem.
Delivery guarantees?	At-least-once with idempotent dedupe at the client. Ordering preserved per conversation.
Read receipts & typing?	Yes — but as ephemeral signals, not durably stored. Best-effort delivery.
Push when offline?	Yes — APNs (iOS), FCM (Android), web push. Critical for engagement.
End-to-end encryption?	Out-of-band Signal-protocol-style. Server stores ciphertext only; metadata visible. Worth a separate page.
Latency budget?	P99 send-to-deliver ≤ 200 ms when both online. ≤ 5 s for push delivery to offline devices.
Multi-region?	Yes. Region-pinned conversations; cross-region for DMs spanning regions.

2 · Capacity math, on a napkin

Number	Calculation	Result
Concurrent connections	given	10M
Connections per gateway pod	~50K (good Linux + Go + epoll)	200 pods needed
RAM per connection	~10 KB (TLS state + buffer + user state)	~500 MB / pod
Send QPS (avg / peak)	50B / 86,400 × 3	~580K / ~1.7M
Fan-out amplification	1.2× (small DM bias) to 10× (groups)	~700K avg deliveries / s
Storage / message	~500 B (text + meta + indexing overhead)	~500 B
Storage / day	50B × 500 B	~25 TB / day
Storage / year	×365	~9 PB
Storage / 7 yr (with tiering)	cold tier after 90 d → ×2 effective	~63 PB compressed
Push fan-out (offline)	~30% of msgs → APNs/FCM	~175K push/s peak

The 10M concurrent connections is the architecture-defining number. Everything else flows from it: 200 gateway pods, ~$3M/month in compute alone, the per-connection budget that determines what state a pod can hold per user.

3 · API and data model

WebSocket frames (the hot path)

# After WS handshake + auth, every frame is a small JSON or Protobuf
# Use Protobuf for production. JSON shown here for readability.

# Send
→ { "op": "send", "msg_id": "<client-uuid>", "conv": "c_abc",
 "body": "...", "media": [...] }
# Server ack
← { "op": "ack", "msg_id": "<client-uuid>", "server_id": "m_x9Q2", "ts": 1746... }

# Inbound delivery
← { "op": "msg", "server_id": "m_x9Q2", "conv": "c_abc",
 "from": "u_alice", "body": "...", "ts": 1746... }
# Client ack (delivered)
→ { "op": "delivered", "server_id": "m_x9Q2" }
# Client read
→ { "op": "read", "server_id": "m_x9Q2" }

# Presence / typing — best effort
→ { "op": "typing", "conv": "c_abc" }
← { "op": "presence", "user": "u_bob", "state": "online" }

REST endpoints (cold-path supporting)

POST /v1/conversations # create DM or group
GET /v1/conversations # list user's conversations (recent)
GET /v1/conversations/:id/messages # paginated history (cursor-based)
POST /v1/conversations/:id/members # group: add members
DELETE /v1/conversations/:id/members/:uid

Storage

messages -- sharded by conversation_id
 conversation_id BIGINT
 message_id BIGINT (Snowflake) -- time-ordered
 sender_id BIGINT
 body BYTES -- Protobuf or ciphertext
 media_refs JSONB
 client_msg_id UUID -- dedupe key
 PRIMARY KEY ((conversation_id), message_id DESC)
 -- Cassandra-style: partition by conv, cluster by time DESC
 -- gives O(1) "newest N messages in conversation"

conversations
 conversation_id BIGINT PRIMARY KEY
 type VARCHAR(8) -- dm | group_small | group_large
 created_at TIMESTAMP
 members SET<BIGINT> -- inline if ≤ 100
 members_count INTEGER -- denormalized
 last_message_id BIGINT -- for "recent conversations"
 meta JSONB -- name, icon (groups)

membership -- reverse index
 user_id BIGINT
 conversation_id BIGINT
 joined_at TIMESTAMP
 last_read_id BIGINT
 muted BOOLEAN
 PRIMARY KEY ((user_id), conversation_id)

connection_registry -- in Redis, NOT durable
 user_id → set<gateway_pod_id, ws_connection_id, last_seen>

4 · High-level architecture

Three layers below the WebSocket gateway. Connection registry (who's connected where), message router (look up recipient → forward to their gateway), and the async lane (Kafka → push to APNs/FCM, write to messages KV). Group fan-out runs through the message bus rather than the connection registry.

5 · The hard part — routing a message to the recipient

Sender and receiver are usually on different gateway pods. The message-router has to find the right pod for the recipient and forward the frame.

Step	What happens
1	Alice's gateway receives the WS frame, validates auth and rate limit.
2	Gateway writes the message durably (Kafka with partition by `conversation_id`) and acks the sender.
3	Message-router consumes from Kafka, looks up the conversation's members in the membership store.
4	For each member, look up `connection_registry[user_id]` — list of gateway pods where that user has live WS connections.
5	Forward the frame to each of those pods (RPC over gRPC), which writes it onto the right WebSocket.
6	For members with no live connection: enqueue a push notification (APNs / FCM).
7	On client receipt, the gateway sends a "delivered" event back to the message router (durable receipt log).

Why Kafka in the middle. Two reasons. First, durability: a sender ack means the message is in Kafka, not just in the gateway's memory — so a gateway crash doesn't lose the message. Second, ordering: per-conversation Kafka partition means messages are delivered in send-order, full stop. Skip Kafka and you need to rebuild both properties from scratch.

6 · Connection lifecycle

Persistent WebSockets are a different operational beast than HTTP. The lifecycle determines what gets paged and when.

Sticky sessions at the L4 LB. WebSocket reconnects should land back on a healthy pod, but not necessarily the same one. The connection-registry is the source of truth.
Heartbeat and idle timeout. Client pings every 30 s; server tears down at 90 s without a ping. Mobile flake-rate is the dominant cause of "presence flap" — model it explicitly.
Graceful drain. Gateway pod restart: stop accepting new connections, wait 60 s, then close existing connections with a 1012 (Service Restart). Clients reconnect to a new pod within seconds.
Reconnect with backfill. On reconnect, client sends its last_message_id; server replays anything newer from Kafka or messages table. Bounded — a 24-hour gap means the client gets the most-recent N and a "load older" pagination link.
Multi-device. One user can have 5+ active connections (phone, tablet, web). The connection registry is a set, not a single value. Every device gets the message.

7 · Group fan-out

Three regimes:

Group size	Strategy	Storage
DM (2)	Direct route to the recipient's pod.	Single message row.
Small group (≤ 100)	Members list inline in conversation row. Router fans out to each member synchronously via Kafka.	Single message row + member list.
Large group (≤ 5K)	Members in separate `membership` table; fan-out worker batches recipients per gateway pod (often dozens of users on the same pod).	Single message row; per-user receipts written async.
Channels (> 5K)	Different design. Move to a Slack-style "channel feed" — closer to news-feed than chat. Out of scope for this page.	—

The pod-batching trick. A 1,000-member group with members spread across 50 gateway pods needs 50 RPCs, not 1,000. The router groups recipients by pod and sends one batch RPC per destination — dropping fan-out cost roughly 20×.

8 · Delivery semantics & dedupe

At-least-once with idempotent dedupe at the client. Every message has a client_msg_id (UUID) generated at send-time; the storage layer treats that as a unique key, so a retry produces no second insert.

Sender resends if it doesn't see a server ack within 3 s. Same client_msg_id; the messages table treats it as a duplicate write (UPSERT-style).
Recipient deduplicates on server_id at the client layer. The client app's local DB is the source of truth for "have I shown this".
Read receipts & presence are best-effort. No durability; a lost typing-event isn't worth a retry.
Order preserved per conversation. Kafka partition-per-conversation gives this for free.
Cross-device sync. Each device holds its own last_message_id per conversation; on reconnect, gateway replays from that mark.

9 · Failure modes & runbook

Failure	Symptom	Mitigation
Gateway pod crash	50K connections drop simultaneously	Clients auto-reconnect within 5 s; LB routes to fresh pods. Connection registry is rebuilt from new connections; no message loss because Kafka holds them.
Connection-registry (Redis) shard offline	Routing decisions for affected user range fail	In-memory pod-level cache absorbs the spike; recipient appears "offline" briefly so messages route to push. Recovery lag < 60 s.
Kafka broker dead	Per-partition writes fail until ISR rebuilds	Replication factor 3, ISR ≥ 2 enforced; producer retries with exponential backoff. Recipient sees a 1–3 s delivery delay.
APNs/FCM throttle	Push backlog grows on the worker	Per-provider token-bucket rate limit at the worker; spillover to email-summary if backlog > 5 min.
Hot conversation (a celebrity DMs everyone)	One Kafka partition saturating	Sub-partition by `(conversation, message_id mod K)` for known-large groups; trade ordering scope (per-sub-partition only).
Region partition	Cross-region DMs queued	Region-local writes always work; cross-region messages drain when partition heals. Recipient sees them in order on arrival.
"Storm of typing events"	Presence svc CPU saturating	Throttle per-user typing emissions to 1/s. Drop on overflow — the cost of dropping a typing event is zero.

10 · Cost & SLOs

Line	Estimate	Note
Gateway pods (200 × 4 vCPU × 8 GB)	~$30K / month	Reserved Linux + Go fleet
Message router (50 pods)	~$5K / month	Stateless
Kafka (per-region cluster)	~$15K / month	Replication × 3, retention 7 d
Cassandra (9 PB / yr × tiering)	~$80K / month	The dominant storage cost; cold tier dropping after 90 d
Redis (connection registry, presence)	~$10K / month	Sharded cluster across AZs
Push providers (APNs/FCM)	~$3K / month	Per-message billed
Total (per region)	~$143K / month	Two regions: ~$280K

SLOs

Send-to-deliver P99: 200 ms when both online. Edge → gateway → Kafka → router → recipient gateway → recipient WS.
Send-to-push P99: 5 s offline. APNs/FCM is the dominant variable.
Connection availability: 99.99%. Reconnect-within-10 s after pod failure considered "connected".
Message durability: 99.999%. Cassandra RF=3 + Kafka RF=3 inside the partition gives this.

11 · Trade-offs & "what would you change at 10×"

If…	Then…
10× concurrent (100M)	Move from 50K connections/pod to 200K with eBPF-based connection multiplexing. Different gateway architecture (per-CPU socket sharding).
End-to-end encryption	Server stores ciphertext only. Move push payloads to encrypted-blob references — APNs/FCM ferries metadata only. Adds a key-exchange dance (X3DH / Double Ratchet); complicates multi-device.
Strict global ordering across conversations	You don't want this. Force-pick per-conversation ordering only. Global ordering is what made early IRC servers fragile.
Voice / video calls	Different protocol stack (WebRTC, SFU). Reuse the signalling channel (WS over the same gateway) but the media plane is its own infrastructure.
Search across all conversations	Per-user inverted index built async from the message stream. Elasticsearch or Vespa per region. End-to-end encrypted variants get harder; client-side index is the way out.
"What would a more senior answer add?"	The compliance pipeline — message retention by jurisdiction, legal hold, takedown, the auditable log. Plus the operations: rolling upgrades of stateful gateway pods, the connection-drain dance, the regional-isolation drill that proves the failover works.

12 · Defending the decisions — interview pushback

Most senior interviewers don't ask "what would you build" — they push back on each choice and watch how cleanly you defend it. These are the five exchanges that recur for chat designs; have a one-sentence answer ready for each.

"Why WebSocket and not HTTP long-polling?" Bidirectional, low overhead, no repeated headers on every poll cycle. Long polling is a firewall-fallback, not a first choice for a system designed from scratch.
"Why Cassandra and not Postgres for messages?" The write rate (580K/s) and the access pattern (latest N per conversation) are exactly what Cassandra's clustering keys optimise for. Postgres would need a write-optimised partition strategy you'd have to build yourself.
"Why two Cassandra tables (messages-by-conv, messages-by-user)?" Conflicting read patterns. History view reads by conversation_id; offline sync and push read by user_id. Two tables, same data, different partition key — both patterns serve without a scatter-gather.
"Why Kafka in the middle and not a direct gRPC fan-out?" Durability (sender ack means it's in Kafka, not a gateway's RAM) and ordering (per-conversation partition is in-order for free). Skip Kafka and you rebuild both properties from scratch.
"What about offline sync — how do you avoid replaying a million messages?" Each device tracks last_message_id per conversation. On reconnect, the gateway replays from that mark only, with a hard cap (most-recent 200, then paginate). Design this on day one; retrofitting it is painful.

Chat & direct messaging

1 · Clarifying questions

2 · Capacity math, on a napkin

3 · API and data model

WebSocket frames (the hot path)

REST endpoints (cold-path supporting)

Storage

4 · High-level architecture

5 · The hard part — routing a message to the recipient

6 · Connection lifecycle

7 · Group fan-out

8 · Delivery semantics & dedupe

9 · Failure modes & runbook

10 · Cost & SLOs

SLOs

11 · Trade-offs & "what would you change at 10×"

12 · Defending the decisions — interview pushback

Further reading

Distributed rate limiter