05 / 19
Playbook / 05

Chat & direct messaging

WhatsApp, Slack DM, Messenger, Discord — same skeleton. The shape that makes chat distinctive: tens of millions of concurrent persistent connections that the rest of the playbook never has to deal with. The interesting choices are how those connections terminate, how a sender finds a recipient, and what "delivered" actually means in the presence of network failure. Walk this once and you'll have answered questions that recur in any real-time-event design.


1 · Clarifying questions

User scale?1B MAU, 200M DAU, 10M concurrent. The concurrency number drives the connection-layer design more than anything else.
DM only or groups too?Both. 1:1, small groups (≤ 100), large groups (≤ 5,000). Bigger than 5K becomes a broadcast/feed problem.
Message rate?50B messages/day across the system → ~580K msgs/s peak.
Average message size?200 B text-only, ~5 KB with metadata. Media (images, voice) handled by separate object-storage subsystem.
Delivery guarantees?At-least-once with idempotent dedupe at the client. Ordering preserved per conversation.
Read receipts & typing?Yes — but as ephemeral signals, not durably stored. Best-effort delivery.
Push when offline?Yes — APNs (iOS), FCM (Android), web push. Critical for engagement.
End-to-end encryption?Out-of-band Signal-protocol-style. Server stores ciphertext only; metadata visible. Worth a separate page.
Latency budget?P99 send-to-deliver ≤ 200 ms when both online. ≤ 5 s for push delivery to offline devices.
Multi-region?Yes. Region-pinned conversations; cross-region for DMs spanning regions.

2 · Capacity math, on a napkin

NumberCalculationResult
Concurrent connectionsgiven10M
Connections per gateway pod~50K (good Linux + Go + epoll)200 pods needed
RAM per connection~10 KB (TLS state + buffer + user state)~500 MB / pod
Send QPS (avg / peak)50B / 86,400 × 3~580K / ~1.7M
Fan-out amplification1.2× (small DM bias) to 10× (groups)~700K avg deliveries / s
Storage / message~500 B (text + meta + indexing overhead)~500 B
Storage / day50B × 500 B~25 TB / day
Storage / year×365~9 PB
Storage / 7 yr (with tiering)cold tier after 90 d → ×2 effective~63 PB compressed
Push fan-out (offline)~30% of msgs → APNs/FCM~175K push/s peak

The 10M concurrent connections is the architecture-defining number. Everything else flows from it: 200 gateway pods, ~$3M/month in compute alone, the per-connection budget that determines what state a pod can hold per user.

3 · API and data model

WebSocket frames (the hot path)

# After WS handshake + auth, every frame is a small JSON or Protobuf
# Use Protobuf for production. JSON shown here for readability.

# Send
→ { "op": "send", "msg_id": "<client-uuid>", "conv": "c_abc",
 "body": "...", "media": [...] }
# Server ack
← { "op": "ack", "msg_id": "<client-uuid>", "server_id": "m_x9Q2", "ts": 1746... }

# Inbound delivery
← { "op": "msg", "server_id": "m_x9Q2", "conv": "c_abc",
 "from": "u_alice", "body": "...", "ts": 1746... }
# Client ack (delivered)
→ { "op": "delivered", "server_id": "m_x9Q2" }
# Client read
→ { "op": "read", "server_id": "m_x9Q2" }

# Presence / typing — best effort
→ { "op": "typing", "conv": "c_abc" }
← { "op": "presence", "user": "u_bob", "state": "online" }

REST endpoints (cold-path supporting)

POST /v1/conversations # create DM or group
GET /v1/conversations # list user's conversations (recent)
GET /v1/conversations/:id/messages # paginated history (cursor-based)
POST /v1/conversations/:id/members # group: add members
DELETE /v1/conversations/:id/members/:uid

Storage

messages -- sharded by conversation_id
 conversation_id BIGINT
 message_id BIGINT (Snowflake) -- time-ordered
 sender_id BIGINT
 body BYTES -- Protobuf or ciphertext
 media_refs JSONB
 client_msg_id UUID -- dedupe key
 PRIMARY KEY ((conversation_id), message_id DESC)
 -- Cassandra-style: partition by conv, cluster by time DESC
 -- gives O(1) "newest N messages in conversation"

conversations
 conversation_id BIGINT PRIMARY KEY
 type VARCHAR(8) -- dm | group_small | group_large
 created_at TIMESTAMP
 members SET<BIGINT> -- inline if ≤ 100
 members_count INTEGER -- denormalized
 last_message_id BIGINT -- for "recent conversations"
 meta JSONB -- name, icon (groups)

membership -- reverse index
 user_id BIGINT
 conversation_id BIGINT
 joined_at TIMESTAMP
 last_read_id BIGINT
 muted BOOLEAN
 PRIMARY KEY ((user_id), conversation_id)

connection_registry -- in Redis, NOT durable
 user_id → set<gateway_pod_id, ws_connection_id, last_seen>

4 · High-level architecture

Three layers below the WebSocket gateway. Connection registry (who's connected where), message router (look up recipient → forward to their gateway), and the async lane (Kafka → push to APNs/FCM, write to messages KV). Group fan-out runs through the message bus rather than the connection registry.

5 · The hard part — routing a message to the recipient

Sender and receiver are usually on different gateway pods. The message-router has to find the right pod for the recipient and forward the frame.

StepWhat happens
1Alice's gateway receives the WS frame, validates auth and rate limit.
2Gateway writes the message durably (Kafka with partition by conversation_id) and acks the sender.
3Message-router consumes from Kafka, looks up the conversation's members in the membership store.
4For each member, look up connection_registry[user_id] — list of gateway pods where that user has live WS connections.
5Forward the frame to each of those pods (RPC over gRPC), which writes it onto the right WebSocket.
6For members with no live connection: enqueue a push notification (APNs / FCM).
7On client receipt, the gateway sends a "delivered" event back to the message router (durable receipt log).
Why Kafka in the middle. Two reasons. First, durability: a sender ack means the message is in Kafka, not just in the gateway's memory — so a gateway crash doesn't lose the message. Second, ordering: per-conversation Kafka partition means messages are delivered in send-order, full stop. Skip Kafka and you need to rebuild both properties from scratch.

6 · Connection lifecycle

Persistent WebSockets are a different operational beast than HTTP. The lifecycle determines what gets paged and when.

  • Sticky sessions at the L4 LB. WebSocket reconnects should land back on a healthy pod, but not necessarily the same one. The connection-registry is the source of truth.
  • Heartbeat and idle timeout. Client pings every 30 s; server tears down at 90 s without a ping. Mobile flake-rate is the dominant cause of "presence flap" — model it explicitly.
  • Graceful drain. Gateway pod restart: stop accepting new connections, wait 60 s, then close existing connections with a 1012 (Service Restart). Clients reconnect to a new pod within seconds.
  • Reconnect with backfill. On reconnect, client sends its last_message_id; server replays anything newer from Kafka or messages table. Bounded — a 24-hour gap means the client gets the most-recent N and a "load older" pagination link.
  • Multi-device. One user can have 5+ active connections (phone, tablet, web). The connection registry is a set, not a single value. Every device gets the message.

7 · Group fan-out

Three regimes:

Group sizeStrategyStorage
DM (2)Direct route to the recipient's pod.Single message row.
Small group (≤ 100)Members list inline in conversation row. Router fans out to each member synchronously via Kafka.Single message row + member list.
Large group (≤ 5K)Members in separate membership table; fan-out worker batches recipients per gateway pod (often dozens of users on the same pod).Single message row; per-user receipts written async.
Channels (> 5K)Different design. Move to a Slack-style "channel feed" — closer to news-feed than chat. Out of scope for this page.
The pod-batching trick. A 1,000-member group with members spread across 50 gateway pods needs 50 RPCs, not 1,000. The router groups recipients by pod and sends one batch RPC per destination — dropping fan-out cost roughly 20×.

8 · Delivery semantics & dedupe

At-least-once with idempotent dedupe at the client. Every message has a client_msg_id (UUID) generated at send-time; the storage layer treats that as a unique key, so a retry produces no second insert.

  1. Sender resends if it doesn't see a server ack within 3 s. Same client_msg_id; the messages table treats it as a duplicate write (UPSERT-style).
  2. Recipient deduplicates on server_id at the client layer. The client app's local DB is the source of truth for "have I shown this".
  3. Read receipts & presence are best-effort. No durability; a lost typing-event isn't worth a retry.
  4. Order preserved per conversation. Kafka partition-per-conversation gives this for free.
  5. Cross-device sync. Each device holds its own last_message_id per conversation; on reconnect, gateway replays from that mark.

9 · Failure modes & runbook

FailureSymptomMitigation
Gateway pod crash50K connections drop simultaneouslyClients auto-reconnect within 5 s; LB routes to fresh pods. Connection registry is rebuilt from new connections; no message loss because Kafka holds them.
Connection-registry (Redis) shard offlineRouting decisions for affected user range failIn-memory pod-level cache absorbs the spike; recipient appears "offline" briefly so messages route to push. Recovery lag < 60 s.
Kafka broker deadPer-partition writes fail until ISR rebuildsReplication factor 3, ISR ≥ 2 enforced; producer retries with exponential backoff. Recipient sees a 1–3 s delivery delay.
APNs/FCM throttlePush backlog grows on the workerPer-provider token-bucket rate limit at the worker; spillover to email-summary if backlog > 5 min.
Hot conversation (a celebrity DMs everyone)One Kafka partition saturatingSub-partition by (conversation, message_id mod K) for known-large groups; trade ordering scope (per-sub-partition only).
Region partitionCross-region DMs queuedRegion-local writes always work; cross-region messages drain when partition heals. Recipient sees them in order on arrival.
"Storm of typing events"Presence svc CPU saturatingThrottle per-user typing emissions to 1/s. Drop on overflow — the cost of dropping a typing event is zero.

10 · Cost & SLOs

LineEstimateNote
Gateway pods (200 × 4 vCPU × 8 GB)~$30K / monthReserved Linux + Go fleet
Message router (50 pods)~$5K / monthStateless
Kafka (per-region cluster)~$15K / monthReplication × 3, retention 7 d
Cassandra (9 PB / yr × tiering)~$80K / monthThe dominant storage cost; cold tier dropping after 90 d
Redis (connection registry, presence)~$10K / monthSharded cluster across AZs
Push providers (APNs/FCM)~$3K / monthPer-message billed
Total (per region)~$143K / monthTwo regions: ~$280K

SLOs

  • Send-to-deliver P99: 200 ms when both online. Edge → gateway → Kafka → router → recipient gateway → recipient WS.
  • Send-to-push P99: 5 s offline. APNs/FCM is the dominant variable.
  • Connection availability: 99.99%. Reconnect-within-10 s after pod failure considered "connected".
  • Message durability: 99.999%. Cassandra RF=3 + Kafka RF=3 inside the partition gives this.

11 · Trade-offs & "what would you change at 10×"

If…Then…
10× concurrent (100M)Move from 50K connections/pod to 200K with eBPF-based connection multiplexing. Different gateway architecture (per-CPU socket sharding).
End-to-end encryptionServer stores ciphertext only. Move push payloads to encrypted-blob references — APNs/FCM ferries metadata only. Adds a key-exchange dance (X3DH / Double Ratchet); complicates multi-device.
Strict global ordering across conversationsYou don't want this. Force-pick per-conversation ordering only. Global ordering is what made early IRC servers fragile.
Voice / video callsDifferent protocol stack (WebRTC, SFU). Reuse the signalling channel (WS over the same gateway) but the media plane is its own infrastructure.
Search across all conversationsPer-user inverted index built async from the message stream. Elasticsearch or Vespa per region. End-to-end encrypted variants get harder; client-side index is the way out.
"What would a more senior answer add?"The compliance pipeline — message retention by jurisdiction, legal hold, takedown, the auditable log. Plus the operations: rolling upgrades of stateful gateway pods, the connection-drain dance, the regional-isolation drill that proves the failover works.

12 · Defending the decisions — interview pushback

Most senior interviewers don't ask "what would you build" — they push back on each choice and watch how cleanly you defend it. These are the five exchanges that recur for chat designs; have a one-sentence answer ready for each.

  1. "Why WebSocket and not HTTP long-polling?" Bidirectional, low overhead, no repeated headers on every poll cycle. Long polling is a firewall-fallback, not a first choice for a system designed from scratch.
  2. "Why Cassandra and not Postgres for messages?" The write rate (580K/s) and the access pattern (latest N per conversation) are exactly what Cassandra's clustering keys optimise for. Postgres would need a write-optimised partition strategy you'd have to build yourself.
  3. "Why two Cassandra tables (messages-by-conv, messages-by-user)?" Conflicting read patterns. History view reads by conversation_id; offline sync and push read by user_id. Two tables, same data, different partition key — both patterns serve without a scatter-gather.
  4. "Why Kafka in the middle and not a direct gRPC fan-out?" Durability (sender ack means it's in Kafka, not a gateway's RAM) and ordering (per-conversation partition is in-order for free). Skip Kafka and you rebuild both properties from scratch.
  5. "What about offline sync — how do you avoid replaying a million messages?" Each device tracks last_message_id per conversation. On reconnect, the gateway replays from that mark only, with a hard cap (most-recent 200, then paginate). Design this on day one; retrofitting it is painful.

Further reading

  • Slack Engineering — "Real Time Messaging". Series of posts on Slack's WebSocket gateway architecture. The "flannel" service is the routing layer described above.
  • Discord Engineering — "How Discord Stores Trillions of Messages". Cassandra-shaped storage at scale; the per-conversation partition pattern is the core lesson.
  • Facebook Engineering — "Building Mobile-First Infrastructure for Messenger". The MQTT-based predecessor; useful as a counter-design.
  • WhatsApp Engineering — "1 Million is Just the Beginning". Erlang and the BEAM VM as a connection multiplexer. Different stack, same problem.
  • Signal — "Sealed Sender". The end-to-end encryption design that pushed the state of the art.
  • Adjacent: TLS. The transport that secures every WebSocket frame.
  • Adjacent: WebSockets. The protocol underneath the gateway.
  • Adjacent: Message queues. Why Kafka is the right shape for the durable middle.
Found this useful?