Chat & direct messaging
WhatsApp, Slack DM, Messenger, Discord — same skeleton. The shape that makes chat distinctive: tens of millions of concurrent persistent connections that the rest of the playbook never has to deal with. The interesting choices are how those connections terminate, how a sender finds a recipient, and what "delivered" actually means in the presence of network failure. Walk this once and you'll have answered questions that recur in any real-time-event design.
1 · Clarifying questions
| User scale? | 1B MAU, 200M DAU, 10M concurrent. The concurrency number drives the connection-layer design more than anything else. |
| DM only or groups too? | Both. 1:1, small groups (≤ 100), large groups (≤ 5,000). Bigger than 5K becomes a broadcast/feed problem. |
| Message rate? | 50B messages/day across the system → ~580K msgs/s peak. |
| Average message size? | 200 B text-only, ~5 KB with metadata. Media (images, voice) handled by separate object-storage subsystem. |
| Delivery guarantees? | At-least-once with idempotent dedupe at the client. Ordering preserved per conversation. |
| Read receipts & typing? | Yes — but as ephemeral signals, not durably stored. Best-effort delivery. |
| Push when offline? | Yes — APNs (iOS), FCM (Android), web push. Critical for engagement. |
| End-to-end encryption? | Out-of-band Signal-protocol-style. Server stores ciphertext only; metadata visible. Worth a separate page. |
| Latency budget? | P99 send-to-deliver ≤ 200 ms when both online. ≤ 5 s for push delivery to offline devices. |
| Multi-region? | Yes. Region-pinned conversations; cross-region for DMs spanning regions. |
2 · Capacity math, on a napkin
| Number | Calculation | Result |
|---|---|---|
| Concurrent connections | given | 10M |
| Connections per gateway pod | ~50K (good Linux + Go + epoll) | 200 pods needed |
| RAM per connection | ~10 KB (TLS state + buffer + user state) | ~500 MB / pod |
| Send QPS (avg / peak) | 50B / 86,400 × 3 | ~580K / ~1.7M |
| Fan-out amplification | 1.2× (small DM bias) to 10× (groups) | ~700K avg deliveries / s |
| Storage / message | ~500 B (text + meta + indexing overhead) | ~500 B |
| Storage / day | 50B × 500 B | ~25 TB / day |
| Storage / year | ×365 | ~9 PB |
| Storage / 7 yr (with tiering) | cold tier after 90 d → ×2 effective | ~63 PB compressed |
| Push fan-out (offline) | ~30% of msgs → APNs/FCM | ~175K push/s peak |
The 10M concurrent connections is the architecture-defining number. Everything else flows from it: 200 gateway pods, ~$3M/month in compute alone, the per-connection budget that determines what state a pod can hold per user.
3 · API and data model
WebSocket frames (the hot path)
# After WS handshake + auth, every frame is a small JSON or Protobuf
# Use Protobuf for production. JSON shown here for readability.
# Send
→ { "op": "send", "msg_id": "<client-uuid>", "conv": "c_abc",
"body": "...", "media": [...] }
# Server ack
← { "op": "ack", "msg_id": "<client-uuid>", "server_id": "m_x9Q2", "ts": 1746... }
# Inbound delivery
← { "op": "msg", "server_id": "m_x9Q2", "conv": "c_abc",
"from": "u_alice", "body": "...", "ts": 1746... }
# Client ack (delivered)
→ { "op": "delivered", "server_id": "m_x9Q2" }
# Client read
→ { "op": "read", "server_id": "m_x9Q2" }
# Presence / typing — best effort
→ { "op": "typing", "conv": "c_abc" }
← { "op": "presence", "user": "u_bob", "state": "online" }REST endpoints (cold-path supporting)
POST /v1/conversations # create DM or group
GET /v1/conversations # list user's conversations (recent)
GET /v1/conversations/:id/messages # paginated history (cursor-based)
POST /v1/conversations/:id/members # group: add members
DELETE /v1/conversations/:id/members/:uidStorage
messages -- sharded by conversation_id
conversation_id BIGINT
message_id BIGINT (Snowflake) -- time-ordered
sender_id BIGINT
body BYTES -- Protobuf or ciphertext
media_refs JSONB
client_msg_id UUID -- dedupe key
PRIMARY KEY ((conversation_id), message_id DESC)
-- Cassandra-style: partition by conv, cluster by time DESC
-- gives O(1) "newest N messages in conversation"
conversations
conversation_id BIGINT PRIMARY KEY
type VARCHAR(8) -- dm | group_small | group_large
created_at TIMESTAMP
members SET<BIGINT> -- inline if ≤ 100
members_count INTEGER -- denormalized
last_message_id BIGINT -- for "recent conversations"
meta JSONB -- name, icon (groups)
membership -- reverse index
user_id BIGINT
conversation_id BIGINT
joined_at TIMESTAMP
last_read_id BIGINT
muted BOOLEAN
PRIMARY KEY ((user_id), conversation_id)
connection_registry -- in Redis, NOT durable
user_id → set<gateway_pod_id, ws_connection_id, last_seen>4 · High-level architecture
Three layers below the WebSocket gateway. Connection registry (who's connected where), message router (look up recipient → forward to their gateway), and the async lane (Kafka → push to APNs/FCM, write to messages KV). Group fan-out runs through the message bus rather than the connection registry.
5 · The hard part — routing a message to the recipient
Sender and receiver are usually on different gateway pods. The message-router has to find the right pod for the recipient and forward the frame.
| Step | What happens |
|---|---|
| 1 | Alice's gateway receives the WS frame, validates auth and rate limit. |
| 2 | Gateway writes the message durably (Kafka with partition by conversation_id) and acks the sender. |
| 3 | Message-router consumes from Kafka, looks up the conversation's members in the membership store. |
| 4 | For each member, look up connection_registry[user_id] — list of gateway pods where that user has live WS connections. |
| 5 | Forward the frame to each of those pods (RPC over gRPC), which writes it onto the right WebSocket. |
| 6 | For members with no live connection: enqueue a push notification (APNs / FCM). |
| 7 | On client receipt, the gateway sends a "delivered" event back to the message router (durable receipt log). |
6 · Connection lifecycle
Persistent WebSockets are a different operational beast than HTTP. The lifecycle determines what gets paged and when.
- Sticky sessions at the L4 LB. WebSocket reconnects should land back on a healthy pod, but not necessarily the same one. The connection-registry is the source of truth.
- Heartbeat and idle timeout. Client pings every 30 s; server tears down at 90 s without a ping. Mobile flake-rate is the dominant cause of "presence flap" — model it explicitly.
- Graceful drain. Gateway pod restart: stop accepting new connections, wait 60 s, then close existing connections with a 1012 (Service Restart). Clients reconnect to a new pod within seconds.
- Reconnect with backfill. On reconnect, client sends its
last_message_id; server replays anything newer from Kafka or messages table. Bounded — a 24-hour gap means the client gets the most-recent N and a "load older" pagination link. - Multi-device. One user can have 5+ active connections (phone, tablet, web). The connection registry is a set, not a single value. Every device gets the message.
7 · Group fan-out
Three regimes:
| Group size | Strategy | Storage |
|---|---|---|
| DM (2) | Direct route to the recipient's pod. | Single message row. |
| Small group (≤ 100) | Members list inline in conversation row. Router fans out to each member synchronously via Kafka. | Single message row + member list. |
| Large group (≤ 5K) | Members in separate membership table; fan-out worker batches recipients per gateway pod (often dozens of users on the same pod). | Single message row; per-user receipts written async. |
| Channels (> 5K) | Different design. Move to a Slack-style "channel feed" — closer to news-feed than chat. Out of scope for this page. | — |
8 · Delivery semantics & dedupe
At-least-once with idempotent dedupe at the client. Every message has a
client_msg_id (UUID) generated at send-time; the storage layer treats
that as a unique key, so a retry produces no second insert.
- Sender resends if it doesn't see a server ack within 3 s. Same
client_msg_id; the messages table treats it as a duplicate write (UPSERT-style). - Recipient deduplicates on
server_idat the client layer. The client app's local DB is the source of truth for "have I shown this". - Read receipts & presence are best-effort. No durability; a lost typing-event isn't worth a retry.
- Order preserved per conversation. Kafka partition-per-conversation gives this for free.
- Cross-device sync. Each device holds its own
last_message_idper conversation; on reconnect, gateway replays from that mark.
9 · Failure modes & runbook
| Failure | Symptom | Mitigation |
|---|---|---|
| Gateway pod crash | 50K connections drop simultaneously | Clients auto-reconnect within 5 s; LB routes to fresh pods. Connection registry is rebuilt from new connections; no message loss because Kafka holds them. |
| Connection-registry (Redis) shard offline | Routing decisions for affected user range fail | In-memory pod-level cache absorbs the spike; recipient appears "offline" briefly so messages route to push. Recovery lag < 60 s. |
| Kafka broker dead | Per-partition writes fail until ISR rebuilds | Replication factor 3, ISR ≥ 2 enforced; producer retries with exponential backoff. Recipient sees a 1–3 s delivery delay. |
| APNs/FCM throttle | Push backlog grows on the worker | Per-provider token-bucket rate limit at the worker; spillover to email-summary if backlog > 5 min. |
| Hot conversation (a celebrity DMs everyone) | One Kafka partition saturating | Sub-partition by (conversation, message_id mod K) for known-large groups; trade ordering scope (per-sub-partition only). |
| Region partition | Cross-region DMs queued | Region-local writes always work; cross-region messages drain when partition heals. Recipient sees them in order on arrival. |
| "Storm of typing events" | Presence svc CPU saturating | Throttle per-user typing emissions to 1/s. Drop on overflow — the cost of dropping a typing event is zero. |
10 · Cost & SLOs
| Line | Estimate | Note |
|---|---|---|
| Gateway pods (200 × 4 vCPU × 8 GB) | ~$30K / month | Reserved Linux + Go fleet |
| Message router (50 pods) | ~$5K / month | Stateless |
| Kafka (per-region cluster) | ~$15K / month | Replication × 3, retention 7 d |
| Cassandra (9 PB / yr × tiering) | ~$80K / month | The dominant storage cost; cold tier dropping after 90 d |
| Redis (connection registry, presence) | ~$10K / month | Sharded cluster across AZs |
| Push providers (APNs/FCM) | ~$3K / month | Per-message billed |
| Total (per region) | ~$143K / month | Two regions: ~$280K |
SLOs
- Send-to-deliver P99: 200 ms when both online. Edge → gateway → Kafka → router → recipient gateway → recipient WS.
- Send-to-push P99: 5 s offline. APNs/FCM is the dominant variable.
- Connection availability: 99.99%. Reconnect-within-10 s after pod failure considered "connected".
- Message durability: 99.999%. Cassandra RF=3 + Kafka RF=3 inside the partition gives this.
11 · Trade-offs & "what would you change at 10×"
| If… | Then… |
|---|---|
| 10× concurrent (100M) | Move from 50K connections/pod to 200K with eBPF-based connection multiplexing. Different gateway architecture (per-CPU socket sharding). |
| End-to-end encryption | Server stores ciphertext only. Move push payloads to encrypted-blob references — APNs/FCM ferries metadata only. Adds a key-exchange dance (X3DH / Double Ratchet); complicates multi-device. |
| Strict global ordering across conversations | You don't want this. Force-pick per-conversation ordering only. Global ordering is what made early IRC servers fragile. |
| Voice / video calls | Different protocol stack (WebRTC, SFU). Reuse the signalling channel (WS over the same gateway) but the media plane is its own infrastructure. |
| Search across all conversations | Per-user inverted index built async from the message stream. Elasticsearch or Vespa per region. End-to-end encrypted variants get harder; client-side index is the way out. |
| "What would a more senior answer add?" | The compliance pipeline — message retention by jurisdiction, legal hold, takedown, the auditable log. Plus the operations: rolling upgrades of stateful gateway pods, the connection-drain dance, the regional-isolation drill that proves the failover works. |
12 · Defending the decisions — interview pushback
Most senior interviewers don't ask "what would you build" — they push back on each choice and watch how cleanly you defend it. These are the five exchanges that recur for chat designs; have a one-sentence answer ready for each.
- "Why WebSocket and not HTTP long-polling?" Bidirectional, low overhead, no repeated headers on every poll cycle. Long polling is a firewall-fallback, not a first choice for a system designed from scratch.
- "Why Cassandra and not Postgres for messages?" The write rate (580K/s) and the access pattern (latest N per conversation) are exactly what Cassandra's clustering keys optimise for. Postgres would need a write-optimised partition strategy you'd have to build yourself.
- "Why two Cassandra tables (messages-by-conv, messages-by-user)?" Conflicting read patterns. History view reads by
conversation_id; offline sync and push read byuser_id. Two tables, same data, different partition key — both patterns serve without a scatter-gather. - "Why Kafka in the middle and not a direct gRPC fan-out?" Durability (sender ack means it's in Kafka, not a gateway's RAM) and ordering (per-conversation partition is in-order for free). Skip Kafka and you rebuild both properties from scratch.
- "What about offline sync — how do you avoid replaying a million messages?" Each device tracks
last_message_idper conversation. On reconnect, the gateway replays from that mark only, with a hard cap (most-recent 200, then paginate). Design this on day one; retrofitting it is painful.
Further reading
- Slack Engineering — "Real Time Messaging". Series of posts on Slack's WebSocket gateway architecture. The "flannel" service is the routing layer described above.
- Discord Engineering — "How Discord Stores Trillions of Messages". Cassandra-shaped storage at scale; the per-conversation partition pattern is the core lesson.
- Facebook Engineering — "Building Mobile-First Infrastructure for Messenger". The MQTT-based predecessor; useful as a counter-design.
- WhatsApp Engineering — "1 Million is Just the Beginning". Erlang and the BEAM VM as a connection multiplexer. Different stack, same problem.
- Signal — "Sealed Sender". The end-to-end encryption design that pushed the state of the art.
- Adjacent: TLS. The transport that secures every WebSocket frame.
- Adjacent: WebSockets. The protocol underneath the gateway.
- Adjacent: Message queues. Why Kafka is the right shape for the durable middle.