15 / 19

Playbook / 15

Design Twitter

A 250M-DAU read-heavy social feed. The hard parts aren't the writes — tweeting is rare and cheap. The hard parts are the timeline (sub-100 ms for billions of users), the celebrity problem (one writer with 100M followers should not crash the pipeline), and ranking (chronological is the easy answer, "what you'd want to see" is the real one).

1 · Clarifying questions

Functional scope?	Post a tweet (text + media), follow/unfollow, view home timeline, view user profile timeline, search. No DMs, no spaces, no ads.
Tweet size?	280 characters of text, plus up to 4 images (or one video). Text is small; media dominates storage.
Scale?	250M DAU. ~500M tweets/day (~5K writes/sec average, 25K peak). ~25B timeline reads/day (~290K reads/sec, 1.5M peak). Read:write ratio close to 50:1.
Latency?	Timeline load: P99 ≤ 200 ms. Tweet post: P99 ≤ 1 s (the user sees their own tweet immediately — fan-out can run async).
Ordering?	Mostly reverse-chronological with a ranking layer. Strict "ranked first, time second" is the default.
Consistency?	The user always sees their own tweet immediately (session consistency on writes). Other users see new tweets within seconds (eventual).
Multi-region?	Yes. Region-local writes with async cross-region replication. Reads served from the closest region.

2 · Capacity math, on a napkin

Number	Calculation	Result
DAU	given	250M
Tweets/day	250M × 2 per active user	~500M
Tweet write QPS (avg / peak)	500M / 86,400 × 5	~5K / ~25K
Timeline read QPS (avg / peak)	250M × 100 reads / 86,400 × 5	~290K / ~1.5M
Read:write ratio	—	~50:1
Tweet payload (text + meta)	~500 B (280 chars + ids + timestamps)	~500 B
Daily text storage	500M × 500 B × 3 (repl)	~700 GB/day → ~250 TB/year
Media storage	30% of tweets × 1.5 MB avg × 3	~700 TB/day raw → object storage tier
Follow graph edges	250M × 250 avg follows	~60B edges, ~5 TB
Timeline cache (in-memory)	250M × 800 recent tweet ids × 8 B	~1.5 TB across the timeline tier
Egress bandwidth (peak)	1.5M reads × ~20 tweets × ~600 B avg	~150 Gbps from origin (CDN absorbs most)

The headline: read traffic is 50× write traffic, so the whole architecture is built to keep reads fast and cheap. Tweets are tiny; media is large; the follow graph is enormous; timelines live in RAM.

3 · API and data model

Public API

POST /v1/tweets                       # post a tweet
{ "text": "...", "media_ids": [...], "in_reply_to": null }
→ 201 { "tweet_id": "t_8aB3x9Q2", "created_at": "..." }

GET  /v1/users/:id/timeline            # home timeline (default cursor-paginated)
  ?cursor=t_8aB3x9Q2&limit=50
→ 200 { "tweets": [...], "next_cursor": "..." }

GET  /v1/users/:id/tweets              # profile timeline (user's own posts)
POST /v1/users/:id/follow              # idempotent
DELETE /v1/users/:id/follow

GET  /v1/search?q=...&type=top|latest  # search

Storage

tweets                              -- Manhattan / sharded MySQL by user_id
  tweet_id     BIGINT PRIMARY KEY  -- Snowflake; time-ordered
  user_id      BIGINT              -- author; shard key
  text         VARCHAR(280)
  media_ids    JSONB
  reply_to     BIGINT NULL
  retweet_of   BIGINT NULL
  created_at   TIMESTAMP
  INDEX (user_id, created_at)

follow_graph                        -- adjacency list, sharded by follower
  follower_id  BIGINT
  followee_id  BIGINT
  created_at   TIMESTAMP
  PRIMARY KEY ((follower_id), followee_id)

home_timeline                       -- Redis sorted set per user
  user:{id}:home → ZSET<tweet_id, score>
                   (kept to ~800 ids; trimmed on insert)

user_timeline                       -- Redis sorted set per author
  user:{id}:tweets → ZSET<tweet_id, created_at>
                     (kept to ~3200 ids; older paged from MySQL)

4 · High-level architecture

Tweet service writes the tweet to MySQL and emits a Kafka event. Fan-out workers consume the event, look up the author's followers from the graph store, and push the tweet_id onto each follower's home-timeline ZSET in Redis. Timeline service reads from Redis on demand. The dashed line is the lazy path for celebrities — see §5.

5 · The hard part — fan-out and the celebrity problem

This is the decision the room is testing. Three shapes; one wins.

Fan-out on read (pull)

At read time, for each user the timeline service fetches the latest tweets from every followee, merges them, and returns. Beautifully simple. Falls over the moment a user follows 1,000 people — that's 1,000 fan-out queries per timeline read, multiplied by 1.5M reads/sec.

Fan-out on write (push)

At write time, push the new tweet_id into every follower's pre-computed timeline. Reads are then a single Redis ZRANGE — sub-millisecond. The cost moves to the write path: average write fan-out is ~250 (median follower count). Still cheap, because writes are 50× rarer than reads. Until you hit a celebrity.

The celebrity problem

A user with 100M followers tweets once. Fan-out-on-write means 100M Redis writes for that one tweet. The fan-out worker pool stalls, every other user's tweet sits behind it in the queue, the timeline freshness SLO breaks for everyone.

The fix is a hybrid: fan-out-on-write for the long tail (≤100K followers), and fan-out-on-read for celebrities (>100K followers). At read time, the timeline service does the ZRANGE for the precomputed part and also reads the recent tweets directly from the celebrities the user follows, then merges. Slightly more expensive per read, but the load is bounded — a user follows at most a handful of celebrities, so the read-side merge is small.

# Fan-out worker (long tail)
on event tweet_posted(author_id, tweet_id):
  followers = graph.get_followers(author_id)
  if len(followers) <= CELEBRITY_THRESHOLD:
    for f in followers:
      redis.zadd(f"user:{f}:home", {tweet_id: tweet_id})
      redis.zremrangebyrank(f"user:{f}:home", 0, -801)  # trim to 800
  else:
    # Mark author as celebrity; skip push. Reads will pull.
    redis.sadd("celebrities", author_id)

# Timeline read (hybrid merge)
def get_home(user_id, limit=50):
  precomputed = redis.zrevrange(f"user:{user_id}:home", 0, limit*2)
  celebs_followed = graph.get_followees(user_id) & redis.smembers("celebrities")
  recent_celeb_tweets = []
  for c in celebs_followed:
    recent_celeb_tweets += redis.zrevrange(f"user:{c}:tweets", 0, 20)
  merged = merge_by_score(precomputed, recent_celeb_tweets)
  return hydrate(merged[:limit])

Ranking, briefly

The score in the ZSET is not just timestamp. It's a weighted blend — recency, engagement signals from the author, the user's interaction history with this author, plus a content score from an offline ML pipeline. The ranking service writes the score into the ZSET on insert; reads return things already in the right order. The expensive part is the offline pipeline, not the read path.

6 · Search

Search at this volume is a different system — Earlybird (Twitter's in-house inverted-index engine, Lucene-derived) shards tweets by tweet_id and keeps the last 7 days of tweets in RAM across the cluster. Older tweets live on disk-backed shards, queried only when the user explicitly asks for older results.

Indexer. Consumes the same Kafka tweet topic; writes one document per tweet to the appropriate shard.
Query path. Scatter-gather across all shards in the cluster, top-k merge at the coordinator. With 7 days × ~5K tweets/sec = ~3B documents per shard set, each query touches ~50 shards.
"Top" vs "Latest". Two ranking passes. Latest is sorted by timestamp; Top runs the same ranker as the home timeline against a wider candidate set.

7 · Failure modes & runbook

Failure	Symptom	Mitigation
Celebrity threshold mistuned	Fan-out workers backlog grows; timeline freshness SLO breaks	Auto-tune threshold from observed worker queue depth. Pager fires at >90 s lag.
Hot user shard	One MySQL shard at 100% CPU; profile timeline reads slow	Split the shard. Read replicas absorb spike in the meantime. Aim for <500 GB per shard.
Redis node failure	~1% of users see empty home timelines briefly	Redis cluster with replication; failover under 30 s. Cold-start rebuilds the ZSET from MySQL on demand.
Trending event (election, finals)	Write QPS spikes 10×; fan-out queue backs up	Auto-scale fan-out workers. Temporarily raise celebrity threshold to push more authors to lazy path. Defer ranking pass on the spike.
Region partition	Tweets posted in EU not visible in US for some users	Each region has its own MySQL primary; cross-region replication via Kafka. Tolerated lag in the seconds.
Search index corruption	Search returns missing results for a window	Replay the Kafka tweet topic into the rebuilding shard. SLO is "indexed within 30 s of post"; corruption recovery measured in hours not days.
Spam wave	Bot accounts post millions of tweets per minute	Pre-fan-out spam classifier in the tweet service. Suspect tweets bypass fan-out; held for review. Account-level rate limits at the API.

8 · Cost & SLOs

Line	Estimate	Note
Tweet + timeline services (2K pods)	~$80K/month	Stateless; auto-scaled
Fan-out workers (~500 pods)	~$20K/month	Bursty; sized for peak
MySQL (sharded, ~500 shards)	~$200K/month	Tweets + reply graph
Redis timeline tier (~1.5 TB across cluster)	~$80K/month	In-memory; the read path lives here
Graph store	~$60K/month	60B edges; sharded by follower
Search cluster (Earlybird-shape)	~$150K/month	7 days hot in RAM
Object storage + CDN (media)	~$500K/month	Bandwidth dominates; CDN absorbs ~95%
Kafka (multi-region)	~$80K/month	Tweet topic + fan-out + indexer + ranking

SLOs

Timeline read P99: 200 ms. Cache hit on Redis; cold reads slower but rare.
Tweet post P99: 1 s. Including media upload; fan-out is async.
Fan-out freshness P99: 5 s. Time from tweet posted to tweet visible in followers' timelines.
Search freshness P99: 30 s. Time from tweet posted to indexed.
Availability: 99.95%. ~22 minutes/month of allowable downtime, region-isolated.

9 · Trade-offs & "what would you change at 10×"

If…	Then…
10× users (2.5B DAU)	Region count grows from 4 to 10+. Timeline cache size becomes painful; consider per-region precompute with shorter retention (200 ids instead of 800).
Strict chronological order (no ranking)	Simpler architecture — the ZSET score is just timestamp. Engagement drops 20–30% in published research.
Realtime ranking (every read recomputes)	The ranking model has to run inline. Adds 30–50 ms to P99 timeline; lets you personalise on signals from the last second.
End-to-end encrypted DMs	Out of scope for tweets but the design is fundamentally different — encrypted at the device, server stores ciphertext only. Search becomes impossible without client-side indexes.
"Edit tweet" feature	Tweets become mutable. Fan-out workers re-emit edited versions; downstream caches need invalidation. Easier if you make edits a new tweet that displaces the old in the timeline.
"What would a more senior answer add?"	The trust + safety surface: spam classification on the tweet service hot path, the report-and-takedown pipeline, the misinformation flagging pipeline. Plus the data-engineering layer: a Kafka topic of every tweet, fanned out to data warehousing for analytics, search index, ranking model retraining. Treat the data pipeline as a first-class component, not an afterthought.