Design Instagram

500M DAU, ~1B photo uploads/day; the average photo is ~4 MB raw and ~250 KB after processing (all variants combined). Where Twitter is small payloads at high read fan-out, Instagram is the opposite: large payloads at heavy write volume. Bandwidth is the dominant cost. The architecture is built around an async image-processing pipeline and a CDN that absorbs almost every read.

1 · Clarifying questions

Functional scope?	Upload photos, follow users, view feed, view profile, like, comment. No stories, no DMs, no shopping.
Media size?	Raw upload: 2–8 MB. We serve multiple resolutions (150px, 320px, 640px, 1080px) — total ~250 KB per processed photo.
Scale?	500M DAU. ~1B uploads/day (~11K writes/sec average, ~44K peak). Each photo viewed ~20× → ~20B views/day (~230K reads/sec, ~1.2M peak).
Latency?	Upload: feels-instant feedback (chunked + parallel). Processed-and-visible within 30 s. Feed load P99 ≤ 300 ms.
Storage durability?	11 nines (the photo must not be lost — emotional + sometimes legal weight).
Bandwidth?	The dominant cost. Egress is the line item that defines profitability.
Multi-region?	Yes. Uploads region-local with cross-region async replication. Reads served by closest edge.

2 · Capacity math, on a napkin

Number	Calculation	Result
DAU	given	500M
Photos/day	500M × 2 uploads	1B
Upload QPS (avg / peak)	1B / 86,400; peak = 4× avg	~11K / ~44K
View QPS (avg / peak)	20B / 86,400; peak = 5× avg	~230K / ~1.2M
Raw ingest bandwidth (peak)	44K × 4 MB	~1.4 Tbps
Processed photo size	4 variants × ~60 KB avg	~250 KB total
Daily raw storage (one copy)	1B × 4 MB	~4 PB/day raw
Daily processed storage (3× replicas)	1B × 250 KB × 3	~750 TB/day
Yearly cold storage	4 PB/day × 365	~1.5 EB/year (tier most after 30 days)
Read egress at peak (no CDN)	1.2M × 200 KB	~1.9 Tbps origin
Read egress with CDN (95% hit)	×0.05	~95 Gbps origin — survivable
CDN egress at peak	1.2M × 200 KB	~1.9 Tbps from edge — pay by the TB

The two headline numbers: 1.4 Tbps ingest at peak (so uploads go straight to object storage, not through your app servers) and the CDN doing ~95% of egress (so the origin sees a tiny fraction of the read traffic).
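The table can be re-derived in a few lines. This is a sketch of the same arithmetic in decimal units (1 MB = 10^6 bytes), using the table's own peak multipliers (4× for uploads, 5× for views). Without the intermediate rounding the results land slightly off the rounded table values (about 1.48 Tbps ingest and about 93 Gbps origin), which is close enough for a napkin.

```python
# Napkin capacity math in decimal units (1 MB = 10**6 bytes).
SECONDS_PER_DAY = 86_400

dau = 500e6
uploads_per_day = dau * 2                 # 1B photos/day
views_per_day = uploads_per_day * 20      # each photo viewed ~20x

upload_qps_avg = uploads_per_day / SECONDS_PER_DAY
upload_qps_peak = upload_qps_avg * 4      # peak = 4x average
view_qps_avg = views_per_day / SECONDS_PER_DAY
view_qps_peak = view_qps_avg * 5          # peak = 5x average

RAW_BYTES = 4e6          # average raw upload
SERVED_BYTES = 200e3     # bytes served per view
PROCESSED_BYTES = 250e3  # all variants of one photo
CDN_HIT_RATIO = 0.95


def tbps(bytes_per_sec: float) -> float:
    """Convert bytes/sec to terabits/sec."""
    return bytes_per_sec * 8 / 1e12


ingest_peak_tbps = tbps(upload_qps_peak * RAW_BYTES)
edge_peak_tbps = tbps(view_qps_peak * SERVED_BYTES)
origin_peak_gbps = edge_peak_tbps * 1000 * (1 - CDN_HIT_RATIO)
raw_pb_per_day = uploads_per_day * RAW_BYTES / 1e15
processed_tb_per_day = uploads_per_day * PROCESSED_BYTES * 3 / 1e12  # 3 replicas

print(f"upload QPS avg/peak: {upload_qps_avg:,.0f} / {upload_qps_peak:,.0f}")
print(f"ingest peak: {ingest_peak_tbps:.2f} Tbps, origin peak: {origin_peak_gbps:.0f} Gbps")
```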

3 · API and data model

Upload flow

POST /v1/uploads/init                  # client requests upload
{ "filename": "img.jpg", "size_bytes": 4194304, "content_type": "image/jpeg" }
→ 200 {
  "upload_id": "u_8aB3x9Q2",
  "presigned_url": "https://...s3...?...",  # client uploads directly to object storage
  "parts": 1,                                  # 4 MB fits in one part; files >5 MB are split
  "expires_at": "..."
}

POST /v1/uploads/:id/complete          # after S3 multipart complete
→ 201 { "photo_id": "p_8aB3x9Q2", "status": "processing" }

GET  /v1/photos/:id                    # metadata + URLs to processed variants
→ 200 {
  "id": "p_8aB3x9Q2",
  "owner": "u_abc",
  "caption": "...",
  "variants": {
    "150":  "https://cdn.../p_8aB3x9Q2/150.jpg",
    "320":  "...",
    "640":  "...",
    "1080": "..."
  },
  "status": "ready"
}

GET  /v1/users/:id/feed                # follow-graph feed

Storage

photos                           -- Cassandra; partitioned by photo_id
  photo_id      BIGINT PRIMARY KEY  -- Snowflake; time-ordered
  owner_id      BIGINT
  caption       TEXT
  raw_key       TEXT                 -- S3 key of original
  variant_keys  MAP<TEXT, TEXT>      -- "150" → s3 key, etc.
  status        TEXT                 -- uploading | processing | ready | rejected | failed
  created_at    TIMESTAMP
  -- profile view reads a lookup table: PRIMARY KEY ((owner_id), created_at, photo_id)

photo_metadata                   -- exif, dims, hashes
  photo_id      BIGINT PRIMARY KEY
  width, height INT
  exif          JSONB
  phash         BIGINT               -- perceptual hash for dedupe
  sha256        VARCHAR(64)

likes                            -- one row per (user, photo)
  user_id, photo_id BIGINT
  liked_at TIMESTAMP
  PRIMARY KEY ((photo_id), user_id)

follow_graph                     -- as Twitter

feed                             -- precomputed home feed
  user:{id}:feed → ZSET<photo_id, score>
                   (kept to ~500 ids)

object_store (S3-shape)
  /raw/{photo_id}                -- original; cold tier after 30 days
  /variants/{photo_id}/{size}.jpg -- served via CDN
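The precomputed feed in the storage model is a capped, score-ordered set. Below is a minimal in-memory sketch of that behaviour, assuming the score is the photo's creation timestamp. The class name `CappedFeed` is hypothetical and stands in for a Redis sorted set, where the same trim is usually a ZADD followed by ZREMRANGEBYRANK.

```python
class CappedFeed:
    """In-memory stand-in for a capped sorted set of photo ids (illustrative only)."""

    def __init__(self, cap: int = 500):
        self.cap = cap
        self._scores: dict[int, int] = {}   # photo_id -> score

    def add(self, photo_id: int, score: int) -> None:
        # Re-adding an id updates its score, as a sorted set would.
        self._scores[photo_id] = score
        if len(self._scores) > self.cap:
            oldest = min(self._scores, key=self._scores.get)
            del self._scores[oldest]        # evict the lowest-scored entry

    def newest(self, n: int) -> list[int]:
        """Return up to n photo ids, highest score first."""
        ranked = sorted(self._scores, key=self._scores.get, reverse=True)
        return ranked[:n]


feed = CappedFeed(cap=3)
for photo_id, created_at in [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50)]:
    feed.add(photo_id, created_at)
print(feed.newest(10))   # only the three newest ids survive the cap
```

The cap keeps per-user memory bounded; older pages fall back to a query against the photo metadata store.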

4 · High-level architecture

Client uploads to S3 directly via a presigned URL, so the API server never holds the bytes. On complete, the API enqueues a processing job. Workers fetch the raw image, generate variants, write them back under a separate variants/ prefix, update metadata, then run moderation and emit a fan-out event. Reads serve from the CDN; the origin only fires on cache miss.

5 · The hard part — the upload + processing pipeline

Why direct-to-S3?

At 1.4 Tbps ingest, you cannot push bytes through your app servers. Two reasons: bandwidth cost (you're paying for ingress twice — client → app, app → S3), and CPU cost (your app servers become I/O-bound babysitters). Presigned URLs let the client write directly to S3. The API only deals in small JSON messages — init request, complete callback, status checks.
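The idea behind a presigned URL can be sketched with a plain HMAC: the API signs (method, path, expiry) and the storage layer verifies the signature without calling back to the API. This is an illustration of the concept only, not the S3 signing algorithm; the host, the `SECRET` key, and the `presign` / `verify` helpers are all hypothetical.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-signing-key"              # hypothetical shared secret
HOST = "https://objects.example.com"      # hypothetical object-store host


def _signature(path: str, expires_at: int) -> str:
    message = f"PUT\n{path}\n{expires_at}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def presign(path: str, expires_at: int) -> str:
    """API side: return a URL that authorizes a PUT to `path` until `expires_at`."""
    query = urlencode({"expires": expires_at, "sig": _signature(path, expires_at)})
    return f"{HOST}{path}?{query}"


def verify(url: str, now: int) -> bool:
    """Storage side: accept only an unexpired URL with a matching signature."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires_at = int(params["expires"][0])
        sig = params["sig"][0]
    except (KeyError, ValueError):
        return False
    expected = _signature(parsed.path, expires_at)
    return now <= expires_at and hmac.compare_digest(sig, expected)
```

Because the path is part of the signed message, a client cannot reuse the URL to write to another key, and the expiry bounds how long a leaked URL is useful.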

Multipart upload

Files larger than 5 MB are split into 5 MB parts. Parts upload in parallel, and a failed part is retried on its own rather than restarting the whole upload. This matters on mobile networks, where a 10 MB upload over 4G has a real failure rate. After the last part lands, S3 stitches the parts into one object and the client calls /v1/uploads/:id/complete.
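A client-side sketch of the part planning and per-part retry described above. It assumes 5 MiB parts (S3's minimum part size, with a smaller last part allowed) and a caller-supplied `upload_part` callable; both helper names are hypothetical. Parts are uploaded sequentially here to keep the retry logic visible; a real client would run them in parallel.

```python
PART_SIZE = 5 * 1024 * 1024   # 5 MiB; the last part may be smaller


def plan_parts(size_bytes: int, part_size: int = PART_SIZE) -> list[tuple[int, int, int]]:
    """Return (part_number, start, end_exclusive) for each part; numbering starts at 1."""
    if size_bytes <= 0:
        raise ValueError("size_bytes must be positive")
    parts = []
    offset, number = 0, 1
    while offset < size_bytes:
        end = min(offset + part_size, size_bytes)
        parts.append((number, offset, end))
        offset, number = end, number + 1
    return parts


def upload_all(parts, upload_part, max_attempts: int = 3) -> None:
    """Upload every part, retrying only the part that failed."""
    for part in parts:
        for attempt in range(1, max_attempts + 1):
            try:
                upload_part(part)
                break
            except OSError:
                if attempt == max_attempts:
                    raise   # give up on this upload after repeated failures
```

A 4 MiB photo yields a single part, while a 12 MiB photo yields two full parts and a 2 MiB tail.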

Processing pipeline

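# Worker pulls from Kafka topic 'photos_uploaded'.
# s3, cassandra, kafka, dedupe and the image helpers are injected clients,
# shown as pseudocode-level interfaces rather than a specific SDK.
# Assumes the message carries owner_id alongside photo_id.
def process(photo_id, owner_id):
  raw = s3.get(f"raw/{photo_id}")
  exif = extract_exif(raw)
  phash = perceptual_hash(raw)               # for dedupe + moderation lookup
  sha = sha256(raw)

  if dedupe.is_known_bad(phash):              # hash-list match (NSFW, CSAM): hard reject
    mark_status(photo_id, "rejected")
    return

  variants = {}
  for size in [150, 320, 640, 1080]:
    img = resize_and_compress(raw, size, quality=85, format="jpeg")
    key = f"variants/{photo_id}/{size}.jpg"
    s3.put(key, img, cache_control="public,max-age=31536000,immutable")
    variants[str(size)] = key                 # MAP<TEXT, TEXT> keys are strings

  # Hashes and EXIF live in photo_metadata; serving state lives in photos.
  cassandra.update("photo_metadata", photo_id, {
    "phash": phash,
    "sha256": sha,
    "exif": exif
  })
  cassandra.update("photos", photo_id, {
    "variant_keys": variants,
    "status": "ready"                          # flip to ready only after variants exist
  })

  kafka.send("photo_ready", {"photo_id": photo_id, "owner_id": owner_id})  # triggers fan-out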

Each worker handles ~10 photos/sec (CPU-bound on resize). At 44K peak uploads/sec that is ~4,400 workers, so plan for ~5,000 pods with headroom. Auto-scale on Kafka lag.
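The sizing rule can be written as a small function: enough workers to keep up with arrivals plus enough to drain the current backlog inside the 30 s ready target. This is a sketch under stated assumptions; the function name, the 15% headroom, and the choice of 30 s as the drain window are illustrative, not a specific autoscaler's API.

```python
import math


def desired_workers(arrival_per_sec: float,
                    lag_messages: int,
                    per_worker_per_sec: float = 10.0,
                    drain_seconds: float = 30.0,
                    headroom: float = 1.15) -> int:
    """Workers needed to match arrivals and clear the backlog within drain_seconds."""
    required_rate = arrival_per_sec + lag_messages / drain_seconds
    return math.ceil(required_rate * headroom / per_worker_per_sec)


# Peak arrivals with no backlog, then the same peak with 300K messages of lag.
print(desired_workers(44_000, 0))
print(desired_workers(44_000, 300_000))
```

Scaling on lag rather than CPU matters here because a backlog keeps workers pinned at 100% CPU regardless of how far behind they are.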

The CDN cache-control story

Variants are written with Cache-Control: public, max-age=31536000, immutable. The URL is derived from the photo_id and the variant size, and a variant is written once and never overwritten, so the object at that URL never changes. The CDN can hold it for the full max-age. This is what gets the cache-hit ratio to 95%+.

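Edits are a new photo_id (or at minimum a new URL with a version suffix). Deletes tombstone the metadata, so the photo drops out of feeds and profiles immediately, and then remove the objects from S3, after which uncached URLs return 404. Copies already cached at the edge can outlive the delete for as long as the CDN keeps them, so deletes that must take effect promptly (takedowns, for example) also need an explicit CDN purge.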
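A sketch of how immutable variant URLs might be built, including the version-suffix option for edits. The CDN host and the `_v{n}` suffix format are assumptions for illustration; the key layout follows the variants/{photo_id}/{size}.jpg shape used in the storage model.

```python
CDN_BASE = "https://cdn.example.com"     # hypothetical CDN host
SIZES = (150, 320, 640, 1080)
CACHE_CONTROL = "public, max-age=31536000, immutable"


def variant_key(photo_id: str, size: int, version: int = 0) -> str:
    """Object key for one variant; a non-zero version yields a brand-new key."""
    if size not in SIZES:
        raise ValueError(f"unsupported size: {size}")
    suffix = f"_v{version}" if version else ""   # assumed suffix format
    return f"variants/{photo_id}{suffix}/{size}.jpg"


def variant_urls(photo_id: str, version: int = 0) -> dict[str, str]:
    """Map each size (as a string) to its CDN URL."""
    return {str(s): f"{CDN_BASE}/{variant_key(photo_id, s, version)}" for s in SIZES}


original = variant_urls("p_8aB3x9Q2")
edited = variant_urls("p_8aB3x9Q2", version=1)
print(original["640"])
print(edited["640"])
```

Because an edit changes the URL instead of the bytes behind it, no cached entry ever needs to be invalidated for an edit.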

6 · Storage tiering

Tier	Holds	Cost	Access
S3 Standard	Raw + variants for first 30 days	~$0.023/GB/month	Always-on; ms latency
S3 Standard-IA	Variants for 30 days to 1 year	~$0.012/GB/month	Slight retrieval fee; still ms latency
S3 Glacier Instant	Raw originals after 30 days	~$0.004/GB/month	ms latency but retrieval costs more
S3 Glacier Deep Archive	Raw originals after 1 year (rare access)	~$0.00099/GB/month	Hours to retrieve; only for compliance / restore

Variants are kept hot — they're the read path. Originals can move to colder tiers because we rarely re-process old photos. The lifecycle policy moves objects automatically based on age.
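The tiering table reduces to a lookup by object kind and age. A sketch, using the S3 storage-class identifiers and the per-GB prices from the table; it assumes variants stay in Standard-IA after one year, since the table does not say otherwise, and the function names are hypothetical.

```python
# USD per GB-month, taken from the table above.
PRICE_PER_GB = {
    "STANDARD": 0.023,
    "STANDARD_IA": 0.012,
    "GLACIER_IR": 0.004,
    "DEEP_ARCHIVE": 0.00099,
}


def storage_tier(kind: str, age_days: int) -> str:
    """Pick a storage class for a 'raw' or 'variant' object of the given age."""
    if kind == "raw":
        if age_days < 30:
            return "STANDARD"
        if age_days < 365:
            return "GLACIER_IR"
        return "DEEP_ARCHIVE"
    if kind == "variant":
        # Assumption: variants never leave millisecond-access tiers.
        return "STANDARD" if age_days < 30 else "STANDARD_IA"
    raise ValueError(f"unknown object kind: {kind}")


def monthly_cost_usd(gigabytes: float, tier: str) -> float:
    return gigabytes * PRICE_PER_GB[tier]


# One day of raw uploads (4 PB = 4e6 GB) priced in each tier it passes through.
for age in (0, 60, 400):
    tier = storage_tier("raw", age)
    print(age, tier, round(monthly_cost_usd(4e6, tier)))
```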

7 · Failure modes & runbook

Failure	Symptom	Mitigation
Processing backlog	Kafka lag > 5 min; photos stuck in "processing"	Auto-scale worker pool by lag. Show "still uploading" UI; never block the user's profile.
S3 throttle on a hot prefix	503s from S3 for new uploads in a single prefix	Spread photo_ids across prefixes (high-entropy prefix). Modern S3 auto-shards but the pattern still matters.
CDN miss storm	Cold variant requested by N users at once → N origin hits	Edge request coalescing (Cloudflare Argo, CloudFront origin shield). One origin fetch, N edge responses.
Moderation false positive	Valid photo rejected; user upset	Two-stage moderation: hard auto-reject for hash-list matches; soft flag for ML positives, human review SLA.
Region-failover during upload	Multipart upload spanning regions fails	Region-pinned upload IDs; on region failover, client restarts upload from scratch (state is in the client).
Origin overload during influencer event	One photo requested 10M times in 5 minutes	CDN absorbs; origin shield further; in extreme cases pre-warm the CDN by pushing the variants on publish.
Storage costs growing faster than revenue	Quarterly review flags raw-bucket cost	Aggressive lifecycle policies; deduplication by phash for spam re-uploads; eventual deletion of inactive accounts.
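The request coalescing named in the CDN miss storm row is the single-flight pattern: the first request for a key becomes the leader and fetches from origin, and concurrent requests for the same key wait for that result. A minimal threaded sketch, assuming a caller-supplied `fetch` function; the `SingleFlight` class is illustrative and not any CDN's implementation.

```python
import threading


class SingleFlight:
    """Collapse concurrent fetches for the same key into one origin call."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._lock = threading.Lock()
        self._inflight = {}   # key -> (done_event, result_holder)

    def get(self, key):
        with self._lock:
            entry = self._inflight.get(key)
            leader = entry is None
            if leader:
                entry = (threading.Event(), {})
                self._inflight[key] = entry
        done, holder = entry

        if leader:
            try:
                holder["value"] = self._fetch(key)   # the only origin hit
            except Exception as exc:
                holder["error"] = exc                 # followers see the same failure
            finally:
                with self._lock:
                    del self._inflight[key]
                done.set()
        else:
            done.wait()                               # follower: wait for the leader

        if "error" in holder:
            raise holder["error"]
        return holder["value"]
```

Once the leader finishes, the entry is removed, so a later miss for the same key triggers a fresh fetch; holding the result longer is the cache's job, not the coalescer's.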

8 · Cost & SLOs

Line	Estimate	Note
API tier (1K pods)	~$50K/month	Stateless; sized for control-plane traffic
Processing workers (~5K pods)	~$200K/month	CPU-bound; bursty
S3 storage (~10 EB across tiers)	~$15M/month	Lifecycle-tiered; would be 3× without tiering
CDN egress (~150 PB/month)	~$3M/month	Negotiated rate; serving the same bytes from origin would cost 5–10× as much
Cassandra (metadata)	~$60K/month	Photo metadata is small per row; rows are many
Kafka (multi-region)	~$80K/month	Photo-uploaded + photo-ready + fan-out topics
Moderation (CPU + GPU)	~$300K/month	ML models on every upload
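A quick sanity check on the cost table: sum the line items, see what share storage and CDN take, and cross-check the egress volume against the view count. The figures are the table's own estimates; the 30-day month and the 200 KB per view are the same assumptions used in the capacity math.

```python
# Monthly estimates in USD, copied from the table above.
monthly_usd = {
    "api": 50_000,
    "workers": 200_000,
    "s3_storage": 15_000_000,
    "cdn_egress": 3_000_000,
    "cassandra": 60_000,
    "kafka": 80_000,
    "moderation": 300_000,
}

total_usd = sum(monthly_usd.values())
bytes_share = (monthly_usd["s3_storage"] + monthly_usd["cdn_egress"]) / total_usd

# 20B views/day at 200 KB each over a 30-day month, in PB.
egress_pb_per_month = 20e9 * 200e3 * 30 / 1e15

# Implied CDN rate if the bill covers ~150 PB (150e6 GB).
cdn_usd_per_gb = monthly_usd["cdn_egress"] / 150e6

print(f"total: ${total_usd:,}/month, storage + CDN share: {bytes_share:.0%}")
print(f"egress from views: {egress_pb_per_month:.0f} PB/month, ~${cdn_usd_per_gb:.3f}/GB")
```

Views alone account for about 120 PB/month, which is consistent with the ~150 PB line once other traffic is included, and storing and moving bytes makes up the overwhelming majority of the bill.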

SLOs

Upload "feels instant". First-byte to client < 200 ms (the presigned URL request). Bytes go straight to S3 — client measures upload speed.
Photo ready P99: 30 s. From upload complete to visible in profile.
Feed load P99: 300 ms. Feed metadata read; variant URLs included; CDN handles the bytes.
CDN hit ratio ≥ 95%. Anything lower and the origin egress bill explodes.
Durability: 11 nines. S3-class. Cross-region replication for hot data; cross-region async for cold.

9 · Trade-offs & "what would you change at 10×"

If…	Then…
10× users (5B DAU)	The math holds: the architecture scales linearly. The pinch point becomes per-region CDN capacity; provision multiple CDN providers per region for redundancy and pricing leverage.
Video instead of photos	Variants become bitrate ladders (HLS / DASH). Processing is 100× more expensive per item. Storage 10–50× larger. The CDN strategy stays — segments are equally cacheable.
Real-time filters (AR-style)	Processing moves to the client. Server stores the already-filtered output. Reduces server CPU; needs careful capability detection per device.
True end-to-end encryption	Storage tiering still works — ciphertext is cacheable. Server-side moderation becomes impossible; safety moves to the device or to opt-in scanning at upload.
Edit history (every version preserved)	Each edit creates a new photo_id linked to the parent; storage grows, but cheaply. CDN strategy unchanged.
"What would a more senior answer add?"	The trust + safety layer in earnest: CSAM hashing pipeline (PhotoDNA-style), proactive ML scanning, jurisdiction-aware reporting. Plus the data-residency story for GDPR (EU photos stored only in EU regions, with the lifecycle policies and CDN configuration that enforces it). Plus the chargeback/cost-attribution system that lets product leaders see "feature X costs $Y/month" — at this scale the cost story IS the product story.