Design Instagram
500M DAU, ~1B photo uploads/day, average photo ~4 MB raw and ~250 KB after processing (all four variants combined). Where Twitter is small payloads at high read fan-out, Instagram is the opposite: large payloads at write-heavy scale. Bandwidth is the dominant cost. The architecture is built around an async image-processing pipeline and a CDN that absorbs almost every read.
1 · Clarifying questions
| Question | Answer |
|---|---|
| Functional scope? | Upload photos, follow users, view feed, view profile, like, comment. No stories, no DMs, no shopping. |
| Media size? | Raw upload: 2–8 MB. We serve multiple resolutions (150px, 320px, 640px, 1080px) — total ~250 KB per processed photo. |
| Scale? | 500M DAU. ~1B uploads/day (~11K writes/sec average, ~44K peak). Each photo viewed ~20× → ~20B views/day (~230K reads/sec average, ~1.2M peak). |
| Latency? | Upload: feels-instant feedback (chunked + parallel). Processed-and-visible within 30 s. Feed load P99 ≤ 300 ms. |
| Storage durability? | 11 nines (the photo must not be lost — emotional + sometimes legal weight). |
| Bandwidth? | The dominant cost. Egress is the line item that defines profitability. |
| Multi-region? | Yes. Uploads region-local with cross-region async replication. Reads served by closest edge. |
2 · Capacity math, on a napkin
| Number | Calculation | Result |
|---|---|---|
| DAU | given | 500M |
| Photos/day | 500M × 2 uploads | 1B |
| Upload QPS (avg / peak) | 1B / 86,400 s; peak ≈ 4× avg | ~11K / ~44K |
| View QPS (avg / peak) | 20B / 86,400 s; peak ≈ 5× avg | ~230K / ~1.2M |
| Raw ingest bandwidth (peak) | 44K × 4 MB | ~1.4 Tbps |
| Processed photo size | 4 variants × ~60 KB avg | ~250 KB total |
| Daily raw storage (one copy) | 1B × 4 MB | ~4 PB/day raw |
| Daily processed storage (3× replicas) | 1B × 250 KB × 3 | ~750 TB/day |
| Yearly cold storage | 4 PB/day × 365 | ~1.5 EB/year (tier most after 30 days) |
| Read egress at peak (no CDN) | 1.2M × 200 KB | ~1.9 Tbps origin |
| Read egress with CDN (95% hit) | 1.9 Tbps × 0.05 | ~95 Gbps origin (survivable) |
| CDN egress at peak | 1.2M × 200 KB | ~1.9 Tbps from edge — pay by the TB |
The two headline numbers: 1.4 Tbps ingest at peak (so uploads go straight to object storage, not through your app servers) and the CDN doing ~95% of egress (so the origin sees a tiny fraction of the read traffic).
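The table's arithmetic can be reproduced in a few lines. This is a minimal sketch using the section's own assumptions (4 MB raw, 200 KB served per view, 4× and 5× peak factors, 95% hit ratio); the constant and function names are illustrative. It works from unrounded values, so its results land a few percent away from the table, which rounds at each step (for example ~46K rather than ~44K peak uploads/sec).

```python
# Napkin math for the capacity table. Decimal units (1 MB = 10**6 bytes)
# keep the arithmetic simple; every input is an assumption from the text.
SECONDS_PER_DAY = 86_400
DAU = 500_000_000
UPLOADS_PER_USER = 2
VIEWS_PER_PHOTO = 20
RAW_BYTES = 4 * 10**6        # ~4 MB average raw upload
VIEW_BYTES = 200 * 10**3     # ~200 KB served per view
UPLOAD_PEAK_FACTOR = 4
VIEW_PEAK_FACTOR = 5
CDN_HIT_RATIO = 0.95


def tbps(bytes_per_second: float) -> float:
    """Convert bytes/second to terabits/second."""
    return bytes_per_second * 8 / 1e12


photos_per_day = DAU * UPLOADS_PER_USER
upload_qps_avg = photos_per_day / SECONDS_PER_DAY
upload_qps_peak = upload_qps_avg * UPLOAD_PEAK_FACTOR
view_qps_avg = photos_per_day * VIEWS_PER_PHOTO / SECONDS_PER_DAY
view_qps_peak = view_qps_avg * VIEW_PEAK_FACTOR

ingest_tbps = tbps(upload_qps_peak * RAW_BYTES)
read_tbps = tbps(view_qps_peak * VIEW_BYTES)
origin_gbps = read_tbps * (1 - CDN_HIT_RATIO) * 1000
raw_pb_per_day = photos_per_day * RAW_BYTES / 1e15

print(f"upload QPS avg/peak: {upload_qps_avg:,.0f} / {upload_qps_peak:,.0f}")
print(f"view QPS avg/peak:   {view_qps_avg:,.0f} / {view_qps_peak:,.0f}")
print(f"peak raw ingest:     {ingest_tbps:.2f} Tbps")
print(f"peak read egress:    {read_tbps:.2f} Tbps total, {origin_gbps:.0f} Gbps at origin")
print(f"raw storage:         {raw_pb_per_day:.1f} PB/day")
```

Changing one constant (say, the hit ratio or the average raw size) shows quickly which line items move and by how much.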
3 · API and data model
Upload flow
POST /v1/uploads/init                          # client requests upload
{ "filename": "img.jpg", "size_bytes": 4194304, "content_type": "image/jpeg" }
→ 200 {
  "upload_id": "u_8aB3x9Q2",
  "presigned_url": "https://...s3...?...",     # client uploads directly to object storage
  "parts": 1,                                  # 4 MB fits in one PUT; multipart only for files >5 MB
  "expires_at": "..."
}

POST /v1/uploads/:id/complete                  # after the S3 upload (or multipart complete) finishes
→ 201 { "photo_id": "p_8aB3x9Q2", "status": "processing" }

GET /v1/photos/:id                             # metadata + URLs to processed variants
→ 200 {
  "id": "p_8aB3x9Q2",
  "owner": "u_abc",
  "caption": "...",
  "variants": {
    "150": "https://cdn.../p_8aB3x9Q2/150.jpg",
    "320": "...",
    "640": "...",
    "1080": "..."
  },
  "status": "ready"
}

GET /v1/users/:id/feed                         # follow-graph feed

Storage
photos                                  -- Cassandra; partitioned by owner_id
  photo_id      BIGINT PRIMARY KEY      -- Snowflake; time-ordered
  owner_id      BIGINT
  caption       TEXT
  raw_key       TEXT                    -- S3 key of original
  variant_keys  MAP<TEXT, TEXT>         -- "150" → s3 key, etc.
  status        VARCHAR(16)             -- uploading | processing | ready | failed | rejected
  created_at    TIMESTAMP
  INDEX (owner_id, created_at)

photo_metadata                          -- exif, dims, hashes
  photo_id      BIGINT PRIMARY KEY
  width, height INT
  exif          JSONB
  phash         BIGINT                  -- perceptual hash for dedupe
  sha256        VARCHAR(64)

likes                                   -- one row per (user, photo)
  user_id, photo_id BIGINT
  liked_at      TIMESTAMP
  PRIMARY KEY ((photo_id), user_id)

follow_graph                            -- as Twitter

feed                                    -- precomputed home feed
  user:{id}:feed → ZSET<photo_id, score>
  (kept to ~500 ids)

object_store (S3-shape)
  /raw/{photo_id}                       -- original; cold tier after 30 days
  /variants/{photo_id}/{size}.jpg       -- served via CDN

4 · High-level architecture
Client uploads to S3 directly via a presigned URL — the API server never holds the bytes. On complete, the API enqueues a processing job. Workers fetch the raw image, generate variants, write them back to a different bucket, update metadata, then run moderation and emit a fan-out event. Reads serve from the CDN; the origin only fires on cache miss.
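The last step of that flow, the fan-out event, is what fills the precomputed per-user feed. Below is a rough in-memory sketch of fan-out on write with the ~500-id cap, assuming a plain dict and list in place of the follow graph and the sorted set. `on_photo_ready`, `followers`, and `feeds` are hypothetical names for illustration, not part of the design above.

```python
from collections import defaultdict

FEED_CAP = 500  # the storage sketch keeps roughly 500 ids per user

# Hypothetical in-memory stand-ins for the follow graph and per-user sorted sets.
followers: dict[int, set[int]] = defaultdict(set)             # owner -> follower ids
feeds: dict[int, list[tuple[int, int]]] = defaultdict(list)   # user -> [(score, photo_id)]


def on_photo_ready(photo_id: int, owner_id: int, created_at_ms: int) -> None:
    """Push a ready photo into each follower's feed, newest first, capped."""
    for follower_id in followers[owner_id]:
        feed = feeds[follower_id]
        feed.append((created_at_ms, photo_id))
        # Re-sorting on every insert is for clarity only; a real sorted set
        # keeps order incrementally.
        feed.sort(reverse=True)
        del feed[FEED_CAP:]   # drop everything past the cap (oldest entries)
```

With a Redis-style sorted set the same two steps would roughly correspond to a `ZADD` followed by a `ZREMRANGEBYRANK` trim. Accounts with very large follower counts usually need a different path (fan-out on read), which this sketch does not cover.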
5 · The hard part — the upload + processing pipeline
Why direct-to-S3?
At 1.4 Tbps ingest, you cannot push bytes through your app servers. Two reasons: bandwidth cost (you're paying for ingress twice — client → app, app → S3), and CPU cost (your app servers become I/O-bound babysitters). Presigned URLs let the client write directly to S3. The API only deals in small JSON messages — init request, complete callback, status checks.
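The idea behind a presigned URL is that the API signs a narrow permission (one method, one key, one expiry) and the storage layer verifies it without calling back to the API. The sketch below illustrates that idea with a plain HMAC over the request fields. It is deliberately simplified and is not S3's actual signing scheme (S3 uses AWS Signature Version 4; with boto3 the URL would come from `generate_presigned_url`). `SECRET`, `BASE`, `presign_put`, and `verify` are hypothetical.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-secret"                   # hypothetical signing key shared with storage
BASE = "https://uploads.example.com"      # hypothetical storage endpoint


def _signature(method: str, path: str, expires: int) -> str:
    message = f"{method}\n{path}\n{expires}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def presign_put(key: str, now: int, ttl_seconds: int = 900) -> str:
    """API side: return a URL that authorizes a PUT to `key` until now + ttl."""
    path = f"/raw/{key}"
    expires = now + ttl_seconds
    query = urlencode({"expires": expires, "sig": _signature("PUT", path, expires)})
    return f"{BASE}{path}?{query}"


def verify(method: str, url: str, now: int) -> bool:
    """Storage side: recompute the signature and check the expiry."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires = int(params["expires"][0])
        sig = params["sig"][0]
    except (KeyError, ValueError):
        return False
    expected = _signature(method, parsed.path, expires)
    return now <= expires and hmac.compare_digest(sig, expected)
```

Because the method, path, and expiry are all covered by the signature, a client cannot reuse the URL for a different key, a different verb, or after the deadline.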
Multipart upload
Files > 5 MB are split into 5 MB parts. Parts upload in parallel, and a failed part is retried on its own instead of restarting the whole upload. This is critical on mobile networks, where a 10 MB upload over 4G has a real failure rate. On complete, S3 stitches the parts into one object and the API gets the callback.
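A minimal sketch of the client-side bookkeeping for multipart: split the file into parts, then retry only the part that failed. For brevity it uploads parts sequentially, whereas the text describes parallel uploads; `upload_part` is a hypothetical callable standing in for the real HTTP PUT, and the 5 MiB part size is taken from the text.

```python
PART_SIZE = 5 * 1024 * 1024  # 5 MiB parts, as described above


def plan_parts(size_bytes: int, part_size: int = PART_SIZE) -> list[tuple[int, int, int]]:
    """Return (part_number, offset, length) tuples; part numbers start at 1."""
    if size_bytes <= 0:
        raise ValueError("size_bytes must be positive")
    parts = []
    offset = 0
    while offset < size_bytes:
        length = min(part_size, size_bytes - offset)
        parts.append((len(parts) + 1, offset, length))
        offset += length
    return parts


def upload_with_retry(parts, upload_part, max_attempts: int = 3) -> dict[int, str]:
    """Upload each part, retrying only the part that failed.

    `upload_part(part_number, offset, length)` is a hypothetical callable that
    returns an ETag-like string or raises OSError on a network failure.
    """
    etags = {}
    for number, offset, length in parts:
        for attempt in range(1, max_attempts + 1):
            try:
                etags[number] = upload_part(number, offset, length)
                break
            except OSError:
                if attempt == max_attempts:
                    raise  # give up on the whole upload after repeated failures
    return etags
```

For a 12 MiB file, `plan_parts` yields two full 5 MiB parts and a 2 MiB tail; only the final part is allowed to be shorter than the part size.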
Processing pipeline
# Worker pulls from Kafka topic 'photos_uploaded'.
# s3, dedupe, cassandra, kafka and the helper functions are assumed clients/helpers.
def process(photo_id, owner_id):
    raw = s3.get(f"raw/{photo_id}")
    exif = extract_exif(raw)
    phash = perceptual_hash(raw)            # for dedupe + moderation lookup
    sha = sha256(raw)

    if dedupe.is_known_bad(phash):          # NSFW, CSAM, malware
        mark_status(photo_id, "rejected")
        return

    variants = {}
    for size in [150, 320, 640, 1080]:
        img = resize_and_compress(raw, size, quality=85, format="jpeg")
        key = f"variants/{photo_id}/{size}.jpg"
        s3.put(key, img, cache_control="public,max-age=31536000,immutable")
        variants[str(size)] = key           # string keys match MAP<TEXT, TEXT>

    cassandra.update(photo_id, {
        "variant_keys": variants,
        "phash": phash,
        "sha256": sha,
        "exif": exif,
        "status": "ready",
    })
    kafka.send("photo_ready", {"photo_id": photo_id, "owner_id": owner_id})  # triggers fan-out

Each worker handles ~10 photos/sec (CPU-bound on resize). At 44K peak uploads/sec that is ~4,400 workers at full utilization, so we provision ~5,000 worker pods for headroom. Auto-scale by Kafka lag.
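The worker count is a straightforward division, and lag-based scaling adds one term for draining the backlog. A small sketch under the text's assumption of ~10 photos/sec per worker; the function names and the drain-time parameter are illustrative, not a real autoscaler API.

```python
import math

PHOTOS_PER_WORKER_PER_SEC = 10   # from the text: CPU-bound on resize


def workers_for_rate(uploads_per_sec: float, headroom: float = 0.0) -> int:
    """Workers needed to keep up with a steady arrival rate, plus optional headroom."""
    return math.ceil(uploads_per_sec * (1 + headroom) / PHOTOS_PER_WORKER_PER_SEC)


def workers_for_lag(arrival_per_sec: float, lag_messages: int, drain_seconds: int) -> int:
    """Workers needed to absorb arrivals and drain the current backlog in time."""
    required_rate = arrival_per_sec + lag_messages / drain_seconds
    return math.ceil(required_rate / PHOTOS_PER_WORKER_PER_SEC)
```

At 44K uploads/sec the first function returns 4,400 workers with no headroom, which is where the ~5,000 figure comes from once some slack is added. The second shows why lag is the scaling signal: a backlog of 600K messages that must drain in a minute adds another 1,000 workers on top of the steady-state count.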
The CDN cache-control story
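Variants are written with Cache-Control: public, max-age=31536000, immutable. The URL is built from the photo_id and the variant size, and a variant is written once and never overwritten, so the object at that URL never changes. The CDN can keep it for the full max-age without revalidating. This is what gets the cache-hit ratio to 95%+.
Edits are a new photo_id (or at minimum a new URL with a version suffix). Deletes tombstone the metadata, which removes the photo from feeds and profiles immediately, and then delete the objects from S3. Edge copies persist until they are evicted or purged; once they are gone the URL returns 404. Because max-age is a year, a delete that must take effect quickly (for example a legal takedown) also needs an explicit CDN purge of the variant URLs.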
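One way to keep URLs write-once is to build them from the photo_id, an optional version suffix, and the size, so an edit produces a new URL instead of overwriting an old object. The sketch below assumes a hypothetical hostname and a `_v2`-style suffix format; neither is specified in the text.

```python
CDN_BASE = "https://cdn.example.com"     # hypothetical CDN hostname
CACHE_CONTROL = "public, max-age=31536000, immutable"
SIZES = (150, 320, 640, 1080)


def variant_url(photo_id: str, size: int, version: int = 1) -> str:
    """Build a write-once URL; an edit bumps `version` instead of overwriting."""
    if size not in SIZES:
        raise ValueError(f"unsupported size: {size}")
    if version < 1:
        raise ValueError("version starts at 1")
    suffix = "" if version == 1 else f"_v{version}"   # assumed suffix format
    return f"{CDN_BASE}/{photo_id}{suffix}/{size}.jpg"


def urls_for_delete(photo_id: str, latest_version: int = 1) -> list[str]:
    """Every URL that may still be cached at the edge for this photo."""
    return [
        variant_url(photo_id, size, version)
        for version in range(1, latest_version + 1)
        for size in SIZES
    ]
```

`urls_for_delete` is the list a cleanup or purge job would walk when a photo is removed: four sizes per version, across every version that was ever published.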
6 · Storage tiering
| Tier | Holds | Cost | Access |
|---|---|---|---|
| S3 Standard | Raw + variants for first 30 days | ~$0.023/GB/month | Always-on; ms latency |
| S3 Standard-IA | Variants for 30 days to 1 year | ~$0.012/GB/month | Slight retrieval fee; still ms latency |
| S3 Glacier Instant | Raw originals after 30 days | ~$0.004/GB/month | ms latency but retrieval costs more |
| S3 Glacier Deep Archive | Raw originals after 1 year (rare access) | ~$0.00099/GB/month | Hours to retrieve; only for compliance / restore |
Variants stay in millisecond-latency tiers (Standard, then Standard-IA) because they are the read path. Originals can move to colder tiers because we rarely re-process old photos. The lifecycle policy moves objects automatically based on age.
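The tiering table reduces to a small decision function plus a price lookup. This sketch uses the approximate per-GB prices from the table; the tier keys and function names are illustrative. The table does not say where variants go after one year, so keeping them in Standard-IA is an assumption here.

```python
PRICE_PER_GB_MONTH = {          # approximate prices from the table (USD)
    "standard": 0.023,
    "standard_ia": 0.012,
    "glacier_instant": 0.004,
    "deep_archive": 0.00099,
}


def tier_for(kind: str, age_days: int) -> str:
    """Map an object to a storage tier; kind is 'raw' or 'variant'."""
    if kind not in ("raw", "variant"):
        raise ValueError(f"unknown kind: {kind}")
    if age_days < 30:
        return "standard"
    if kind == "variant":
        return "standard_ia"     # assumption: variants stay here past one year
    return "glacier_instant" if age_days < 365 else "deep_archive"


def monthly_cost_usd(gigabytes: float, tier: str) -> float:
    """Storage-only cost; retrieval and request fees are not modeled."""
    return gigabytes * PRICE_PER_GB_MONTH[tier]
```

As a rough comparison, one petabyte (10^6 GB) of originals costs about $23,000/month in Standard and about $990/month in Deep Archive at these prices, which is why the raw bucket is the first target for lifecycle rules.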
7 · Failure modes & runbook
| Failure | Symptom | Mitigation |
|---|---|---|
| Processing backlog | Kafka lag > 5 min; photos stuck in "processing" | Auto-scale worker pool by lag. Show "still uploading" UI; never block the user's profile. |
| S3 throttle on a hot prefix | 503s from S3 for new uploads in a single prefix | Spread photo_ids across prefixes (high-entropy prefix). Modern S3 auto-shards but the pattern still matters. |
| CDN miss storm | Cold variant requested by N users at once → N origin hits | Edge request coalescing (Cloudflare Tiered Cache, CloudFront Origin Shield). One origin fetch, N edge responses. |
| Moderation false positive | Valid photo rejected; user upset | Two-stage moderation: hard auto-reject for hash-list matches; soft flag for ML positives, human review SLA. |
| Region-failover during upload | Multipart upload spanning regions fails | Region-pinned upload IDs; on region failover, client restarts upload from scratch (state is in the client). |
| Origin overload during influencer event | One photo requested 10M times in 5 minutes | The CDN absorbs almost all of it and an origin shield collapses the remaining misses; in extreme cases pre-warm the CDN by pushing the variants on publish. |
| Storage costs growing faster than revenue | Quarterly review flags raw-bucket cost | Aggressive lifecycle policies; deduplication by phash for spam re-uploads; eventual deletion of inactive accounts. |
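The request-coalescing mitigation for a CDN miss storm can be illustrated in a few lines: concurrent requests for the same key share one in-flight origin fetch. This is a single-process asyncio sketch of the pattern, not how any particular CDN implements it; `Coalescer` and `fetch_origin` are hypothetical names.

```python
import asyncio


class Coalescer:
    """Collapse concurrent fetches of the same key into one origin call."""

    def __init__(self, fetch_origin):
        self._fetch_origin = fetch_origin                 # async callable: key -> bytes
        self._in_flight: dict[str, asyncio.Task] = {}

    async def get(self, key: str) -> bytes:
        task = self._in_flight.get(key)
        if task is None:
            # First requester starts the origin fetch; later ones reuse it.
            task = asyncio.create_task(self._fetch_origin(key))
            self._in_flight[key] = task
            task.add_done_callback(lambda _t: self._in_flight.pop(key, None))
        # shield() keeps one cancelled waiter from cancelling the shared fetch.
        return await asyncio.shield(task)
```

If 100 requests for a cold variant arrive while the first fetch is still running, the origin sees one request and all 100 callers receive the same bytes. Once the fetch completes the entry is dropped, so a real edge would pair this with the cache itself to serve later requests.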
8 · Cost & SLOs
| Line | Estimate | Note |
|---|---|---|
| API tier (1K pods) | ~$50K/month | Stateless; sized for control-plane traffic |
| Processing workers (~5K pods) | ~$200K/month | CPU-bound; bursty |
| S3 storage (~10 EB across tiers) | ~$15M/month | Lifecycle-tiered; would be 3× without tiering |
| CDN egress (~150 PB/month) | ~$3M/month | Negotiated rate; serving the same bytes from origin would cost 5–10× as much |
| Cassandra (metadata) | ~$60K/month | Photo metadata is small per row; rows are many |
| Kafka (multi-region) | ~$80K/month | Photo-uploaded + photo-ready + fan-out topics |
| Moderation (CPU + GPU) | ~$300K/month | ML models on every upload |
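Summing the table shows where the money goes. The figures below are copied from the estimates above; the short labels are just dictionary keys for this sketch.

```python
# Monthly estimates copied from the cost table (USD).
MONTHLY_COST_USD = {
    "API tier": 50_000,
    "Processing workers": 200_000,
    "S3 storage": 15_000_000,
    "CDN egress": 3_000_000,
    "Cassandra": 60_000,
    "Kafka": 80_000,
    "Moderation": 300_000,
}

total = sum(MONTHLY_COST_USD.values())
share = {line: cost / total for line, cost in MONTHLY_COST_USD.items()}

# Print the lines largest-first with their share of the total.
for line, cost in sorted(MONTHLY_COST_USD.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{line:<20} ${cost:>12,} {share[line]:6.1%}")
print(f"{'Total':<20} ${total:>12,}")
```

On these estimates the total is about $18.7M/month, and storage plus CDN egress together account for roughly 96% of it; compute, metadata, and messaging are rounding errors by comparison.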
SLOs
- Upload "feels instant". First-byte to client < 200 ms (the presigned URL request). Bytes go straight to S3 — client measures upload speed.
- Photo ready P99: 30 s. From upload complete to visible in profile.
- Feed load P99: 300 ms. Feed metadata read; variant URLs included; CDN handles the bytes.
- CDN hit ratio ≥ 95%. Anything lower and the origin egress bill explodes.
- Durability: 11 nines. S3-class. Cross-region replication for hot data; cross-region async for cold.
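The sensitivity behind the hit-ratio SLO is easy to see numerically: origin traffic scales with the miss rate, not the hit rate. A small sketch using the peak figures from the capacity table (1.2M views/sec, 200 KB per view); the function name is illustrative.

```python
PEAK_VIEW_QPS = 1_200_000     # from the capacity table
BYTES_PER_VIEW = 200_000      # ~200 KB served per view


def origin_gbps(hit_ratio: float) -> float:
    """Peak origin egress in Gbps for a given CDN hit ratio."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    total_bits_per_sec = PEAK_VIEW_QPS * BYTES_PER_VIEW * 8
    return total_bits_per_sec * (1 - hit_ratio) / 1e9


for ratio in (0.99, 0.95, 0.90, 0.80):
    print(f"hit ratio {ratio:.0%}: {origin_gbps(ratio):7.1f} Gbps at origin")
```

Dropping from 95% to 90% looks like a five-point change but doubles the miss rate, and with it the origin bandwidth, which is why the SLO is framed as a floor.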
9 · Trade-offs & "what would you change at 10×"
| If… | Then… |
|---|---|
| 10× users (5B DAU) | The math holds because the architecture scales linearly. The pinch point becomes per-region CDN capacity; provision multiple CDN providers per region for redundancy and pricing leverage. |
| Video instead of photos | Variants become bitrate ladders (HLS / DASH). Processing is 100× more expensive per item. Storage 10–50× larger. The CDN strategy stays — segments are equally cacheable. |
| Real-time filters (AR-style) | Processing moves to the client. Server stores the already-filtered output. Reduces server CPU; needs careful capability detection per device. |
| True end-to-end encryption | Storage tiering still works — ciphertext is cacheable. Server-side moderation becomes impossible; safety moves to the device or to opt-in scanning at upload. |
| Edit history (every version preserved) | Each edit creates a new photo_id linked to the parent; storage grows but stays cheap. CDN strategy unchanged. |
| "What would a more senior answer add?" | The trust + safety layer in earnest: CSAM hashing pipeline (PhotoDNA-style), proactive ML scanning, jurisdiction-aware reporting. Plus the data-residency story for GDPR (EU photos stored only in EU regions, with the lifecycle policies and CDN configuration that enforces it). Plus the chargeback/cost-attribution system that lets product leaders see "feature X costs $Y/month" — at this scale the cost story IS the product story. |
Further reading
- Instagram Engineering blog — multiple posts on Cassandra, sharding, the photo pipeline. The clearest first-party source on this design.
- "What Powers Instagram: Hundreds of Instances, Dozens of Technologies". An early but still-instructive overview of the stack.
- "Sharding & IDs at Instagram". Their Snowflake-style ID scheme; relevant to the photo_id field used throughout.
- Adjacent: Object storage. The S3-shape underneath.
- Adjacent: CDN. The pattern that makes the read path affordable.
- Adjacent: News feed. The feed delivery; same pattern, different payload.
- Adjacent: Napkin math. The Instagram preset is in the live worksheet.