Object storage
An S3-shape blob store. Buckets and keys on the outside; a metadata plane and a data plane on the inside. The interesting parts are how durability is bought (erasure coding vs replication), how a 5 TB upload is delivered in one logical PUT (multipart), and how listings became strongly consistent in 2020 without an outage. The shape generalises: photo storage, log archives, container registries, ML training corpora — most large blob systems end up here.
1 · Clarifying questions
Five minutes. Durability targets and object size range matter most — they drive the erasure scheme, the chunk size, and the metadata footprint.
| Durability target? | Eleven nines (99.999999999%) is the S3 marketing number. Nine nines is achievable with cheaper schemes and is honest for most workloads — confirm. |
| Availability target? | 99.99% for reads, 99.9% for writes is typical. Writes are tighter to justify because they require a quorum of new shards. |
| Object size range? | 1 byte to 5 TB. The range matters: tiny objects burn metadata; huge objects need multipart and parallel transfer. |
| Read / write ratio? | 10:1 typical. 50:1 for media-heavy workloads (video, training data). Affects how aggressively we cache the metadata plane. |
| Listing semantics? | S3 went from eventual to strong in 2020. Strong is the right default now; eventual lets us cache list responses for longer. |
| Multi-region replication? | Active-passive is the common default (asynchronous, RPO ≈ 15 minutes). Active-active is much harder and rarely needed. |
| Versioning? | Yes — overwrites become new versions, deletes are tombstones. Costs storage but fixes the "I overwrote prod by accident" class of bug. |
| Access patterns? | Hot tier (frequent reads), warm tier, cold tier (Glacier-equivalent, minutes-to-hours retrieval). Drives the placement and pricing model. |
2 · Capacity math, on a napkin
S3-scale public numbers. They sound enormous but they decompose into well-understood subsystems — a sharded KV for metadata, a fleet of chunk servers for bytes.
| Number | Calculation | Result |
|---|---|---|
| Total objects | given (S3-scale) | 100B |
| Average object size | given | 1 MB |
| Raw bytes | 100B × 1 MB | 100 PB |
| With erasure overhead | × 1.5 (6+3 Reed-Solomon) | 150 PB |
| PUT QPS (sustained) | given | 100K / sec |
| GET QPS (peak) | given | 1M / sec |
| Metadata rows | one per object | 100B |
| Metadata size / row | bucket + key + chunk map + ACL | ~1 KB |
| Metadata total | 100B × 1 KB | ~100 TB |
| Metadata shards | 100 TB / 1 TB per shard | ~100 shards (round up for headroom) |
| Chunk servers | 150 PB / 100 TB per node | ~1,500 nodes (then × replication factor for the cluster) |
100 TB of metadata sounds large but it's tractable in a sharded KV. The bytes are the expensive line — storage cost dominates the bill, not request cost.
3 · API and data model
S3-compatible on the outside. Inside, the metadata plane and the data plane are separate services with separate scaling stories.
Endpoints
PUT /:bucket/:key # single-shot upload (≤ 5 GB)
body: object bytes
headers: Content-Length, Content-Type, x-amz-meta-*
→ 200 {"etag":"...", "version":"..."}
POST /:bucket/:key?uploads # initiate multipart
→ 200 {"upload_id":"..."}
PUT /:bucket/:key?partNumber=N&uploadId=... # upload a part (5 MB – 5 GB)
→ 200 {"etag":"part-etag"}
POST /:bucket/:key?uploadId=... # complete (list of part etags)
→ 200 {"etag":"final", "version":"..."}
GET /:bucket/:key # download (supports Range)
→ 200 body
DELETE /:bucket/:key # tombstone (or new delete-marker version)
→ 204
GET /:bucket?prefix=&marker=&max-keys=
→ 200 list of keys, strongly consistentSchema (metadata plane, sharded KV)
object_metadata
bucket STRING
key STRING -- arbitrary UTF-8, up to 1024 bytes
version STRING -- monotonic per (bucket, key)
size BIGINT
etag STRING -- MD5 (single-shot) or composite (multipart)
content_type STRING
chunk_map LIST<chunk_ref> -- ordered list of chunk IDs + offsets
acl STRING -- inline if small, else reference
created_at TIMESTAMP
is_delete_marker BOOLEAN
storage_class STRING -- HOT / WARM / COLD
PRIMARY KEY ((bucket, key), version DESC)
shard by hash(bucket, key)
chunk_locations
chunk_id STRING (UUID)
shards LIST<shard_ref> -- 9 entries for 6+3, with rack + AZ
size INT
PRIMARY KEY (chunk_id)
shard by hash(chunk_id)Two tables, two sharding keys. object_metadata answers "what chunks make up this object". chunk_locations answers "where on disk does this chunk live". The split lets the placement service rebalance chunks without touching object metadata.
4 · High-level architecture
Two planes that share an API gateway. The metadata plane is small, hot, and sharded. The data plane is large, cold(er), and parallel.
Data flows separately from metadata. The gateway authenticates; the router hashes; the metadata service hands back chunk locations; the client (or the gateway, for small objects) streams bytes directly to chunk servers. The repair scanner runs continuously in the background, rebuilding shards from parity wherever it finds gaps.
5 · Durability — erasure coding and placement
How eleven nines is actually paid for. Triple replication is the simple answer; erasure coding is the answer at scale.
| Scheme | Overhead | Failure tolerance | Note |
|---|---|---|---|
| 3× replication | 200% (3 GB for 1 GB) | any 2 of 3 copies | Simple. Used by HDFS by default. Cheap on small objects. |
| Reed-Solomon 6+3 | 50% (1.5 GB for 1 GB) | any 3 of 9 shards | Each chunk → 6 data shards + 3 parity. The S3-class default. |
| Reed-Solomon 10+4 | 40% | any 4 of 14 | Better overhead, worse repair throughput. Used for archival tiers. |
The recommendation: Reed-Solomon 6+3 across 3 AZs. Each chunk becomes 9 shards — 6 data, 3 parity — placed 3 per AZ. The system tolerates the loss of any 3 shards simultaneously, which covers a disk, a rack, or an entire AZ. Storage overhead is 50% versus 200% for replication; for a 100 PB store that is a 150 PB bill rather than 300 PB.
LIST for seconds. Strong
listings require the metadata shards to serve the list directly, not an asynchronous
projection.6 · Multipart upload
A 5 TB object cannot be a single HTTP request — too much sits in flight, retries are catastrophic, and no client survives a mid-flight TCP drop on a 12-hour upload. Multipart breaks the upload into independently-uploaded parts that the server stitches together at the end.
- Initiate. Client POSTs to get an
upload_id. Server records "in-progress upload" in the metadata plane. - Upload parts in parallel. Each part is 5 MB – 5 GB. Parts can upload in any order, can be retried independently, and each returns an etag. Up to 10,000 parts per upload.
- Complete. Client sends the list of part numbers + etags. Server validates each part exists, stitches the chunk_map, and writes the final metadata in one commit. The composite etag is the MD5 of the concatenated part etags (which is why S3 etags for multipart objects aren't the file's MD5).
- Abort. Failed uploads can be explicitly aborted, or the GC scanner sweeps parts older than ~7 days. Without GC, abandoned multipart uploads accumulate forever as billed dark storage.
7 · Failure modes & runbook
The operational story. Most of these are continuous background processes rather than page-oncall events — the system is designed to absorb constant low-level failure.
| Failure | Symptom | Mitigation |
|---|---|---|
| Disk failure (continuous) | Shards on that disk unavailable | Repair scanner detects, rebuilds missing shards from parity onto a fresh disk. Continuous; no page unless rebuild rate falls behind failure rate. |
| Rack failure | ~40 disks gone at once | Placement spreads shards across racks; no single chunk loses more than ~1 shard per rack. Repair runs at higher priority. |
| AZ failure | 1/3 of shards offline | 6+3 across 3 AZs tolerates losing one AZ entirely. Reads served from the surviving 6 shards (still enough to reconstruct). |
| Hot prefix | One bucket prefix takes disproportionate metadata traffic | Recommend randomised prefixes in client SDKs; shard the prefix index further; cache list responses at the gateway. |
| Orphaned chunks | Multipart upload abandoned mid-flight | GC scanner sweeps chunks without a metadata reference older than ~7 days. Billed until swept. |
| Listing inconsistency (historical) | Freshly-PUT key missing from LIST | Resolved by S3's 2020 change: serve listings directly from metadata shards, not from an async projection. |
| Slow read (tail latency) | One chunk server slow; whole GET slow | Hedged reads — request from 7 shards in parallel, return when first 6 succeed. |
| Bit rot on disk | Silent data corruption | Per-chunk checksums verified on every read; scrubber re-reads cold data on a multi-week cycle and rebuilds any chunk whose checksum fails. |
8 · Cost, operability, and trade-offs
Storage is the cost line that dominates everything else. Request and bandwidth costs matter but are usually secondary; the spreadsheet for an object store is mostly $/GB-month × total bytes.
| Line | Estimate | Note |
|---|---|---|
| Hot-tier storage | $0.023 / GB-month | S3 Standard price. At 100 PB raw → ~$2.3M / month before erasure overhead. |
| Cold-tier storage | $0.001 – $0.004 / GB-month | Glacier / Deep Archive — 5–10× cheaper. Retrieval takes minutes to hours. |
| PUT / GET requests | $0.005 / 1K PUT, $0.0004 / 1K GET | At 100K PUT/sec → ~$1.3M / month. Real but smaller than storage. |
| Bandwidth out | $0.05 – $0.09 / GB | The famous egress markup. The "bandwidth pricing vs storage pricing" dance is a known cost optimisation — keep heavy traffic in-region. |
Trade-offs and what's next
- Strong vs eventual consistency. S3 went strong in 2020 by changing how listings are built — metadata shards serve lists directly. The cost is more load on the metadata plane; the payoff is that read-after-write applies to listings too.
- Cross-region replication. Active-passive (async, RPO ≈ 15 min) is straightforward — a background pipe replicates new versions to a peer region. RPO depends on the pipe's throughput vs the write rate.
- Multi-region active-active writes. Much harder than read replicas. Concurrent writes to the same key need conflict resolution — last-writer-wins on timestamp is the common compromise, with the honest caveat that clock skew loses writes.
- Conditional writes. S3 added
If-Match/If-None-Matchin 2024 for optimistic concurrency. Lets clients do compare-and-swap without an external lock service. Closes a long-standing gap for databases-on-object-storage. - Erasure scheme as a tier. Hot data on 6+3 (low repair latency); archival on 10+4 (better overhead, slower repair). Picking one scheme for all tiers leaves money on the table.
Defending the decisions — interview pushback
Six choices an interviewer will probe. Have a one-sentence answer ready for each — the depth of these defenses, more than the diagram itself, is what separates a "passing" from a "strong-hire" signal.
- "Why separate metadata and data planes?" Different SLOs, different access patterns, different scaling stories. Metadata is small, indexed, needs atomic updates. Data is large, sequential, immutable once written. Mixing them means you compromise both.
- "Why erasure coding instead of 3× replication?" At petabyte scale, 3× means 200% overhead. A 6+3 erasure code costs ~50%. The math is straightforward once seen; at exabyte scale it's not optional.
- "Why pre-signed URLs?" Application servers should never proxy bytes for large objects — it pins connections, double-pays for bandwidth, adds latency. The server mints a signed URL; the client uploads directly to storage.
- "Why two-phase write — data first, metadata second?" Data first is the recoverable direction. Metadata first plus a data failure leaves a tombstone pointing nowhere. Data first plus a metadata failure leaves orphan blobs that GC reclaims. Always write the hard-to-recover thing last.
- "Why a continuous repair scanner?" Disk failures happen constantly at fleet scale. The scanner reconstitutes missing shards from surviving parity. If repair rate falls below failure rate, durability silently degrades — and you don't notice until it's too late.
- "Why lifecycle policies as first-class?" Most data goes cold after 30 days. Automatic tiering to cheaper storage is a 5–10× cost reduction per byte — a feature you build in, not an afterthought.
Further reading
- AWS — "Diving Deep on S3 Consistency" (Andy Warfield, re:Invent 2021). The public account of how S3 went from eventual to strong listings without an outage. The architectural detail is worth the talk's full hour.
- Facebook — "f4: Facebook's Warm BLOB Storage System" (OSDI 2014). Reed-Solomon erasure coding for a photo-store workload. Pairs nicely with the older Haystack paper for hot/warm/cold tiering.
- Google — "Colossus: Successor to GFS". The metadata/data split, the chunk server model, and how erasure coding replaced 3× replication.
- Backblaze — "Reed-Solomon erasure coding from scratch". A walkthrough of the math with working Java. The most accessible introduction to RS that exists.
- James Hamilton — "On Designing and Deploying Internet-Scale Services". Not object-storage-specific, but the operational guidance ("expect failure", "design for many small components") shaped how S3 was built.
- Adjacent: File storage design handbook. The block / file / object spectrum and when each fits.
- Adjacent: How CDNs work. The layer that fronts an object store for public-read workloads.