14 / 19

Playbook / 14

Object storage

An S3-shape blob store. Buckets and keys on the outside; a metadata plane and a data plane on the inside. The interesting parts are how durability is bought (erasure coding vs replication), how a 5 TB upload is delivered in one logical PUT (multipart), and how listings became strongly consistent in 2020 without an outage. The shape generalises: photo storage, log archives, container registries, ML training corpora — most large blob systems end up here.

1 · Clarifying questions

Five minutes. Durability targets and object size range matter most — they drive the erasure scheme, the chunk size, and the metadata footprint.

Durability target?	Eleven nines (99.999999999%) is the S3 marketing number. Nine nines is achievable with cheaper schemes and is honest for most workloads — confirm.
Availability target?	99.99% for reads, 99.9% for writes is typical. Writes are tighter to justify because they require a quorum of new shards.
Object size range?	1 byte to 5 TB. The range matters: tiny objects burn metadata; huge objects need multipart and parallel transfer.
Read / write ratio?	10:1 typical. 50:1 for media-heavy workloads (video, training data). Affects how aggressively we cache the metadata plane.
Listing semantics?	S3 went from eventual to strong in 2020. Strong is the right default now; eventual lets us cache list responses for longer.
Multi-region replication?	Active-passive is the common default (asynchronous, RPO ≈ 15 minutes). Active-active is much harder and rarely needed.
Versioning?	Yes — overwrites become new versions, deletes are tombstones. Costs storage but fixes the "I overwrote prod by accident" class of bug.
Access patterns?	Hot tier (frequent reads), warm tier, cold tier (Glacier-equivalent, minutes-to-hours retrieval). Drives the placement and pricing model.

2 · Capacity math, on a napkin

S3-scale public numbers. They sound enormous but they decompose into well-understood subsystems — a sharded KV for metadata, a fleet of chunk servers for bytes.

Number	Calculation	Result
Total objects	given (S3-scale)	100B
Average object size	given	1 MB
Raw bytes	100B × 1 MB	100 PB
With erasure overhead	× 1.5 (6+3 Reed-Solomon)	150 PB
PUT QPS (sustained)	given	100K / sec
GET QPS (peak)	given	1M / sec
Metadata rows	one per object	100B
Metadata size / row	bucket + key + chunk map + ACL	~1 KB
Metadata total	100B × 1 KB	~100 TB
Metadata shards	100 TB / 1 TB per shard	~100 shards (round up for headroom)
Chunk servers	150 PB / 100 TB per node	~1,500 nodes (then × replication factor for the cluster)

100 TB of metadata sounds large but it's tractable in a sharded KV. The bytes are the expensive line — storage cost dominates the bill, not request cost.

3 · API and data model

S3-compatible on the outside. Inside, the metadata plane and the data plane are separate services with separate scaling stories.

Endpoints

PUT  /:bucket/:key                  # single-shot upload (≤ 5 GB)
     body: object bytes
     headers: Content-Length, Content-Type, x-amz-meta-*
  → 200 {"etag":"...", "version":"..."}

POST /:bucket/:key?uploads          # initiate multipart
  → 200 {"upload_id":"..."}

PUT  /:bucket/:key?partNumber=N&uploadId=...   # upload a part (5 MB – 5 GB)
  → 200 {"etag":"part-etag"}

POST /:bucket/:key?uploadId=...     # complete (list of part etags)
  → 200 {"etag":"final", "version":"..."}

GET  /:bucket/:key                  # download (supports Range)
  → 200 body

DELETE /:bucket/:key                # tombstone (or new delete-marker version)
  → 204

GET  /:bucket?prefix=&marker=&max-keys=
  → 200 list of keys, strongly consistent

Schema (metadata plane, sharded KV)

object_metadata
  bucket        STRING
  key           STRING              -- arbitrary UTF-8, up to 1024 bytes
  version       STRING              -- monotonic per (bucket, key)
  size          BIGINT
  etag          STRING              -- MD5 (single-shot) or composite (multipart)
  content_type  STRING
  chunk_map     LIST<chunk_ref>     -- ordered list of chunk IDs + offsets
  acl           STRING              -- inline if small, else reference
  created_at    TIMESTAMP
  is_delete_marker BOOLEAN
  storage_class STRING              -- HOT / WARM / COLD

  PRIMARY KEY ((bucket, key), version DESC)
  shard by hash(bucket, key)

chunk_locations
  chunk_id      STRING (UUID)
  shards        LIST<shard_ref>     -- 9 entries for 6+3, with rack + AZ
  size          INT
  PRIMARY KEY (chunk_id)
  shard by hash(chunk_id)

Two tables, two sharding keys. object_metadata answers "what chunks make up this object". chunk_locations answers "where on disk does this chunk live". The split lets the placement service rebalance chunks without touching object metadata.

4 · High-level architecture

Two planes that share an API gateway. The metadata plane is small, hot, and sharded. The data plane is large, cold(er), and parallel.

Data flows separately from metadata. The gateway authenticates; the router hashes; the metadata service hands back chunk locations; the client (or the gateway, for small objects) streams bytes directly to chunk servers. The repair scanner runs continuously in the background, rebuilding shards from parity wherever it finds gaps.

5 · Durability — erasure coding and placement

How eleven nines is actually paid for. Triple replication is the simple answer; erasure coding is the answer at scale.

Scheme	Overhead	Failure tolerance	Note
3× replication	200% (3 GB for 1 GB)	any 2 of 3 copies	Simple. Used by HDFS by default. Cheap on small objects.
Reed-Solomon 6+3	50% (1.5 GB for 1 GB)	any 3 of 9 shards	Each chunk → 6 data shards + 3 parity. The S3-class default.
Reed-Solomon 10+4	40%	any 4 of 14	Better overhead, worse repair throughput. Used for archival tiers.

The recommendation: Reed-Solomon 6+3 across 3 AZs. Each chunk becomes 9 shards — 6 data, 3 parity — placed 3 per AZ. The system tolerates the loss of any 3 shards simultaneously, which covers a disk, a rack, or an entire AZ. Storage overhead is 50% versus 200% for replication; for a 100 PB store that is a 150 PB bill rather than 300 PB.

Write ordering matters. Data shards land first; metadata updates second. The reverse — metadata first, then data — leaves tombstones pointing at bytes that may never exist, which presents to the client as "the object is there but the read 500s". Data-first leaves at worst orphaned chunks, which the GC scanner cleans up. Orphans are recoverable; missing data is not.

Read-after-write. The metadata plane is the source of truth for "does this object exist". Once the metadata commit returns, any subsequent GET on the same key sees the new object — that's the read-after-write guarantee. Listings were the historical weak point: until 2020, S3 served list responses from an eventually-consistent index and a freshly-PUT key could be missing from LIST for seconds. Strong listings require the metadata shards to serve the list directly, not an asynchronous projection.

6 · Multipart upload

A 5 TB object cannot be a single HTTP request — too much sits in flight, retries are catastrophic, and no client survives a mid-flight TCP drop on a 12-hour upload. Multipart breaks the upload into independently-uploaded parts that the server stitches together at the end.

Initiate. Client POSTs to get an upload_id. Server records "in-progress upload" in the metadata plane.
Upload parts in parallel. Each part is 5 MB – 5 GB. Parts can upload in any order, can be retried independently, and each returns an etag. Up to 10,000 parts per upload.
Complete. Client sends the list of part numbers + etags. Server validates each part exists, stitches the chunk_map, and writes the final metadata in one commit. The composite etag is the MD5 of the concatenated part etags (which is why S3 etags for multipart objects aren't the file's MD5).
Abort. Failed uploads can be explicitly aborted, or the GC scanner sweeps parts older than ~7 days. Without GC, abandoned multipart uploads accumulate forever as billed dark storage.

7 · Failure modes & runbook

The operational story. Most of these are continuous background processes rather than page-oncall events — the system is designed to absorb constant low-level failure.

Failure	Symptom	Mitigation
Disk failure (continuous)	Shards on that disk unavailable	Repair scanner detects, rebuilds missing shards from parity onto a fresh disk. Continuous; no page unless rebuild rate falls behind failure rate.
Rack failure	~40 disks gone at once	Placement spreads shards across racks; no single chunk loses more than ~1 shard per rack. Repair runs at higher priority.
AZ failure	1/3 of shards offline	6+3 across 3 AZs tolerates losing one AZ entirely. Reads served from the surviving 6 shards (still enough to reconstruct).
Hot prefix	One bucket prefix takes disproportionate metadata traffic	Recommend randomised prefixes in client SDKs; shard the prefix index further; cache list responses at the gateway.
Orphaned chunks	Multipart upload abandoned mid-flight	GC scanner sweeps chunks without a metadata reference older than ~7 days. Billed until swept.
Listing inconsistency (historical)	Freshly-PUT key missing from LIST	Resolved by S3's 2020 change: serve listings directly from metadata shards, not from an async projection.
Slow read (tail latency)	One chunk server slow; whole GET slow	Hedged reads — request from 7 shards in parallel, return when first 6 succeed.
Bit rot on disk	Silent data corruption	Per-chunk checksums verified on every read; scrubber re-reads cold data on a multi-week cycle and rebuilds any chunk whose checksum fails.

8 · Cost, operability, and trade-offs

Storage is the cost line that dominates everything else. Request and bandwidth costs matter but are usually secondary; the spreadsheet for an object store is mostly $/GB-month × total bytes.

Line	Estimate	Note
Hot-tier storage	$0.023 / GB-month	S3 Standard price. At 100 PB raw → ~$2.3M / month before erasure overhead.
Cold-tier storage	$0.001 – $0.004 / GB-month	Glacier / Deep Archive — 5–10× cheaper. Retrieval takes minutes to hours.
PUT / GET requests	$0.005 / 1K PUT, $0.0004 / 1K GET	At 100K PUT/sec → ~$1.3M / month. Real but smaller than storage.
Bandwidth out	$0.05 – $0.09 / GB	The famous egress markup. The "bandwidth pricing vs storage pricing" dance is a known cost optimisation — keep heavy traffic in-region.

Trade-offs and what's next

Strong vs eventual consistency. S3 went strong in 2020 by changing how listings are built — metadata shards serve lists directly. The cost is more load on the metadata plane; the payoff is that read-after-write applies to listings too.
Cross-region replication. Active-passive (async, RPO ≈ 15 min) is straightforward — a background pipe replicates new versions to a peer region. RPO depends on the pipe's throughput vs the write rate.
Multi-region active-active writes. Much harder than read replicas. Concurrent writes to the same key need conflict resolution — last-writer-wins on timestamp is the common compromise, with the honest caveat that clock skew loses writes.
Conditional writes. S3 added If-Match / If-None-Match in 2024 for optimistic concurrency. Lets clients do compare-and-swap without an external lock service. Closes a long-standing gap for databases-on-object-storage.
Erasure scheme as a tier. Hot data on 6+3 (low repair latency); archival on 10+4 (better overhead, slower repair). Picking one scheme for all tiers leaves money on the table.

Defending the decisions — interview pushback

Six choices an interviewer will probe. Have a one-sentence answer ready for each — the depth of these defenses, more than the diagram itself, is what separates a "passing" from a "strong-hire" signal.

"Why separate metadata and data planes?" Different SLOs, different access patterns, different scaling stories. Metadata is small, indexed, needs atomic updates. Data is large, sequential, immutable once written. Mixing them means you compromise both.
"Why erasure coding instead of 3× replication?" At petabyte scale, 3× means 200% overhead. A 6+3 erasure code costs ~50%. The math is straightforward once seen; at exabyte scale it's not optional.
"Why pre-signed URLs?" Application servers should never proxy bytes for large objects — it pins connections, double-pays for bandwidth, adds latency. The server mints a signed URL; the client uploads directly to storage.
"Why two-phase write — data first, metadata second?" Data first is the recoverable direction. Metadata first plus a data failure leaves a tombstone pointing nowhere. Data first plus a metadata failure leaves orphan blobs that GC reclaims. Always write the hard-to-recover thing last.
"Why a continuous repair scanner?" Disk failures happen constantly at fleet scale. The scanner reconstitutes missing shards from surviving parity. If repair rate falls below failure rate, durability silently degrades — and you don't notice until it's too late.
"Why lifecycle policies as first-class?" Most data goes cold after 30 days. Automatic tiering to cheaper storage is a 5–10× cost reduction per byte — a feature you build in, not an afterthought.

Object storage

1 · Clarifying questions

2 · Capacity math, on a napkin

3 · API and data model

Endpoints

Schema (metadata plane, sharded KV)

4 · High-level architecture

5 · Durability — erasure coding and placement

6 · Multipart upload

7 · Failure modes & runbook

8 · Cost, operability, and trade-offs

Trade-offs and what's next

Defending the decisions — interview pushback

Further reading

Twitter