S3, under the API.
S3 is the closest the cloud comes to a primitive of physics. It is a global flat-key store with eleven nines of durability, strong read-after-write consistency, and a request budget per key prefix that you can shard your way past. The API is tiny on purpose; the interesting things — the partitioner, the signed URL, the lifecycle evaluator, the analytics stack that grew on top — are what make it the data substrate underneath most of the public internet.
1 · What S3 actually is (and isn't)
The mental model that survives every conversation: S3 is a global hashtable mapping (bucket, key) → bytes. Bucket names are globally unique. Keys are arbitrary strings up to 1024 bytes. Values are byte blobs from 0 B to 5 TB (zero-byte objects are legal). The console pretends slashes are directories, but the namespace is flat: 2026/01/01/event.json is one key, not a path through a tree.
What S3 isn't: a filesystem. There is no rename (only copy-and-delete, which is two billed operations and isn't atomic from an observer's perspective). There is no append (PUT overwrites, never extends). There is no in-place edit (any byte you change replaces the whole object). There are no links. Treating S3 like an NFS share is the most common source of "why is my code slow / expensive?" puzzlement when teams first reach for it.
What S3 is also: two distinct surfaces with different scaling properties and pricing. The data plane handles GET/PUT/DELETE/HEAD on objects; this is the part that, by AWS's public figures, averages well over a hundred million requests a second. The control plane handles CreateBucket, PutBucketPolicy, PutLifecycleConfiguration: orders of magnitude lower throughput, far stricter rate limits, and outside the data plane's strong-consistency guarantees (configuration changes can take minutes to propagate).
| S3 is good for | S3 is bad for |
|---|---|
| Immutable blobs (images, video, logs, parquet, backups) | Anything you mutate frequently in place |
| Write-once / read-many access patterns | Lock-based concurrent writers (no native locking primitive — use DynamoDB or a real DB) |
| Large-object analytics (Parquet via Athena, Snowflake external tables) | Per-object <1KB hot reads at millions/sec (use DynamoDB or ElastiCache) |
| Storing data you want to retain for years cheaply | Mounting as a POSIX filesystem (s3fs and friends are slow and lie about consistency for metadata) |
| Cross-account / cross-org sharing via presigned URLs | Workflows that need atomic multi-object commits (use a database; or simulate via a manifest file) |
2 · How S3 is built — the public sketch
AWS doesn't fully publish S3's internals, but Werner Vogels' re:Invent talks, the SOSP 2021 paper on ShardStore (S3's storage engine), and various AWS Builders sessions give enough to draw a picture. The shape is consistent across the public material:
A PUT travels through a stateless front-end fleet that authenticates and routes, a keymap / index layer sharded by key-prefix range that maps (bucket, key) to physical placement, and a storage layer (ShardStore) that erasure-codes the bytes across many disks in multiple AZs. Background services do continuous integrity scrubs (S3 reads stored objects periodically and verifies checksums), re-replication when a disk dies, and the lifecycle evaluator that moves objects between storage classes. The control plane runs separately on a different fleet.
Two practical consequences fall out of this shape: writes are durable as soon as enough AZ-spread copies have been written (that's the eleven-nines number), and read/write throughput per key prefix is bounded by the index shard that prefix happens to live on — which is the thing the next section is about.
3 · Durability — what eleven nines actually means
S3 Standard advertises 99.999999999% annual object durability. Translated: if you store 10 million objects, the expected loss is one object every 10,000 years. The machinery behind that number:
- Synchronous multi-AZ writes. A PUT to S3 Standard isn't acknowledged until enough redundant copies / erasure-coded shards have been durably placed across multiple AZs. The exact replication factor isn't published, but the design point is "tolerate concurrent failure of an entire AZ plus additional disks elsewhere."
- Erasure coding. Larger objects aren't fully replicated — they're broken into shards with parity, so the storage overhead is ~1.5× rather than 3×. Any subset of shards above a threshold can reconstruct the object; surplus shards survive multiple disk losses.
- Continuous checksum verification. S3 scrubs objects in the background: it reads them and verifies the stored MD5/SHA-256 against a freshly computed hash. Any divergence triggers re-replication from a known-good copy.
- End-to-end checksums on the wire. SDK clients can attach x-amz-content-sha256 on PUT; S3 verifies before ack. Bit-flips between client and storage aren't acked as success.
| Storage class | Durability | AZ count | Notes |
|---|---|---|---|
| S3 Standard / IA / Glacier IR | 99.999999999% | ≥3 | Eleven nines, default for new buckets. |
| S3 One Zone-IA / Express One Zone | 99.999999999% | 1 | Eleven nines within one AZ — annihilated if that AZ disappears. Use only for re-creatable data. |
| S3 Glacier Flexible / Deep Archive | 99.999999999% | ≥3 | Same durability, slower retrieval. |
| RRS (deprecated) | 99.99% | ≥3 | "Reduced redundancy" — discontinued for new buckets in 2019, no reason to pick it now. |
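The eleven-nines arithmetic can be checked directly. This is a minimal sketch, assuming the advertised figure can be read as an independent annual loss probability of 1e-11 per object (a simplification: AWS publishes a design target, not a failure model). The function names are illustrative, not part of any AWS API.

```python
from fractions import Fraction

# Assumption: 99.999999999% durability == 1e-11 annual loss probability per object.
ANNUAL_LOSS_PROBABILITY = Fraction(1, 10**11)


def expected_losses_per_year(object_count: int) -> Fraction:
    """Expected number of objects lost per year across the whole set."""
    return object_count * ANNUAL_LOSS_PROBABILITY


def years_per_expected_loss(object_count: int) -> Fraction:
    """Average number of years between single-object losses."""
    return 1 / expected_losses_per_year(object_count)


if __name__ == "__main__":
    # Ten million objects: one expected loss every 10,000 years.
    print(years_per_expected_loss(10_000_000))  # 10000
    # A trillion objects: about ten expected losses per year.
    print(float(expected_losses_per_year(10**12)))  # 10.0
```

Exact fractions are used instead of floats so the result does not pick up rounding noise from subtracting numbers that differ in the eleventh decimal place.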
4 · Strong consistency, after 2020
S3 has provided strong read-after-write consistency for all new objects and overwrites since December 2020. A successful PUT means subsequent GETs return the new bytes; subsequent LISTs include the key. There is no eventual-consistency window. Engineers who learned S3 in the 2010s sometimes still avoid read-after-write patterns; you can stop avoiding them.
What's still eventually consistent: cross-region replication (CRR — propagation is best-effort within seconds, but not synchronous), bucket-level configuration changes (lifecycle, policy — can take minutes to apply), and the daily lifecycle evaluator (rules don't run in real time).
5 · The PUT lifecycle, end to end
A single PUT looks like one HTTP request but kicks off a small sequence under the hood:
Three things are worth noticing. First, the durability acknowledgement happens before the index commit — you can't have a "committed key" pointing at storage that hasn't been written. Second, the front-end is stateless; any replica can serve the next request. Third, after step 7, both LIST and GET in any region see the new object (strong consistency). The background scrubs and any CRR propagation are entirely separate from the user-visible PUT path.
6 · The per-prefix request-rate ceiling
S3 scales horizontally by sharding the index on key prefix. Each prefix supports at least 3,500 PUT/COPY/POST/DELETE per second and 5,500 GET/HEAD per second. There's no published cap on the number of prefixes per bucket, so the practical ceiling is "however many prefixes you want, times those numbers."
Old advice (pre-2018): prepend a random hex prefix to every key so writes spread across many shards from day one. Modern advice: structure keys with the high-cardinality field early and let S3's auto-partitioner do the rest. When a prefix range gets hot, S3 splits it into multiple shards on its own. The split is gradual rather than instantaneous, so a sudden spike can see 503 Slow Down responses until it catches up, but you usually don't need to randomise.
| Workload | Bad layout | Better layout |
|---|---|---|
| Event log, 100k writes/sec | events/2026/01/01/000… (everything pinned to one prefix) | events/shard={0..31}/2026/01/01/000… (32 independent shards) |
| User uploads, 10k writes/sec across users | uploads/2026/01/01/{userId} (time-clumped) | uploads/{userId-hash-prefix}/2026/01/01/ (spreads across hash space) |
| Daily ETL output, 1k req/sec | Date-first is fine, well under ceiling | No change needed |
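The shard count in the event-log row follows from the per-prefix ceiling. The sketch below works that out and shows one way to assign a key to a shard; the hashing scheme and the names `prefixes_needed` and `sharded_key` are assumptions for illustration, not something S3 prescribes.

```python
import hashlib
import math

# Per-prefix floors quoted in the text.
PUTS_PER_PREFIX = 3_500  # PUT/COPY/POST/DELETE per second
GETS_PER_PREFIX = 5_500  # GET/HEAD per second


def prefixes_needed(requests_per_sec: int, per_prefix: int) -> int:
    """Smallest number of prefixes whose combined floor covers the load."""
    return math.ceil(requests_per_sec / per_prefix)


def round_up_to_power_of_two(n: int) -> int:
    """Round a positive integer up to the next power of two."""
    return 1 << (n - 1).bit_length()


def sharded_key(event_id: str, date_path: str, shards: int) -> str:
    """Pick a stable shard from a hash of the event id (hypothetical layout)."""
    digest = hashlib.sha256(event_id.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:4], "big") % shards
    return f"events/shard={shard}/{date_path}/{event_id}"


if __name__ == "__main__":
    needed = prefixes_needed(100_000, PUTS_PER_PREFIX)
    shards = round_up_to_power_of_two(needed)
    print(needed, shards)  # 29 32
    print(sharded_key("evt-000123", "2026/01/01", shards))
```

Rounding 29 up to 32 leaves some headroom and keeps the shard count a power of two, which is a convenience rather than a requirement.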
Alarm on the CloudWatch metric 5xxErrors per bucket so sustained throttling shows up; that's the signal to reshape prefixes before the SDK retry budget runs out.
7 · Multipart upload — the mechanics
Required for objects over 5 GB; recommended for anything over 100 MB. Three calls — CreateMultipartUpload, N × UploadPart (concurrent), CompleteMultipartUpload — let the client parallelise the upload across many TCP connections, resume after failures without restarting from byte zero, and stream data larger than the 5 GB single-PUT limit (up to 5 TB total).
Parts must be at least 5 MB except the last (no minimum). Up to 10,000 parts per upload. The client is responsible for tracking part numbers and their ETags and submitting them in order in the Complete call. If the client crashes between UploadPart and Complete, the parts sit in S3, billed as storage, until cleaned up.
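The part-size rules translate into a small planning function. This is a sketch only: it treats the text's MB/TB as binary units (an assumption), the 16 MiB default mirrors the chunk size used in the walkthrough later on, and `plan_parts` is a hypothetical helper rather than an SDK call. It does not track ETags, which a real client must collect from each UploadPart response.

```python
import math

MIB = 1024 ** 2
MIN_PART_SIZE = 5 * MIB  # applies to every part except the last
MAX_PARTS = 10_000


def plan_parts(object_size: int, preferred_part_size: int = 16 * MIB):
    """Return a list of (part_number, start_byte, length) tuples."""
    if object_size <= 0:
        raise ValueError("object_size must be positive")
    # Grow the part size if the preferred one would need more than 10,000 parts.
    part_size = max(
        preferred_part_size,
        MIN_PART_SIZE,
        math.ceil(object_size / MAX_PARTS),
    )
    parts = []
    offset = 0
    number = 1
    while offset < object_size:
        length = min(part_size, object_size - offset)
        parts.append((number, offset, length))
        offset += length
        number += 1
    return parts


if __name__ == "__main__":
    parts = plan_parts(200 * MIB)
    print(len(parts))            # 13
    print(parts[-1][2] // MIB)   # 8  (the short final part)
```

Each tuple maps to one UploadPart call that can run concurrently with the others; only the final part is allowed to fall below the minimum.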
A lifecycle rule with AbortIncompleteMultipartUpload: DaysAfterInitiation: 7 should be in your CDK/Terraform module from day one. AWS Cost Anomaly Detection has flagged this for thousands of accounts; check aws s3api list-multipart-uploads --bucket $B if you've never looked.
8 · Presigned URLs — how signing the URL works
A presigned URL is an S3 endpoint URL with a SigV4 signature attached as query parameters that grants temporary, scoped access to one specific operation on one specific object. The signature is computed from your credentials, the HTTP verb, the canonical request, and a TTL; S3's front-end verifies the signature on the way in. The URL itself is the credential: anyone holding it can perform the signed operation until it expires.
The most common shapes:
- Pre-signed GET. Server signs a GET URL for s3://bucket/private-file with a 5-minute expiry, returns it to the browser. Browser does the GET directly, S3 streams the bytes back. No server is in the data path.
- Pre-signed PUT (browser uploads). Server signs a PUT URL for s3://bucket/uploads/uuid. Browser does the PUT directly. This is the pattern behind every "drag a file into the page and it just uploads": Vercel image uploads, Cloudflare R2 dashboards, Figma, Notion, anything that handles large files in a web app.
- Pre-signed POST (form upload). Different mechanism: generates a form-encoded policy that bounds key, content-type, max size. Used when the client is a strict HTML form rather than JavaScript.
TTL pitfalls are real: a URL signed with STS-derived credentials (e.g., from a Lambda's execution role) is valid only for the remainder of the session token's life. A 7-day presigned URL signed by a Lambda role whose session lasts 1 hour will be unusable after 1 hour, no matter what TTL you asked for. Sign long-lived URLs from a long-lived principal (IAM user, not a role), up to the SigV4 maximum of 7 days, or accept the actual ceiling.
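To make the signing step concrete, here is a minimal standard-library sketch of a SigV4 query-string signature for a GET. It assumes long-lived credentials (a session token would add an X-Amz-Security-Token parameter, omitted here), signs only the host header, and uses AWS's documentation placeholder keys. `presign_get` is an illustrative name; production code would normally call an SDK helper such as boto3's `generate_presigned_url` instead.

```python
import hashlib
import hmac
from datetime import datetime, timezone
from urllib.parse import quote

MAX_EXPIRES = 7 * 24 * 3600  # SigV4 presigned URLs are capped at 7 days


def _hmac_sha256(key: bytes, message: str) -> bytes:
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).digest()


def presign_get(host, key, access_key, secret_key, region, expires, now):
    """Build a presigned GET URL; `now` is a UTC datetime."""
    if not 1 <= expires <= MAX_EXPIRES:
        raise ValueError("expires must be between 1 second and 7 days")
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    datestamp = now.strftime("%Y%m%d")
    scope = f"{datestamp}/{region}/s3/aws4_request"
    params = {
        "X-Amz-Algorithm": "AWS4-HMAC-SHA256",
        "X-Amz-Credential": f"{access_key}/{scope}",
        "X-Amz-Date": amz_date,
        "X-Amz-Expires": str(expires),
        "X-Amz-SignedHeaders": "host",
    }
    # Canonical query string: URI-encoded, sorted by parameter name.
    canonical_query = "&".join(
        f"{quote(k, safe='')}={quote(v, safe='')}" for k, v in sorted(params.items())
    )
    canonical_uri = "/" + quote(key, safe="/")
    canonical_request = "\n".join([
        "GET",
        canonical_uri,
        canonical_query,
        f"host:{host}\n",  # canonical headers block ends with a newline
        "host",            # signed headers
        "UNSIGNED-PAYLOAD",
    ])
    string_to_sign = "\n".join([
        "AWS4-HMAC-SHA256",
        amz_date,
        scope,
        hashlib.sha256(canonical_request.encode("utf-8")).hexdigest(),
    ])
    # Derive the signing key: date -> region -> service -> "aws4_request".
    signing_key = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), datestamp)
    for part in (region, "s3", "aws4_request"):
        signing_key = _hmac_sha256(signing_key, part)
    signature = hmac.new(
        signing_key, string_to_sign.encode("utf-8"), hashlib.sha256
    ).hexdigest()
    return f"https://{host}{canonical_uri}?{canonical_query}&X-Amz-Signature={signature}"


if __name__ == "__main__":
    print(presign_get(
        host="examplebucket.s3.us-east-1.amazonaws.com",  # hypothetical bucket
        key="private-file",
        access_key="AKIAIOSFODNN7EXAMPLE",                # AWS docs placeholder
        secret_key="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
        region="us-east-1",
        expires=300,
        now=datetime.now(timezone.utc),
    ))
```

Nothing in the URL is encrypted: the access key ID, scope, and expiry are all readable, and only the HMAC ties them to the secret. That is why the URL has to be treated as a bearer credential.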
9 · Storage classes — pick by access pattern
| Class | $/GB/mo | Retrieval | Min duration | Reach for it when |
|---|---|---|---|---|
| Standard | $0.023 | Instant, free | None | Hot reads, anything < 30 days old |
| Intelligent-Tiering | $0.023 + monitoring | Instant | None | You don't know the access pattern; user uploads, app data |
| Standard-IA | $0.0125 | Instant, $0.01/GB | 30 days | Cooler hot data — backups, logs older than a month |
| One Zone-IA | $0.01 | Instant, $0.01/GB | 30 days | Re-creatable data only (caches, secondary replicas) |
| Glacier Instant Retrieval | $0.004 | Instant, $0.03/GB | 90 days | Archives you occasionally need now (compliance audit on demand) |
| Glacier Flexible | $0.0036 | Min–hours | 90 days | Backups for DR; quarterly review of logs |
| Glacier Deep Archive | $0.00099 | 12 hours | 180 days | Tape-replacement; 7-year retention |
The traps are in the corner cases. Standard-IA charges per-GB retrieval: copying 1 TB out of IA costs $10 in addition to standard egress. Standard-IA, One Zone-IA, and Glacier Instant Retrieval have a 128 KB minimum billable size per object, so a million 1 KB objects in Glacier Instant Retrieval bill as if they were 128 GB. Glacier Flexible and Deep Archive instead add roughly 40 KB of billed metadata overhead per object. Either way, aggregate small objects (e.g., tarball them) before archiving.
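Both traps are simple arithmetic. The sketch below uses the prices from the table and assumes a 128 KB minimum billable size applies to the class in question; it uses decimal units (1 GB = 10^9 bytes) to keep the numbers round, which is a simplification of how AWS meters storage. The function names are illustrative.

```python
KB = 1_000
GB = 1_000_000_000


def billable_gb(object_count, object_size_bytes, min_billable_bytes=128 * KB):
    """GB billed when every object is rounded up to the minimum billable size."""
    billed_per_object = max(object_size_bytes, min_billable_bytes)
    return object_count * billed_per_object / GB


def monthly_storage_cost(object_count, object_size_bytes, price_per_gb_month,
                         min_billable_bytes=128 * KB):
    """Monthly storage charge in dollars for a set of equally sized objects."""
    return billable_gb(object_count, object_size_bytes, min_billable_bytes) * price_per_gb_month


def retrieval_cost(gb_retrieved, price_per_gb):
    """Per-GB retrieval charge, before any egress."""
    return gb_retrieved * price_per_gb


if __name__ == "__main__":
    # A million 1 KB objects: 1 GB of real data, 128 GB on the bill.
    print(billable_gb(1_000_000, 1 * KB))  # 128.0
    # Same objects at the Glacier Instant Retrieval price from the table.
    print(round(monthly_storage_cost(1_000_000, 1 * KB, 0.004), 3))  # 0.512
    # Reading 1 TB back out of Standard-IA at $0.01/GB.
    print(round(retrieval_cost(1_000, 0.01), 2))  # 10.0
```

The absolute dollar amounts are small at this scale; the point is the 128× multiplier, which carries over unchanged to billions of objects.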
10 · Lifecycle, replication, event notifications
Lifecycle rules move objects between storage classes or expire them, evaluated by a daily background job — not in real time. A new rule can take 24–48 hours to first apply on existing objects. The canonical retention pattern: Standard → Standard-IA at 30 days → Glacier Flexible at 90 → expire at 365. Set this once on every bucket and the storage bill stops growing linearly with the retention window.
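A rough steady-state comparison shows why the tiering pattern flattens the bill. This sketch assumes a constant ingest of 1 GB per day, the per-GB prices from the storage class table, and ignores transition request fees, minimum-duration charges, and retrieval costs; the tier boundaries are the 30/90/365-day pattern described above.

```python
# (start_day, end_day, storage class, $/GB-month); prices from the table above.
TIERS = [
    (0, 30, "STANDARD", 0.023),
    (30, 90, "STANDARD_IA", 0.0125),
    (90, 365, "GLACIER", 0.0036),
]
STANDARD_PRICE = 0.023


def class_for_age(age_days):
    """Storage class an object of this age sits in, or None once expired."""
    for start, end, name, _price in TIERS:
        if start <= age_days < end:
            return name
    return None


def tiered_monthly_cost(gb_per_day):
    """Steady-state monthly bill: each tier holds (end - start) days of ingest."""
    return sum((end - start) * gb_per_day * price for start, end, _name, price in TIERS)


def flat_monthly_cost(gb_per_day, retention_days=365):
    """Same retention window with everything left in Standard."""
    return retention_days * gb_per_day * STANDARD_PRICE


if __name__ == "__main__":
    print(class_for_age(45))                 # STANDARD_IA
    print(round(tiered_monthly_cost(1), 3))  # 2.43
    print(round(flat_monthly_cost(1), 3))    # 8.395
```

Under these assumptions the tiered layout costs well under a third of the flat one, and extending retention mostly adds days at the cheapest rate.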
Cross-Region Replication (CRR) is asynchronous propagation of new objects to a bucket in another region. Used for DR, latency reduction, and compliance separation. Doesn't touch existing objects; for those, kick off an S3 Batch Replication job. Replication time is typically seconds for warm objects, minutes for cold. There's also SRR (same-region) for cross-account aggregation.
Event notifications turn S3 into the event bus underneath much of AWS. On object create / delete / restore / replication-failure, S3 publishes to a target — SNS (fan out to many subscribers), SQS (buffered, durable consumer), Lambda (run code), or EventBridge (rule-based routing across services). The "upload an image, Lambda thumbnails it" recipe is the canonical example; every analytics ingest pipeline starts here.
| Notification target | Best for | Watch out for |
|---|---|---|
| Lambda (direct) | Transform-on-upload (thumbnails, virus scan, metadata extraction) | Lambda concurrency = your S3 ingest rate. Lambda failures need a DLQ. |
| SQS | Decoupled batch processing, durable retry | FIFO is rarely needed; standard queue is the default. |
| SNS → fan-out | Multiple subscribers (Lambda + SQS + email) | Add filter policies to keep subscribers narrow. |
| EventBridge | Cross-service routing; "any S3 event matching X goes to Y" | EventBridge events have a 256 KB payload limit — fine for S3 notifications. |
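For the direct-to-Lambda path, the handler's first job is pulling the bucket and key out of the notification. A minimal sketch, assuming the standard S3 event record layout; `object_refs` is a hypothetical helper and the print stands in for the real transform. Object keys arrive URL-encoded in the event, with spaces as `+`, which is an easy detail to miss.

```python
from urllib.parse import unquote_plus


def object_refs(event):
    """Extract (bucket, key) pairs for object-created records in an S3 event."""
    refs = []
    for record in event.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue  # skip deletes, restores, and other event types
        s3 = record["s3"]
        # Keys are URL-encoded in the notification payload.
        refs.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return refs


def handler(event, context):
    """Lambda entry point; the real work (thumbnailing etc.) is left out."""
    for bucket, key in object_refs(event):
        print(f"would process s3://{bucket}/{key}")


if __name__ == "__main__":
    sample = {
        "Records": [
            {
                "eventName": "ObjectCreated:Put",
                "s3": {
                    "bucket": {"name": "lab-bucket"},
                    "object": {"key": "photos/my+cat%281%29.jpg"},
                },
            }
        ]
    }
    print(object_refs(sample))  # [('lab-bucket', 'photos/my cat(1).jpg')]
```

Keeping the parsing separate from the handler makes it testable without deploying anything.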
11 · Access control — the layered model
S3 access is decided by five overlapping mechanisms, evaluated in a specific order. Understanding the order is the whole game when debugging "why can/can't this principal read this object":
- Block Public Access (BPA): account-level and bucket-level trump cards (enabled by default since 2023). If BPA is on and a policy would allow public access, the request is denied regardless. The most common cause of "I made my bucket public but it still doesn't work."
- Explicit DENY in any policy: bucket policy, IAM policy, SCP, VPC endpoint policy. Any explicit DENY wins over any ALLOW. This is how organisations enforce "no public buckets, ever" even when individual teams misconfigure.
- Bucket policy: the resource-based policy on the bucket itself. The primary mechanism for cross-account access and for granting unauthenticated public read on, say, a static website bucket.
- IAM identity policy: what the principal is allowed to do. Combined with the bucket policy: for same-account access an Allow in either one is enough; for cross-account access the action must be allowed by both.
- ACLs (legacy): pre-IAM mechanism for per-object permissions. AWS recommends disabling them via Object Ownership ("Bucket owner enforced"), the default for new buckets since 2023. Old code that sets per-object ACLs breaks on these buckets.
Two specialised tools sit on top: Object Lock (WORM mode — object can't be deleted or overwritten until a retention date, used for compliance) and Origin Access Control (OAC) (the recommended way to let only a specific CloudFront distribution read from a private bucket — uses SigV4 between CloudFront and S3, replacing the older Origin Access Identity).
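The evaluation order can be written down as a toy decision function, which is handy as a debugging checklist. This is a deliberately simplified model, not AWS's evaluation logic: it assumes the standard IAM rule that same-account access needs an Allow from either the bucket policy or the identity policy while cross-account access needs both, and it leaves out ACLs, SCP allow-lists, permission boundaries, and KMS key policies. All names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class AccessRequest:
    anonymous: bool               # unauthenticated (public) request
    same_account: bool            # caller belongs to the bucket owner's account
    bpa_blocks_public: bool       # Block Public Access is on
    explicit_deny: bool           # some policy contains a matching Deny
    bucket_policy_allows: bool
    identity_policy_allows: bool


def decide(req: AccessRequest):
    """Return (allowed, reason) following the layered order in the list above."""
    if req.anonymous and req.bpa_blocks_public:
        return (False, "Block Public Access")
    if req.explicit_deny:
        return (False, "explicit Deny")
    if req.anonymous:
        allowed = req.bucket_policy_allows
    elif req.same_account:
        allowed = req.bucket_policy_allows or req.identity_policy_allows
    else:
        allowed = req.bucket_policy_allows and req.identity_policy_allows
    return (allowed, "policy Allow" if allowed else "implicit deny")


if __name__ == "__main__":
    # Cross-account caller whose own IAM policy allows the action,
    # but the bucket policy says nothing: denied.
    print(decide(AccessRequest(False, False, True, False, False, True)))
    # (False, 'implicit deny')
```

Walking a failing request through these branches in order usually points at the layer to inspect first.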
12 · S3 Select, Object Lambda, and the analytics stack
S3 Select runs SQL against a single object: filter a 10 GB CSV/JSON/Parquet object server-side and return only matching rows. The point is to avoid pulling 10 GB across the wire to extract 10 MB. Useful, but narrow: one object per query, no joins. AWS closed S3 Select to new customers in mid-2024, so newer accounts may not be able to call it.
Athena is the next layer: SQL over many objects in S3, no infrastructure to manage, billed per-TB scanned (~$5/TB). Combined with Glue (schema catalog + ETL), it becomes the de-facto "data lake on AWS" stack. Partition pruning via Hive-style key paths (year=YYYY/month=MM/day=DD/) is the whole performance story.
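Hive-style partitioning is just a key-naming convention, which a few lines can illustrate. The sketch below builds partitioned prefixes and filters keys by partition value, mimicking the effect of pruning: keys outside the requested partitions are never opened. It is not how Athena implements pruning, and the function names and the `logs` table root are assumptions.

```python
from datetime import date


def partition_prefix(table_root: str, day: date) -> str:
    """Hive-style prefix, e.g. logs/year=2026/month=01/day=01/."""
    return f"{table_root}/year={day:%Y}/month={day:%m}/day={day:%d}/"


def parse_partitions(key: str) -> dict:
    """Read column=value segments out of a key."""
    return dict(seg.split("=", 1) for seg in key.split("/") if "=" in seg)


def prune(keys, **wanted):
    """Keep only keys whose partition values match every requested column."""
    return [
        k for k in keys
        if all(parse_partitions(k).get(col) == val for col, val in wanted.items())
    ]


if __name__ == "__main__":
    keys = [
        partition_prefix("logs", date(2026, 1, 1)) + "part-0.parquet",
        partition_prefix("logs", date(2026, 2, 1)) + "part-0.parquet",
    ]
    print(prune(keys, year="2026", month="01"))
    # ['logs/year=2026/month=01/day=01/part-0.parquet']
```

Since Athena bills per TB scanned, a query that filters on the partition columns pays only for the objects that survive this kind of filter.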
S3 Object Lambda intercepts a GET and runs a Lambda over the object's bytes before they reach the caller — for redaction, format conversion, image resizing, watermark insertion. Useful when you need different shapes of the same canonical object without storing each shape separately.
Beyond AWS's own surface, S3 is the storage layer for most cloud-native analytics: Snowflake, Databricks, Iceberg / Hudi / Delta Lake table formats, DuckDB with the httpfs extension. Each layer reads Parquet (or Iceberg metadata) directly from S3 — the database engine is decoupled from the storage and runs wherever you put it.
13 · Real-world case studies
Three public stories give a sense of how S3 actually shapes systems at scale.
Dropbox — Magic Pocket (2016). Dropbox had been built on S3 since launch. By 2015 they were storing exabytes and the AWS bill was a meaningful share of their cost structure. They built Magic Pocket, an in-house multi-exabyte object storage system, and moved ~90% of user data off S3 over two years. The interesting lesson isn't "leave S3" — most companies will never reach the scale where the economics flip — it's that S3's API became the shape Dropbox's replacement copied. They built a stratified erasure-coded storage system with a key-value front-end because that's the proven design point at this scale.
Netflix — Open Connect (2012–present). Netflix's CDN, Open Connect, serves video bytes from ISP-located appliances around the world. The canonical store for that content is S3 — encoded files are written to S3 first, then propagated to Open Connect appliances during off-peak hours. S3 is the source of truth; the appliances are a cache layer. This is the pattern for any "global delivery from one canonical store" — S3 handles the durability and consistency; you handle the propagation and caching close to users.
Snowflake — separation of storage and compute (2014–present). Snowflake's architecture stores all persistent data in S3 (or Azure Blob, or GCS — same shape on each cloud), with a stateless compute layer that scales independently. Query engines spin up, read columnar data from S3, do the work, and spin down. The reason Snowflake can grow compute and storage on separate curves — and the reason the same architecture is reproduced in Iceberg, Delta Lake, and every modern warehouse — is that S3-style object storage provides durable, infinitely-scalable, cheap blob storage that any number of compute readers can point at without coordination.
The through-line: S3 is most useful when treated as a durability and consistency primitive on top of which you build domain-specific systems, rather than as a generic filesystem replacement.
14 · Build it yourself — bucket, upload, lifecycle, presigned URL
- Create the bucket.
BUCKET=lab-s3-$(date +%s)
aws s3api create-bucket --bucket "$BUCKET" --region us-east-1
aws s3api put-bucket-versioning --bucket "$BUCKET" --versioning-configuration Status=Enabled
- Upload an object.
echo "hello s3" > /tmp/test.txt
aws s3 cp /tmp/test.txt "s3://$BUCKET/2026/01/01/test.txt"
aws s3api head-object --bucket "$BUCKET" --key 2026/01/01/test.txt
- Set a lifecycle rule.
cat > /tmp/lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "tier-then-expire",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER" }
      ],
      "Expiration": { "Days": 365 },
      "AbortIncompleteMultipartUpload": { "DaysAfterInitiation": 7 }
    }
  ]
}
EOF
aws s3api put-bucket-lifecycle-configuration --bucket "$BUCKET" --lifecycle-configuration file:///tmp/lifecycle.json
- Generate a presigned GET URL valid for 5 minutes.
aws s3 presign "s3://$BUCKET/2026/01/01/test.txt" --expires-in 300
# curl that URL; it works even without AWS credentials
- Try a multipart upload with the high-level CLI (aws s3 cp).
dd if=/dev/urandom of=/tmp/bigfile bs=1M count=200
aws configure set s3.multipart_threshold 64MB
aws configure set s3.multipart_chunksize 16MB
aws s3 cp /tmp/bigfile "s3://$BUCKET/big.bin" --debug 2>&1 | grep -i "uploadpart" | head -3
- S3 Select on a tiny CSV.
printf "name,age\nalice,30\nbob,25\ncarol,35\n" > /tmp/people.csv
aws s3 cp /tmp/people.csv "s3://$BUCKET/people.csv"
aws s3api select-object-content --bucket "$BUCKET" --key people.csv \
  --expression "SELECT s.name FROM s3object s WHERE CAST(s.age AS INT) > 28" \
  --expression-type SQL \
  --input-serialization '{"CSV": {"FileHeaderInfo": "USE"}, "CompressionType": "NONE"}' \
  --output-serialization '{"CSV": {}}' /tmp/selected.csv
cat /tmp/selected.csv
- Tear it down.
# Versioning is on, so purge every version before deleting the bucket.
# '.[]?' keeps jq quiet when the query returns null (nothing to delete).
aws s3api list-object-versions --bucket "$BUCKET" --query 'Versions[].{Key:Key,VersionId:VersionId}' \
  --output json | jq -c '.[]?' | while read -r v; do
  KEY=$(echo "$v" | jq -r .Key); VID=$(echo "$v" | jq -r .VersionId)
  aws s3api delete-object --bucket "$BUCKET" --key "$KEY" --version-id "$VID"
done
# Also purge delete markers.
aws s3api list-object-versions --bucket "$BUCKET" --query 'DeleteMarkers[].{Key:Key,VersionId:VersionId}' \
  --output json | jq -c '.[]?' | while read -r v; do
  KEY=$(echo "$v" | jq -r .Key); VID=$(echo "$v" | jq -r .VersionId)
  aws s3api delete-object --bucket "$BUCKET" --key "$KEY" --version-id "$VID"
done
aws s3api delete-bucket --bucket "$BUCKET"
15 · What breaks
- Bucket name already taken. Bucket names are global. Pick a unique suffix; the date / random pattern is conventional. (You can't ever get an early-2010s short name.)
- "Access Denied" on a fresh bucket. The default since 2023 is Block Public Access enabled at the account and bucket level. Even objects with public ACLs aren't reachable until you disable BPA. Don't disable it unless you actively serve a static website from this bucket and have audited the policy.
- "It's been hours and my object is still in Standard." Lifecycle transitions are not real-time — they run via a daily background evaluator. New rules can take 24–48 hours to first apply on existing objects.
- S3 Object Ownership disables ACLs. Default since April 2023 is "Bucket owner enforced": per-object ACLs are disabled, everything goes through bucket policy. Old code that does aws s3 cp --acl public-read errors out.
- Cost surprise on small objects in Glacier. Standard-IA, One Zone-IA, and Glacier Instant Retrieval have a 128 KB minimum billable size; Glacier Flexible and Deep Archive add roughly 40 KB of billed metadata per object and carry a minimum 90/180-day storage duration, so delete earlier and you still pay for the full minimum. Aggregate small files before archiving.
- SSE-KMS rate-limiting at high RPS. Every object PUT/GET makes a KMS API call. KMS quotas (5,500–30,000 rps per region) become the bottleneck for high-traffic buckets. Enable S3 Bucket Keys: it batches data-key generation and cuts KMS calls by up to 99%.
- CRR doesn't replicate existing objects. Only objects written after replication was enabled. For the existing backlog, run S3 Batch Replication explicitly.
- "My presigned URL stopped working in 1 hour." If signed by a role-based session (Lambda, EC2 instance profile), the URL is bounded by the session token's expiry, not your --expires-in. Sign long-lived URLs from a long-lived IAM user.
16 · Further reading
- S3 user guide. The canonical reference; the "best practices" pages on performance and security are the must-reads.
- Strong consistency announcement (Dec 2020). Retired the eventual-consistency warnings.
- "Using Lightweight Formal Methods to Validate a Key-Value Storage Node in Amazon S3" (SOSP 2021). AWS's own paper on the ShardStore storage engine and how they verify it.
- Dropbox — Inside Magic Pocket. Their multi-exabyte storage system; the "why we left S3" engineering story.
- Netflix — Open Connect content distribution. S3 as the canonical store underneath the world's largest video CDN.
- The Snowflake Elastic Data Warehouse (SIGMOD 2016). The canonical paper on storage/compute separation built on S3.
- LSM trees. S3 is, under the hood, a giant key-value store; LSM-tree internals are the engine-level primer.
- Cloud storage (concepts). Where S3 sits in the broader storage-primitives taxonomy.