S3 Prefix Sharding Simulator: when one bucket isn't one anything.
3500 PUT/s and 5500 GET/s per prefix. Cross the line and S3 returns 503 SlowDown. The auto-partitioner eventually splits the hot prefix — but only after sustained pressure.
Each card is one S3 prefix, which maps to one internal partition. The two bars inside it are live PUT and GET rates against the published ceilings — 3500 writes and 5500 reads per second — and a bar turns red the moment its rate crosses the line. The total PUT/GET sliders set how much traffic floods in; the simulator splits it evenly across every prefix you have, so adding prefixes is the same as buying headroom.
Press Start traffic, then push the PUT slider past about 14,000 against the four date-first prefixes. Each card holds at 3500, the overflow becomes 503 SlowDown, and the trace logs the hot prefix. Now wait. After 30 seconds of sustained pressure a hot prefix auto-partitions into two children, the partition count climbs, and the 503 ratio starts to fall. The surprise is the lag: the fix is automatic but never instant, which is exactly why a bursty workload still eats throttling before relief arrives, and why a shard-first layout that spreads load from the start beats waiting for the splitter.
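The even-split rule the cards follow can be written down in a few lines. This is a minimal sketch of the arithmetic as described here, not the simulator's actual source; the function name and the returned field names are made up for illustration.

```python
PUT_CEILING = 3500  # published per-prefix write ceiling, requests/second


def split_evenly(total_put: float, prefixes: int) -> dict:
    """Spread total_put evenly over the prefixes and cap each one at the ceiling."""
    offered = total_put / prefixes             # what each prefix is asked to absorb
    served = min(offered, PUT_CEILING)         # what each prefix can actually accept
    throttled = (offered - served) * prefixes  # overflow that comes back as 503 SlowDown
    return {
        "per_prefix_served": served,
        "throttled_per_second": throttled,
        "ratio_503": throttled / total_put if total_put else 0.0,
    }


print(split_evenly(14_000, 4))
print(split_evenly(20_000, 4))
```

At 14,000 PUT/s over four prefixes nothing is throttled; at 20,000 each prefix is offered 5000/s, serves 3500/s, and 30% of writes come back as 503s.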
What's an S3 prefix?
Everything up to a chosen byte position in the key.
An S3 prefix is everything in an object key up to some chosen byte position. The key 2026/01/15/event-2873.json has prefixes 2, 20, 202, 2026, 2026/, 2026/0, and so on — every left-anchored substring. S3 partitions a bucket internally by prefix; each partition has its own request budget; two keys that share a long common prefix likely live on the same partition.
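To make "every left-anchored substring" concrete, here is a small sketch that lists the prefixes of a key and finds the prefix two keys share. It assumes ASCII keys, so character positions and byte positions coincide; the helper names are illustrative only.

```python
import os


def left_anchored_prefixes(key: str) -> list[str]:
    """Every left-anchored substring of the key, shortest first."""
    return [key[:i] for i in range(1, len(key) + 1)]


def shared_prefix(key_a: str, key_b: str) -> str:
    """Longest prefix two keys have in common (character-wise comparison)."""
    return os.path.commonprefix([key_a, key_b])


key = "2026/01/15/event-2873.json"
print(left_anchored_prefixes(key)[:6])
print(shared_prefix(key, "2026/01/16/event-0001.json"))
```

The longer the shared prefix between two keys, the more likely (per the description above) that they land on the same partition.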
The exact partitioning algorithm is internal to S3 and changes over time. What AWS publishes is the budget per partition: 3500 PUT/COPY/POST/DELETE per second, and 5500 GET/HEAD per second. Sustained traffic above those numbers gets back 503 SlowDown responses with a Retry-After header. Within minutes (or sometimes seconds), S3's auto-partitioner notices the heat and splits the hot prefix into two child partitions, each with its own budget — and the throttling clears.
This is the whole story of S3 prefix sharding. It's not about your bucket. It's about how S3 internally distributes work across its physical fleet, and how your key design dictates whether that distribution is even or pathological. The simulator above models the four prefix partitions, the request gauges per partition, the 503 throttling when a partition is overloaded, and the auto-partition split that eventually relieves the pressure.
The implication for designers is direct: if your key layout puts a low-cardinality field at the front (a year, a month, a day), most of today's writes will share a prefix and pound a single partition. If you put a high-cardinality field early (a shard ID, a hash, a customer GUID), writes spread across many partitions automatically. The auto-partitioner can save you eventually, but not instantly: under sustained pressure you will absorb 503s for anywhere from 30 seconds to several minutes before the split happens.
The published ceilings — 3500 and 5500
Per-prefix budget, not per-bucket.
AWS publishes two numbers that have stood almost unchanged since 2018: 3500 PUT/COPY/POST/DELETE per second per prefix, and 5500 GET/HEAD per second per prefix. These are not per-bucket caps. They're per-partition, and a bucket can have arbitrarily many partitions. The published advice — "use multiple prefixes" — is literal: spread writes across N prefixes and you have N × 3500 PUT/s of headroom.
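The "N prefixes give N × 3500" rule can be turned around to estimate how many prefixes a target rate needs. A minimal sketch, assuming traffic spreads perfectly evenly across prefixes (real workloads rarely do, so treat the result as a lower bound); the function name is hypothetical.

```python
import math

PUT_CEILING = 3500  # PUT/COPY/POST/DELETE per second per prefix
GET_CEILING = 5500  # GET/HEAD per second per prefix


def prefixes_needed(put_per_s: float, get_per_s: float) -> int:
    """Smallest prefix count whose combined budget covers both rates."""
    by_put = math.ceil(put_per_s / PUT_CEILING)
    by_get = math.ceil(get_per_s / GET_CEILING)
    return max(1, by_put, by_get)


print(prefixes_needed(14_000, 0))        # write-heavy example
print(prefixes_needed(100_000, 100_000)) # mixed example
```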
There is no documented cap on the number of prefixes per bucket. Real buckets at scale have hundreds of thousands or millions of active prefixes; the auto-partitioner creates new ones as load demands. Customers who care about this — Netflix, Snowflake, Dropbox (before Magic Pocket), Pinterest, every video platform — routinely run buckets at 100k+ requests per second.
The asymmetry between PUT and GET reflects the underlying cost. A PUT involves durability — the object has to be replicated across multiple availability zones before the write returns. A GET serves from any one replica. Reads are cheaper, so the budget is higher.
Other operations have their own behaviour. LIST operations are throttled separately and are quite expensive; they should be avoided in hot paths. HEAD counts against the GET budget. Multipart uploads draw on the write budget once per part: every UploadPart is its own PUT, and the calls that start and complete the upload are POSTs in the same write class, so a thousand-part upload costs roughly a thousand write operations against the prefix.
| Operation | Per-prefix ceiling | Notes |
|---|---|---|
| PUT, COPY, POST, DELETE | 3500/s | durability path |
| GET, HEAD | 5500/s | read path |
| LIST | throttled separately | avoid in hot path |
| Multipart upload part | counts as PUT | 1000-part upload ≈ 1000 PUTs, plus the start and complete calls |
| SSE-KMS GET / PUT | shares KMS regional limits | typically 5500/s without bucket key |
The "old advice" — random hex prefixes, and why it's outdated
Pre-2018 wisdom, post-2018 outdated.
Before July 2018, AWS publicly recommended that customers prepend a random hex string to every S3 key — typically the MD5 of the original key, or just a counter-based hash. The advice was correct under the partitioning model of the era: prefix partitions were essentially static, and a date-first key pattern (2018/05/01/event.json) would concentrate today's writes on a single partition, blow through the 3500 PUT/s budget, and stay throttled until traffic shifted to the next day.
The random-hex advice was painful: it destroyed the natural ordering of keys, made LIST output meaningless, broke many client-side tools, and made object-lifecycle expressions awkward. Operators tolerated it because the alternative was throttling.
In July 2018, AWS shipped what it called "Increased Request Rate Performance" — really, an automatic partition-splitting service that runs continuously in the background. The auto-partitioner detects hot prefixes (using internal metrics) and transparently splits them into two child partitions. The split takes some time (officially "within minutes"; in practice anywhere from 30 seconds to 10 minutes depending on load), but it's automatic. After the split, the new child prefixes each have their own 3500/5500 budget; the throttling clears.
The modern recommendation is no longer "randomise"; it's "use a key layout with high cardinality early in the key, but you don't have to artificially randomise". A shard ID, a hash of the user ID, a customer GUID, a UUID (anything that varies in the first few bytes) is enough. Date-first layouts still work when the sustained write rate into the current day's prefix stays below ~3000 PUT/s; above that the auto-partitioner will eventually catch up but you'll eat some 503s in the meantime.
For burst traffic that spikes to tens of thousands of writes per second in a few seconds — flash sales, breaking-news ingest, denial-of-service rebound — the auto-partitioner is too slow. In that case, prepend a few bytes of randomness (a 2-character hex shard prefix gives 256 partitions, each with the full 3500 PUT/s budget). The penalty in key readability is small; the savings in throttle latency are real.
The auto-partitioner — what it does and how fast
Within minutes, not real-time.
AWS doesn't publish the auto-partitioner's design, but the externally observable behaviour is consistent. When a prefix's request rate exceeds its budget for some sustained interval (probably tens of seconds), the system schedules a split. The split takes some time to complete — anywhere from 30 seconds to ten minutes depending on the size and load of the partition — during which 503s continue. Once it completes, the partition is replaced by two child partitions covering disjoint subsets of the original key range, each with its own 3500/5500 budget.
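The observable behaviour can be mimicked with a toy model. This sketch is not S3's algorithm, which is unpublished; it assumes a fixed 30-second sustained-pressure window (the figure the simulator uses), an instantaneous split, load that divides evenly between children, and every hot partition splitting at the same moment. All names are hypothetical.

```python
PUT_CEILING = 3500   # per-partition write budget, requests/second
SPLIT_AFTER_S = 30   # assumed sustained-pressure window before a split


def simulate(total_put: float, partitions: int, seconds: int) -> list[tuple[int, int, float]]:
    """Return (second, partition_count, ratio_503) samples for a steady load."""
    hot_for = 0
    timeline = []
    for t in range(seconds):
        offered = total_put / partitions
        throttled = max(0.0, offered - PUT_CEILING) * partitions
        timeline.append((t, partitions, throttled / total_put))
        if offered > PUT_CEILING:
            hot_for += 1
            if hot_for >= SPLIT_AFTER_S:
                partitions *= 2  # toy assumption: all hot partitions split together
                hot_for = 0
        else:
            hot_for = 0
    return timeline


for second, count, ratio in simulate(20_000, 4, 61)[::15]:
    print(second, count, round(ratio, 2))
```

With 20,000 PUT/s against four partitions, the model throttles 30% of writes for the first 30 seconds, then doubles to eight partitions and the 503 ratio drops to zero. The lag, not the split itself, is what a bursty workload pays for.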
The split is transparent to the client. No URL changes; no DNS update; no application-visible event. From the application's point of view, 503s clear and throughput resumes. The new partitioning is invisible because S3 routes each request by looking up the key in its internal partition map, and that map is updated atomically when the split completes.
What the auto-partitioner does not do: split partitions that are not hot. If your key layout creates millions of cold prefixes — say, one prefix per object — you get the request-budget benefit but waste storage metadata. If your layout creates one very hot prefix, the auto-partitioner will eventually save you, but you'll absorb 503s in the meantime. The system is designed for steady-state imbalance, not for instant burst absorption.
For predictable high-throughput workloads, the right thing to do is pre-partition: design your key layout so the first few bytes are a hash that spreads naturally across thousands of partitions from day one. The auto-partitioner then has nothing to do, and you never see 503s. Netflix's S3 layouts famously prepend a few hex characters of md5(content-id) for exactly this reason.
How throttling presents — 503 SlowDown and SDK retries
The error you (usually) never see.
When a partition's request budget is exceeded, S3 returns HTTP 503 SlowDown with a Retry-After header indicating how long the client should wait before retrying. The error body is XML with the error code SlowDown — distinct from a regular 5xx, which usually indicates a backend issue.
The AWS SDK handles 503s automatically. The default retry policy is exponential backoff with jitter, up to a configurable maximum (3 retries by default in most SDKs, up to 10 in some). For most applications, sustained 503s appear in CloudWatch as 5xxErrors spikes but never propagate up to application code — the SDK absorbs them.
That doesn't mean throttling is harmless. Each retry adds latency. A 503 followed by a 200ms backoff means your 50ms write now takes at least 250ms. Under sustained throttling, p99 latency degrades dramatically; SDK clients running in parallel may exhaust their retry budget and surface the error to the application; and throughput drops as the SDK queues up backed-off requests.
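What "exponential backoff with jitter" does to a throttled call can be sketched without any SDK. The example below is a generic illustration, not the AWS SDK's implementation: `SlowDown` is a stand-in exception, and the delay parameters are arbitrary assumptions. It uses the "full jitter" variant, where each wait is drawn uniformly between zero and an exponentially growing cap.

```python
import random
import time


class SlowDown(Exception):
    """Stand-in for a 503 SlowDown response (hypothetical; real SDKs raise their own error types)."""


def call_with_backoff(operation, max_retries=3, base_delay=0.1, max_delay=5.0, sleep=time.sleep):
    """Run operation(), retrying on SlowDown with capped exponential backoff and full jitter."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except SlowDown:
            if attempt == max_retries:
                raise  # retry budget exhausted: surface the error to the caller
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))  # random wait in [0, cap]
```

Passing `sleep` as a parameter keeps the function testable: a caller can record the delays instead of actually waiting. Every retry that succeeds is invisible to the application except as added latency, which is the effect described above.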
The monitoring story: CloudWatch's 5xxErrors metric on the bucket is the leading indicator. If it climbs above ~1% of total requests, you have a hot prefix. The bucket-level metric 5xxErrors divided by AllRequests gives the error ratio. S3 Storage Lens provides per-prefix breakdowns at additional cost; the metric you want is HotKeyErrorRatePercentage.
Designing keys for throughput — the layout reflex
High-cardinality field early.
The reflex for designing high-throughput S3 keys: put the high-cardinality field first. Whatever varies most across your write stream — a customer ID, a request ID, a shard hash — goes at the front of the key. Lower-cardinality fields (date, type, event class) come later. This single discipline is enough for most workloads.
Consider two layouts for the same data — say, event records partitioned by date and shard. Layout A is 2026/01/01/shard=03/event.json — date first. Layout B is shard=03/2026/01/01/event.json — shard first. Both store the same data with the same total cardinality. But layout A concentrates all of today's writes on a single root prefix (2026/01/01/) until the auto-partitioner catches up. Layout B distributes today's writes across as many root prefixes as there are shards — typically 4, 16, or 256.
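The difference between the two layouts shows up as soon as you count how many distinct leading byte sequences a day's keys produce. A minimal sketch, with hypothetical helper names and an arbitrary choice of 16 shards and an 8-character window:

```python
from datetime import date


def layout_a(day: date, shard: int, name: str) -> str:
    """Date first: every write for a given day shares the same leading bytes."""
    return f"{day:%Y/%m/%d}/shard={shard:02d}/{name}"


def layout_b(day: date, shard: int, name: str) -> str:
    """Shard first: the leading bytes differ per shard."""
    return f"shard={shard:02d}/{day:%Y/%m/%d}/{name}"


def distinct_leading(keys: list[str], n_chars: int = 8) -> int:
    """How many different values the first n_chars of the keys take."""
    return len({k[:n_chars] for k in keys})


today = date(2026, 1, 1)
keys_a = [layout_a(today, s, "event.json") for s in range(16)]
keys_b = [layout_b(today, s, "event.json") for s in range(16)]
print(distinct_leading(keys_a), distinct_leading(keys_b))  # 1 16
```

Same objects, same fields, but layout A offers S3 one place to put today's writes and layout B offers sixteen.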
The same logic applies for hash-prefix designs. If your write rate is unpredictable or bursty, prepend md5(content-id)[:2] as the first 2 characters: 7a/customer=42/2026/01/01/event.json. That gives you 256 root prefixes from day one, each with its own 3500/5500 budget. Total bucket throughput: 256 × 3500 ≈ 900k PUT/s and 256 × 5500 ≈ 1.4M GET/s. No auto-partitioner involved.
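A hash-prefixed key is a one-liner to build. The sketch below assumes the content ID is a string and that MD5 is used purely for spreading (not for security); `hashed_key` and `aggregate_ceiling` are illustrative names, and the aggregate figure assumes every hex prefix ends up on its own partition with traffic spread evenly.

```python
import hashlib

PUT_CEILING = 3500
GET_CEILING = 5500


def hashed_key(content_id: str, rest: str, hex_chars: int = 2) -> str:
    """Prepend the first hex_chars of md5(content_id) to the rest of the key."""
    shard = hashlib.md5(content_id.encode("utf-8")).hexdigest()[:hex_chars]
    return f"{shard}/{rest}"


def aggregate_ceiling(hex_chars: int = 2) -> tuple[int, int]:
    """(PUT/s, GET/s) if each of the 16**hex_chars prefixes gets a full budget."""
    prefixes = 16 ** hex_chars
    return prefixes * PUT_CEILING, prefixes * GET_CEILING


print(hashed_key("content-42", "customer=42/2026/01/01/event.json"))
print(aggregate_ceiling(2))  # (896000, 1408000)
```

Because the shard is derived from the content ID, a reader who knows the ID can recompute the full key without a lookup; only listing by date or customer needs the external index.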
The penalty for hash prefixes is that LIST output is no longer ordered by anything meaningful. If your operational workflows depend on listing keys by date or by customer, you'll need a separate index (DynamoDB, an Athena table, an external catalog) to recover that ordering. For pure write-and-read-by-key workloads, the hash prefix is free.
| Layout | Today's hot prefix | Effective PUT/s | Notes |
|---|---|---|---|
| 2026/01/01/* | 1 partition | 3500 | auto-partitions eventually |
| shard=N/2026/01/01/* | N partitions | N × 3500 | no random key bytes; LIST still ordered |
| md5[:2]/2026/01/01/* | 256 partitions | ~900k | LIST output unordered; need external index |
| UUID/* | ~thousands | millions | maximum spread; no natural ordering |
The throttle signal — CloudWatch metrics that matter
What to watch, what to alarm on.
S3 emits several CloudWatch metrics that together indicate request-rate health. The primary signal is 5xxErrors at the bucket level, available if you opt in to request metrics (it costs extra). This metric counts every 5xx response the bucket returned, 503 SlowDown included, summed across all prefixes. Anything above 0.1% of AllRequests is worth investigating; above 1% is an active incident.
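The thresholds translate directly into a small classifier over two metric sums taken across the same time window. This is a sketch of the arithmetic only: fetching the values from CloudWatch is left out, and the function names and labels are made up.

```python
INVESTIGATE_ABOVE = 0.001  # 0.1% of AllRequests
INCIDENT_ABOVE = 0.01      # 1% of AllRequests


def error_ratio(errors_5xx: float, all_requests: float) -> float:
    """5xxErrors divided by AllRequests; zero traffic counts as zero errors."""
    return errors_5xx / all_requests if all_requests else 0.0


def classify(errors_5xx: float, all_requests: float) -> str:
    ratio = error_ratio(errors_5xx, all_requests)
    if ratio > INCIDENT_ABOVE:
        return "incident"
    if ratio > INVESTIGATE_ABOVE:
        return "investigate"
    return "ok"


print(classify(5, 10_000), classify(50, 10_000), classify(200, 10_000))
```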
The complementary signal is FirstByteLatency and TotalRequestLatency. Throttling indirectly increases these because SDK retries take real wall time. A bucket with steady-state latency p99 of 50ms that climbs to 500ms is almost certainly experiencing 503s on the prefix; the latency reflects the backoff time, not the underlying disk.
For finer-grained analysis, S3 Storage Lens (paid) provides per-prefix breakdowns including HotKeyErrorRatePercentage, RequestRate, and PrefixCount. Most operators turn this on only when investigating a specific incident; the cost ($0.20 per million objects per month for advanced metrics) adds up for large buckets.
The classical operational pattern: alarm on 5xxErrors > 0.5% of AllRequests at the bucket level. When it fires, pull S3 Storage Lens for the affected bucket to identify the hot prefix. Adjust key layout (or add shard prefixes) at the application layer. Wait for the auto-partitioner to catch up; if the error rate stays above the threshold for hours after the layout change, escalate to AWS support with the bucket name and the timing.
Real numbers — what production looks like
The ceilings rarely bite when key design is right.
Public talks and engineering blogs give a sense of the actual numbers production S3 buckets push. Netflix processes 100k+ requests per second into a single S3 bucket as part of its open-connect content distribution pipeline, with hashed prefixes that spread across thousands of partitions. Snowflake's storage layer is S3; a large Snowflake deployment can sustain hundreds of thousands of micro-partition reads per second per warehouse.
Pinterest handles billions of image PUTs per day, with a key layout that prepends a 2-hex-character hash for spread. Discord's message-attachment service shards its upload traffic across thousands of S3 prefixes. Dropbox built Magic Pocket because at exabyte scale even S3 economics tilt; until then, its file storage ran on S3 with ~30TB/day uploaded.
What's notable about these workloads is what they're not doing: they're not flailing against the 3500/5500 ceilings. They designed their key layouts up front, spread their writes across many partitions, and never see the limits. The ceilings are designed to be invisible at scale — they bite only when key design is naive.
The exception is bursty workloads. The classical anti-pattern is a sudden flash-sale or breaking-news event that drives ten thousand writes per second into a key prefix that was previously cold. The auto-partitioner takes minutes to catch up; during those minutes the customer sees 503s. Pre-partitioning (forcing a hash prefix into the key layout from the start) is the only mitigation; runtime detection is too slow.
What else throttles in S3 — multipart, replication, KMS
The other limits worth knowing.
Multipart upload counts as multiple operations. Each UploadPart call is a PUT against the prefix budget. A 1000-part upload (a 100 GB file in 100 MB chunks) is roughly a thousand writes from the bucket's point of view: 1000 UploadPart PUTs, plus one POST to start the upload and one POST to complete it, both of which fall in the same write class. If you upload many large files concurrently, every one of those multipart uploads adds to the prefix's PUT rate.
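The part count for a given object and chunk size is simple ceiling division. In this sketch the units are decimal (1 GB = 1000 MB) to match the example above, the start and complete calls are reported separately so they can be counted however your accounting prefers, and the 10,000-part cap is S3's documented per-upload limit. The function name is hypothetical.

```python
MB = 10**6
GB = 10**9
MAX_PARTS = 10_000  # S3's limit on parts in one multipart upload


def multipart_requests(object_bytes: int, part_bytes: int) -> dict:
    """Count the write-side requests one multipart upload generates."""
    parts = (object_bytes + part_bytes - 1) // part_bytes  # ceiling division
    if parts > MAX_PARTS:
        raise ValueError(f"{parts} parts exceeds the {MAX_PARTS}-part limit; use larger parts")
    return {"upload_part_puts": parts, "start_and_complete_calls": 2}


print(multipart_requests(100 * GB, 100 * MB))
```

Ten such uploads running concurrently into one prefix, each pushing a few hundred parts per second, is enough on its own to approach the 3500/s write ceiling.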
Cross-region replication consumes throughput on both source and destination buckets. Replication is asynchronous: it reads from the source and then writes to the destination, each within its own prefix budget. A bucket with active replication and a high write rate therefore picks up one extra source-side read per replicated object (the replicator's reads) and generates write traffic on the destination at the source's write rate.
SSE-KMS rate limits are usually the binding constraint long before S3's per-prefix limits. Each PUT or GET on an SSE-KMS object requires a call to KMS to encrypt or decrypt the data key. KMS has a regional limit (typically 5500 to 30000 requests per second depending on region and key type). At sustained high request rates, KMS throttles before S3 does.
S3 Bucket Keys mitigate the KMS pressure. With Bucket Keys enabled, S3 caches a bucket-level key derived from KMS for short periods, so only a fraction of object operations require a KMS call. This can reduce KMS request volume by up to 99%, which for most workloads removes KMS as the bottleneck.
Transfer Acceleration is a separate concept. It routes uploads through the CloudFront edge location closest to the client, reducing round-trip latency. It doesn't affect per-prefix limits; the bucket-side throughput is still subject to the same 3500/5500 budget.
Edge cases — Bucket Keys, Transfer Acceleration, S3 Express One Zone
Where the published ceilings break down.
S3 Bucket Keys change the KMS calculus dramatically. Without Bucket Keys, every SSE-KMS PUT requires a KMS GenerateDataKey call and every GET requires a Decrypt call, and those calls count against a regional KMS quota that can be as low as 5500/s. With Bucket Keys enabled (free to enable, must be set per-bucket), S3 caches a bucket-level key locally for short periods; KMS request volume drops by up to ~99%. For any bucket with SSE-KMS and a high request rate, Bucket Keys should be on.
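A back-of-the-envelope check shows when KMS, rather than S3, is the binding limit. The sketch assumes one KMS call per object operation without Bucket Keys, a flat 99% reduction with them (the best case quoted above), and a 5500/s regional quota; the real quota depends on region and key type, and the function names are illustrative.

```python
def kms_calls_per_second(s3_requests_per_s: float, bucket_keys: bool, reduction: float = 0.99) -> float:
    """Estimated KMS request rate generated by SSE-KMS object traffic."""
    if not bucket_keys:
        return s3_requests_per_s  # one KMS call per object operation
    return s3_requests_per_s * (1.0 - reduction)


def kms_is_bottleneck(s3_requests_per_s: float, bucket_keys: bool, kms_quota_per_s: float = 5500) -> bool:
    """True when the estimated KMS rate exceeds the assumed regional quota."""
    return kms_calls_per_second(s3_requests_per_s, bucket_keys) > kms_quota_per_s


print(kms_is_bottleneck(20_000, bucket_keys=False))  # True
print(kms_is_bottleneck(20_000, bucket_keys=True))   # False
```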
S3 Transfer Acceleration routes uploads to the nearest CloudFront edge location and then over AWS's private backbone to the bucket. It improves upload latency for clients far from the bucket's region. It doesn't change per-prefix request limits; the bottleneck is still the bucket-side partition.
S3 Express One Zone (launched November 2023) is a separate storage class with different limits: it is single-AZ, offers single-digit-millisecond first-byte latency, and enforces its request limits per directory bucket rather than per prefix, at rates AWS describes as hundreds of thousands of requests per second. It's a separate product designed for ML training and latency-critical workloads; the per-prefix discussion above doesn't apply.
Versioning-enabled buckets behave the same way for request rate, but each PUT to an existing key creates a new version while keeping the old one (so storage grows with every overwrite), and DELETE creates a delete marker rather than removing the object. Versioning is independent of prefix sharding.
Object Lock doesn't affect request rate either, but does prevent deletes for the lock duration. PutObject with Object Lock metadata counts as a normal PUT against the prefix budget.
Tunable knobs — what AWS exposes
Per-bucket configuration that matters for throughput.
Request Payer (Requester Pays) shifts the per-request cost from the bucket owner to the requesting account. It doesn't change throughput characteristics, but it lets bucket owners host high-volume datasets without bearing the request bill — useful for public datasets and academic data lakes.
S3 Bucket Keys, mentioned above — turn it on for any bucket with SSE-KMS and meaningful request volume. The setting is per-bucket; existing objects don't migrate automatically but new PUTs use the cached data key.
Transfer Acceleration is a per-bucket toggle; it adds a small cost per GB uploaded ($0.04/GB to US/EU, more for distant regions). For latency-sensitive uploads from globally distributed clients, the cost is usually justified.
S3 Storage Lens provides per-bucket and per-prefix request metrics at additional cost ($0.20 per million objects per month for advanced metrics). For any bucket pushing more than a few thousand requests per second, Storage Lens is the most direct way to diagnose hot prefixes.
S3 Intelligent-Tiering doesn't affect request rate but does affect cost. For buckets with a high request rate, the per-request charge ($0.005 per 1000 PUTs, $0.0004 per 1000 GETs) often dominates the storage cost. At a sustained 100k req/s over a 30-day month (about 259 billion requests), that is roughly $104k/month if every request is a GET and about $1.3M/month if every request is a PUT, which dwarfs the storage bill for most buckets.
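The request-charge arithmetic is easy to reproduce for your own traffic mix. This sketch uses the per-request prices quoted above as constants (check current pricing for your region and storage class) and assumes a 30-day month of perfectly steady traffic; the function name is made up.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # 2,592,000 seconds in a 30-day month
PUT_PRICE_PER_1000 = 0.005          # USD, as quoted in the text
GET_PRICE_PER_1000 = 0.0004         # USD, as quoted in the text


def monthly_request_cost(put_per_s: float, get_per_s: float) -> float:
    """USD per month in request charges for a steady PUT/GET rate."""
    puts = put_per_s * SECONDS_PER_MONTH
    gets = get_per_s * SECONDS_PER_MONTH
    return puts / 1000 * PUT_PRICE_PER_1000 + gets / 1000 * GET_PRICE_PER_1000


print(round(monthly_request_cost(0, 100_000)))       # all GETs
print(round(monthly_request_cost(100_000, 0)))       # all PUTs
print(round(monthly_request_cost(10_000, 90_000)))   # 10% writes, 90% reads
```

The write fraction dominates the bill: PUTs are priced at more than ten times GETs, so even a 10% write share accounts for most of the monthly charge.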
Further reading on S3 internals and prefix sharding
Primary sources, in order.
- AWS docs · Best Practices Design Patterns: Optimizing Amazon S3 Performance. The authoritative AWS document. Per-prefix ceilings, parallelisation strategies, retry behaviour.
- AWS blog · July 2018 · Amazon S3 Announces Increased Request Rate Performance. The post that obsoleted the random-hex advice. Confirms the per-prefix budget model.
- Warfield · FAST 2023 keynote · Building and Operating a Pretty Big Storage System (My Adventures in Amazon S3). An S3 engineer's account of how the service is built and operated, including how load is spread across the storage fleet.
- Verbitski et al · SIGMOD 2018 · Amazon Aurora: On Avoiding Distributed Consensus for I/Os, Commits, and Membership Changes. Adjacent: how Amazon Aurora's storage shards across nodes. Same design principles, different surface.
- AWS blog · Best Practices for S3 Storage Lens. How to use Storage Lens to diagnose hot prefixes and tune key layouts.
- AWS docs · Troubleshooting Amazon S3 by Symptom. 503 SlowDown diagnostic flow. The first link to share with on-call during an incident.
- Netflix Tech Blog · Netflix and S3 — operational notes. Multi-part Netflix blog series on how their pipelines push 100k+ req/s to S3 buckets. Key design and CloudWatch alarms.
- AWS re:Invent · re:Invent talks on S3 architecture. Annual deep dives by S3 distinguished engineers. STG304/STG343 sessions across years cover the partition model and request-rate engineering.
- Bornholt et al · SOSP 2021 · Using Lightweight Formal Methods to Validate a Key-Value Storage Node in Amazon S3. The S3 team's SOSP 2021 paper on using formal methods to validate ShardStore. Adjacent reading for understanding correctness guarantees.
- Semicolony codex · S3 internals. The longer-form companion piece on S3's architecture, durability model, and operational characteristics.
- Semicolony simulator · Database sharding. Same problem, different surface: hash sharding, range sharding, consistent hashing. Adjacent reading.
- Semicolony simulator · Load balancer. How clients spread requests across many backends. The application-layer counterpart to S3's storage-layer sharding.