Database Sharding Simulator: three ways to split a database.
Sharding splits one table's rows across several databases so no single machine holds everything, and the shard key you pick decides how evenly the load spreads. Compare range, hash, and directory sharding: same rows, very different distribution shapes, and very different costs when you add a shard.
You pick a strategy and a shard count, then add rows. Each row is a random user and
region; the strategy decides which shard it lands on. The distribution bars show how
many rows each shard holds, and the recent-inserts log names the shard every row was
routed to. Hash runs fnv1a(user) mod N, range buckets rows by id (25 per
shard), and directory maps each region to a shard through a lookup table.
Start on hash and click Add 20 a few times: the bars stay roughly level because the hash spreads users evenly. Now switch to range and reset. Every new row carries the next id, and ids 1 to 25 all map to shard 0, so the first batch piles entirely onto one shard while the rest sit empty. That is the monotonic-key hot spot in miniature. Then try directory with eight shards: only four shards ever fill, because there are just four regions to map. The lesson the bars make obvious is that even row counts and even traffic are different things, and the strategy you pick decides which one you get.
What is database sharding?
One database, eventually too small.
Database sharding is horizontal partitioning of data across multiple database instances, where each shard owns a disjoint subset of the rows. The technique predates its current name — Google's GFS (2003) and BigTable (2006) papers used it; Friendster, MySpace, Flickr, and YouTube all sharded their MySQL fleets through the late 2000s. The simulator below lets you push on the three canonical strategies — range, hash, and directory partitioning — and see which keys move when you split or merge a shard.
Imagine a small social product that has grown into a medium-sized one. The product started with a single Postgres database. The schema is reasonable; the indexes are sensible; everyone on the team knows where the rows live. For three years it just worked. Then growth hit a corner: the user table now holds 8 billion rows, the index pages no longer fit in RAM, every query that misses the buffer pool turns into a disk seek, and the p99 latency on the home-feed endpoint has crept from 40 ms to 380 ms. The on-call rotation is becoming unpopular.
You can buy a bigger machine. Modern cloud instances offer 1.5 TB of RAM and 192 vCPUs in a single box, and a vertical bump will get you another year. After that the bump stops working. There is no 3 TB instance. Even if there were, the entire dataset is now one failure domain — one bad query, one corrupt page, one storage controller fault, and the whole product is offline. And every replica you spin up to handle reads is also a full copy of every byte: you pay for that storage many times.
Sharding is the answer to this very specific problem. Take the one big database and split it horizontally — every row still lives in exactly one place, but rows for different customers (or different IDs, or different time-windows) live on different machines. Eight servers each hold one billion rows. Eight servers each fit in RAM. A failure on shard three takes out one eighth of traffic instead of all of it. Adding capacity becomes a matter of adding shards. The catch is that any operation that needs to read or write across more than one shard now has to do extra work, because the database engine no longer sees the full table on one machine.
The choice of how to split is the load-bearing decision. Range sharding (rows 1–1M on shard A, 1M–2M on shard B) makes ordered scans easy but tends to put all new writes on the last shard. Hash sharding (hash the user ID, mod by shard count) spreads load evenly but makes range queries hit every shard. Directory sharding keeps an explicit "user X lives on shard Y" lookup table — flexible, but the table itself becomes a hot dependency. The simulator above lets you watch all three on the same dataset; the bars show how lopsided the load gets when the strategy is mismatched to the access pattern.
Concretely: a single Postgres on a beefy box handles maybe 50–80k QPS of mixed OLTP. Eight well-sharded Postgres instances handle 400k. Eight thousand Cassandra nodes handle tens of millions. The technique scales with the count, not with the size of one machine. The price is paid in operational complexity: backups, schema migrations, cross-shard joins, and especially resharding all become projects in their own right. Spend the rest of this page learning what those projects look like before they show up in your queue.
Sharding strategies — range, hash, and directory partitioning
Range, hash, directory.
Range sharding assigns each row to a shard based on where its sort-key value falls in a sequence of partition boundaries. Customer IDs 1–100 000 to shard 0, 100 001–200 000 to shard 1, and so on. The merit is that range scans on the partition key remain efficient — a query for "all rows between X and Y" hits at most a contiguous handful of shards. The hazard is hot spots: an auto-incrementing primary key always inserts into the highest-numbered shard, leaving everything else cold. HBase, Google Bigtable, CockroachDB, and YugabyteDB all use range sharding by default; CockroachDB additionally splits ranges automatically once a range exceeds 512 MB.
Hash sharding assigns each row to a shard by hashing the partition key and taking the result modulo the shard count, or by mapping it onto a hash ring. The distribution is statistically uniform — across a population of millions of keys you expect each shard to receive within a few percent of an equal share. The cost is range-scan amplification: any query that needs ordered output by partition key must hit every shard. Apache Cassandra (originally Facebook 2008, later Apache, now also DataStax) uses a Murmur3-based hash ring with virtual nodes (vnodes); DynamoDB (Amazon, public since 2012) uses MD5 truncated to 64 bits over a partition key; Vitess (YouTube, open-sourced 2012, the engine behind PlanetScale) uses configurable vindexes that include hash, lookup, numeric, and unicode-loose-md5 variants. ScyllaDB rewrites Cassandra in C++ on the same protocol; Riak (Basho, now defunct) used consistent hashing throughout its lifetime.
Directory-based sharding stores an explicit lookup table that maps each key (or each tenant, or each region) to a shard ID. The advantage is total flexibility — you can co-locate related rows, bias toward geographic affinity, or rebalance one tenant at a time without touching others. The cost is a hot dependency: the directory itself becomes a single point of failure and a per-query lookup. Foursquare's 11-hour outage on October 4, 2010, traced to a MongoDB shard exceeding RAM and the routing layer cascading, became the canonical cautionary tale; Slack's 2017 directory-based-tenant sharding writeup by Bing Wei is a cleaner example of the pattern done well.
| Strategy | Routing | Wins at | Loses at |
|---|---|---|---|
| Range | Sort key in [a, b) | Range scans, time-series. | Monotonic IDs (last shard hot). |
| Hash | hash(key) mod N or ring | Even distribution, point lookups. | Ordered scans (rows scattered). |
| Directory | Lookup table (key → shard) | Co-location, multi-tenant geo. | Directory itself is a system. |
Origins of sharding — older than the term itself
Older than the term itself.
Sharding is the partitioning of a logical dataset across multiple physical databases so each machine holds a slice. The technique is older than the term. IBM's mainframe shops in the 1970s split customer ledgers across machines by customer ID range; Tandem's NonStop SQL (1987) was an early commercial system designed from inception to partition tables across nodes; Teradata (founded 1979) built its entire data-warehouse business on hash-partitioning fact tables. The web-scale era did not invent sharding; it inherited it, then renamed it.
The pressure to shard arrives along three independent axes. Storage: a single Postgres or MySQL instance comfortably holds tens of terabytes today, but when a table grows past tens of billions of rows the index pages no longer fit in RAM and every miss becomes a disk seek. Throughput: a single primary tops out at roughly 50–100k QPS for write-heavy workloads, far less for analytical queries; sharding distributes the write rate. Blast radius: a single primary is a single failure domain. Splitting one table across 32 shards means a hardware failure or runaway query takes out 1/32 of the workload, not all of it.
Sharding always trades simplicity for scale. A non-sharded database supports any join, any transaction, any index, any query plan, with familiar semantics. A sharded database supports the queries that fit its partitioning scheme cheaply and pays a 10–100× cost for queries that don't. The choice of sharding strategy is, in effect, a choice about which queries you intend to optimise and which you accept will become slow or impossible.
The literature acknowledges this directly. The CAP theorem (Brewer 2000, formalised by Gilbert and Lynch 2002) frames the underlying tension between consistency, availability, and partition tolerance. Pat Helland's 2007 paper Life Beyond Distributed Transactions at CIDR is the canonical text for what changes when you give up cross-shard ACID. Every modern sharded system is, at heart, an operational answer to Helland's question: what do you do with the queries that no longer fit in one place.
The hot shard problem — hash distributes keys, not traffic
Hash distributes keys. It does not distribute traffic.
Hash sharding gives perfect key distribution. It says nothing about access frequency. If 1% of your customers generate 50% of the traffic, the shards holding those customers run hot while the others sit idle. Worse, hash sharding hides this: the keys look uniform, so dashboards built on row counts may not show the imbalance until a shard topples and the on-call channel lights up. The literature calls this the celebrity problem, after the canonical example of one Twitter user with 200 million followers whose row generates more read traffic than entire shards' worth of normal users combined.
Mitigations come in three layers. First, read-aside caching: a Memcached or Redis layer in front of the hot shard absorbs the read multiplier. Twitter's Manhattan architecture (2014) moves hot timelines into dedicated cache fleets entirely, breaking the hot-key dependency on the underlying database. Second, key salting (sometimes called write fan-out): split a hot logical key into N physical sub-keys (hot_key:0, hot_key:1, ..., hot_key:N-1), distribute writes across them, and aggregate at read time. Useful for counter writes, vote tallies, view counts. Third, per-key replication: place the hot key on multiple replicas and route reads round-robin. Cassandra's read repair and DynamoDB's adaptive capacity automate variants of this.
Detection matters as much as mitigation. Discord's 2017 engineering blog described how a single guild with two million members caused a chain of replica failures because the team had not instrumented per-shard tail latency. The fix involved sampling slow queries to identify the offending key, then rewriting the schema to bucket guild messages by (guild_id, time-window) instead of (guild_id) alone — converting one hot key into many cooler ones. Discord's later 2023 post about migrating their trillions-of-messages workload from Cassandra to ScyllaDB is essentially the same playbook applied at one more order of magnitude.
Hot keys are not exclusively a hashing problem. Range-sharded databases hit them whenever the application's natural sort order produces clustered access — auto-incrementing IDs, timestamps, geographic codes that share a prefix. CockroachDB and Spanner mitigate this with periodic load-based splits that subdivide a range whose CPU usage exceeds threshold, regardless of its key distribution. The technique works but adds a feedback loop the operator must understand to debug.
Hash sharding gives perfect key distribution. Traffic is a different distribution. Per-shard tail latency is the metric you must alert on; row count is not. Discord, Twitter, and Twitch all wrote postmortems whose root cause was "we measured rows, not requests-per-row".
Resharding — the hardest operation, plan for it on day one
The hardest operation. Plan for it on day one.
Shard count rarely stays the same forever. Adding shards naively under modulo hashing reshuffles every key — a change from N to N+1 shards moves roughly N/(N+1) of all rows. Consistent hashing (Karger, Lehman, Leighton et al, STOC 1997, originally for web cache routing at Akamai) reduces the move set to about K/N rows where K is the dataset size. Jump consistent hash (Lamping and Veach, Google, 2014) achieves the same bound in seven lines of code with no auxiliary data structure; it is the algorithm used inside Spanner's tablet routing.
Most large production systems do not rely on hashing alone. They use logical shards (also called vshards, virtual shards, or tablets): pick a large fixed count — 1024 is conventional — of logical buckets, hash keys into them at write time, and maintain a separate table mapping logical bucket to physical machine. Adding a machine becomes a vshard reassignment plus a row migration; the hash never changes. Vitess calls these resharding workflows, runs them with a tool called vstreamer, and double-writes to old and new shards during cutover. PlanetScale exposes the same primitive as branch-and-merge for schema migrations.
The sequence of an online reshard is the same in every system. Stage one: provision the new shard, replicate from the source, double-write incoming traffic to both old and new. Stage two: catch the new shard up via change-data-capture (Debezium, Maxwell, Vitess VStream, or the database's own logical replication) until lag is below a chosen threshold. Stage three: cut reads over to the new shard for a sample of traffic; observe; widen the cutover. Stage four: decommission the old shard once writes have stopped and replication has drained.
Schema migrations on a single shard share the same online flavour. Tools like gh-ost (GitHub's online schema migrator, open-sourced 2016) and pt-online-schema-change (Percona) build a shadow table, copy rows in the background, replay binlog deltas, and atomic-rename. Vitess and Postgres pg_repack do similar tricks. The combination — gh-ost for column changes, vstreamer for shard count changes — is the modern operator's toolkit. The alternative (downtime, dump, reload) becomes increasingly unacceptable above a few hundred GB.
The operations team must also manage backfill: when a shard split runs, every row of the source shard must be physically copied to the destination. At terabyte scale this can take days, during which the cluster runs hotter than steady state. Most teams pace the backfill with a token bucket, watching replication lag and CPU headroom; Vitess exposes a --max_replication_lag flag, CockroachDB's SPLIT AT and SCATTER commands let an operator pre-split and pre-distribute ranges before the load arrives. The single most important operational principle is to make the migration interruptible — if you cannot pause or roll back at any stage, you cannot afford to run it.
Cross-shard transactions and joins — what you give up
Transactions that cross the boundary.
The single biggest cost of sharding is what happens when a query or transaction needs to touch more than one shard. A single-shard read or write is functionally identical to an unsharded database. A two-shard transaction needs two-phase commit (2PC) or one of its descendants: prepare on every participant, commit only if all said yes, abort otherwise. 2PC is correct but blocking — a coordinator failure between prepare and commit leaves participants holding locks, sometimes for minutes. Spanner solved this in 2012 (Corbett et al, OSDI) with synchronised TrueTime clocks and Paxos-replicated participants, but TrueTime requires GPS antennas and atomic clocks in every datacenter. Most systems cannot afford that infrastructure.
The widely-deployed alternative is the Saga pattern (Garcia-Molina and Salem, SIGMOD 1987, rediscovered by the microservices community around 2015). Decompose a long-running transaction into a sequence of local transactions, each with a compensating action. If step three fails, run the inverse of step two and the inverse of step one. The result is eventually consistent rather than ACID, and the application has to be designed to tolerate intermediate states — but the operational profile is benign.
Calvin (Thomson, Diamond, Weng et al, SIGMOD 2012) takes a third approach: pre-order all transactions globally before they run, then execute them deterministically across shards. The technique appears commercially in FaunaDB and influenced FoundationDB's transactional layer (Apple, originally KV-only, acquired and open-sourced 2018). FoundationDB powers iCloud's underlying storage and Snowflake's metadata service.
Cross-shard joins follow analogous patterns. Co-location shards related tables on the same axis (orders and order_items both keyed by customer_id) so joins remain single-shard. Replication duplicates a small dimension table to every shard so joins to it become local. Scatter-gather queries every shard in parallel and merges results in the application — fine for ten shards, painful at a thousand. Citus (Postgres extension, acquired by Microsoft 2019, now Azure Cosmos DB for PostgreSQL) automates all three patterns; ElasticSearch's index-and-shard model uses scatter-gather by default and exposes routing to force co-location when the access pattern allows.
Idempotency becomes a load-bearing concern in any sharded system. A retry that lands on a different replica, a saga compensation that runs twice, a 2PC participant that recovers from a crash mid-protocol — all force the application to handle "this operation may have already happened" gracefully. Stripe's idempotency-key pattern, GitHub's similar header convention, and the broader REST community's drift toward idempotent endpoints exist precisely because cross-shard infrastructure cannot promise exactly-once delivery. The cleanest production designs treat every cross-shard write as if it might be retried indefinitely, and write the corresponding logic into the service layer.
Sharding in production — Vitess, Citus, MongoDB, DynamoDB
Who shards how, at what scale.
Pinterest's MySQL sharding (engineering blog 2015, expanded 2020) sits 8192 logical shards across roughly 100 physical MySQL primaries. Every primary key is a 64-bit integer whose top bits encode the shard, so the application can route without a lookup. Cross-shard joins are forbidden by convention; the schema is denormalised aggressively. The system has handled hundreds of billions of pins for a decade with no fundamental architectural change.
Instagram's IDs (engineering blog 2012) embed the shard ID, a millisecond timestamp, and a per-shard sequence into a single 64-bit value, making every primary key self-routing. The format has been copied verbatim by enough teams that it has become a community pattern under the name "snowflake IDs", borrowed from Twitter's similar 2010 design.
Discord's message store migrated from MongoDB to Cassandra in 2017, then from Cassandra to ScyllaDB in 2023 (engineering blog by Bo Ingram). The story they published is unusually candid about hot-shard pain: a single very-active guild concentrated enough write traffic to repeatedly trigger Cassandra GC pauses; the migration to ScyllaDB (C++, no GC) plus a re-keying that bucketed messages by (channel, time-window) flattened the load. The new system stores trillions of messages and serves single-digit-ms p99 reads.
Vitess, the engine inside YouTube's MySQL fleet since 2010 and the basis of PlanetScale's commercial product, has been the reference implementation for online resharding since its 2012 open-source release. It currently handles billions of QPS across YouTube and powers production workloads at Slack, Square, GitHub, and Etsy. CockroachDB (Cockroach Labs, founded by ex-Googlers in 2015) implements the Spanner architecture without TrueTime, using hybrid logical clocks. Citus turns Postgres into a shard-aware cluster with familiar SQL semantics. Spanner remains the most ambitious of the lot: globally consistent, externally consistent transactions, real GPS-and-atomic-clock infrastructure, deployed across every Google region.
# Vitess vschema.json — sharded keyspace
{
"sharded": true,
"vindexes": {
"hash": { "type": "hash" },
"user_md5": { "type": "unicode_loose_md5" }
},
"tables": {
"users": { "column_vindexes": [{ "column": "id", "name": "hash" }] },
"orders": { "column_vindexes": [{ "column": "user_id", "name": "hash" }] }
}
}
-- Citus (Postgres) equivalent:
SELECT create_distributed_table('users', 'id');
SELECT create_distributed_table('orders', 'user_id', colocate_with => 'users');When NOT to shard — vertical scaling and read replicas first
Vertical scaling first.
The honest answer to "should we shard?" is "probably not yet". Modern hardware has moved the threshold higher than most architects intuit. A single AWS r7i.48xlarge or GCP m3-megamem-128 has 1.5+ TB of RAM and 192 vCPUs; Postgres 16 on that host comfortably handles 50–80k QPS for mixed OLTP at sub-ms p99 with no replication. Aurora, Cloud SQL, AlloyDB, Neon, and Supabase all push the same envelope without requiring application-level sharding. Read replicas multiply read capacity; writes remain single-primary, but the headroom is enough for most products through their first $100M of revenue.
Premature sharding has a notable failure mode. A team shards because they expect to grow, builds an application around the shard key choice, then discovers the access pattern that actually emerged is not the one they sharded for. Re-sharding a live system is one of the most expensive operations in engineering; doing it because you sharded before you needed to is a waste of two years.
The corollary: when sharding does become necessary, choose the shard key by surveying actual production query patterns, not by guessing. Tools like pg_stat_statements, MySQL's performance_schema, and APM tools (Datadog, New Relic, Honeycomb) reveal which queries dominate the workload. The shard key should be the join key those queries use most. Sharding by user_id when most queries are by tenant_id is the canonical mistake; the symptom is a sharded database whose performance is worse than the unsharded version it replaced.
Adjacent technologies sometimes substitute for sharding entirely. Distributed SQL (Spanner, CockroachDB, Yugabyte, TiDB) gives you ACID across nodes at the cost of higher write latency. Columnar warehouses (Snowflake, BigQuery, ClickHouse, Redshift) handle analytical workloads sharding-free for the user — they shard internally, but the operator does not see it. Search engines (Elasticsearch, OpenSearch, Solr) are sharded by design and a perfectly reasonable substrate for some OLTP-shaped workloads. The right answer is sometimes "use a different system that has already solved this", not "shard your existing one".
One last principle worth internalising: sharding is rarely reversible. Once an application has been written to assume row co-location, sharded indexes, and per-shard transactions, walking it back to a single primary is a bigger project than the original split. Choose the cutover point deliberately. Document the shard key, the rationale, and the queries it optimises in a place future maintainers will find. Build a runbook for adding a shard before you need to, and exercise it annually on a non-production replica — the worst time to discover a tooling gap is during a capacity incident at three in the morning.
Further reading on database sharding
Primary sources, in order.
- Corbett et al · OSDI 2012Spanner: Google's Globally Distributed DatabaseThe TrueTime paper. Synchronised clocks, Paxos groups, externally-consistent transactions across continents.
- Thomson et al · SIGMOD 2012Calvin: Fast Distributed Transactions for Partitioned Database SystemsThe deterministic-ordering alternative to 2PC. The intellectual ancestor of FaunaDB and a useful counterpoint to Spanner.
- Karger et al · STOC 1997Consistent Hashing and Random TreesThe original consistent-hashing paper, written for Akamai's web-cache routing problem. Foundational.
- Vitess docsSharding in VitessYouTube's open-sourced MySQL sharding layer. Documents what online resharding actually looks like — vstreams, vindexes, cutover.
- Discord Engineering · 2023How Discord stores trillions of messagesThe Cassandra-to-ScyllaDB migration writeup. Honest about hot keys, GC pressure, and the rekey that flattened the load.
- Pinterest Engineering · 2015Sharding Pinterest: How we scaled our MySQL fleetThe 8192-logical-shards design that has held up for a decade with no architectural change.
- Instagram Engineering · 2012Sharding & IDs at InstagramThe 64-bit self-routing ID design every team copies. Shard, timestamp, sequence — all in one integer.
- Martin Kleppmann · bookDesigning Data-Intensive ApplicationsChapter 6 ("Partitioning") is the standard textbook treatment of sharding strategies, rebalancing, and routing. Reads well as student-friendly prose.
- Alex Petrov · bookDatabase InternalsPart II covers distributed-database fundamentals — partitioning, replication, consistency models — at a depth that pairs well with DDIA.
- Aphyr / JepsenCall Me Maybe — testing distributed systems under partitionKyle Kingsbury's series on real consistency bugs in real shipped databases. Required reading for anyone running sharded storage in production.
- Marc Brooker · AWSMarc's BlogLong-running essays on distributed-systems trade-offs from a senior AWS principal engineer. Search for "shuffle sharding" and "multi-tenant" for sharding-adjacent reading.
- Semicolony simulatorConsistent hashingThe math that makes resharding less catastrophic. Watch keys move when the ring shrinks.
- Semicolony guideDatabase indexingB-tree, hash, covering indexes — the local-shard primitives sharding builds on.