Sharding and partitioning
One machine's RAM, disk, and CPU run out long before your dataset does. Sharding splits the data across N machines so each holds roughly 1/N. The price: every query that crosses shards pays a network round-trip and coordination cost, and one shard always seems to be hotter than the others.
Why shard at all
A single machine has a ceiling. Modern boxes top out around 1–2 TB of RAM, a few hundred thousand IOPS on NVMe, and tens of cores. Datasets routinely outgrow all three: Facebook's social graph, Stripe's ledger, Discord's message store. When vertical scaling stops being affordable or possible, the data has to live on more than one box.
Sharding (also called partitioning) is the answer: split the keyspace across N machines so each holds roughly 1/N of the data. You pay for it on every operation. A single-shard query stays as fast as before. A cross-shard query pays a network round-trip plus the cost of merging results. The whole game is picking a scheme where the queries you actually run stay single-shard most of the time.
Range partitioning
Data is sorted by key, and contiguous ranges are assigned to shards. Shard 1 holds keys
a..f, shard 2 holds g..m, and so on. This is what CockroachDB,
HBase, BigTable, and Spanner do underneath.
Range scans are cheap because they're sequential. A query for "all orders between yesterday and today" hits one shard and reads contiguous disk blocks. The downside is hot keys. A single trending row, or a write pattern that always appends to the end of the keyspace (timestamps, auto-increment IDs), can saturate one shard while every other shard sits idle. Most range-partitioned systems handle this by splitting ranges when they get too large or too hot. CockroachDB splits when a range reaches 512 MB or its QPS crosses a threshold.
Hash partitioning
The key is hashed (typically MD5 or MurmurHash), and the hash space is split across shards. Cassandra, DynamoDB, and Postgres-partman in hash mode all use this. Distribution is uniform almost by construction: even sequential writes scatter across the cluster because their hashes are pseudo-random.
Range scans pay the cost. "All orders between yesterday and today" now has to fan out to every shard, because adjacent keys land on unrelated machines. With 64 shards that's 64 round-trips for a query that range partitioning would have served from one. Hash partitioning is the right call when access is mostly by primary key and range scans are rare.
Consistent hashing
Naive hash partitioning has a problem: change N from 8 to 9 and almost every key re-maps. Karger et al., 1997 fixed this with a ring. Hash both keys and nodes onto a circular space (0 to 232). Each key is owned by the next node clockwise on the ring. Adding a node steals only the slice between itself and its predecessor; removing a node gives its slice to its successor. Roughly K/N keys move, not K keys, which is the whole point.
In practice every physical node owns many "virtual nodes" on the ring so the slices come out more evenly. Cassandra defaults to 256 vnodes per node; Riak and memcached use the same trick. The Dynamo paper popularised the pattern — see the Dynamo deep dive for the production-grade variant with sloppy quorum and hinted handoff.
# <a href="/how-it-works/load-balancing">Consistent hashing</a> in eight lines
ring = sorted([(hash(node + ":" + str(v)), node)
for node in nodes
for v in range(VNODES)])
def lookup(key):
h = hash(key)
# first ring entry with hash >= h, wrap around if needed
i = bisect_left(ring, (h, ""))
return ring[i % len(ring)][1]Rendezvous (highest-random-weight) hashing
Thaler & Ravishankar, 1996. For each key, compute a score hash(key, node)
against every node; the highest-scoring node wins. No ring, no virtual nodes, no sorted
structure to maintain. Add or remove a node and only the keys whose winner changed have
to move. That still works out to K/N on average, but with less variance than vanilla
consistent hashing.
The cost is O(N) per lookup instead of O(log N), which is fine when N is in the tens or
low hundreds. Akamai's edge CDN uses rendezvous to map URLs to cache servers; some
software load balancers (HAProxy with the hash-type consistent variant,
Envoy's ring-hash policy) offer it as an option alongside consistent hashing.
Directory-based sharding
A lookup service holds an explicit map from key (or key range) to shard. This is the most flexible option: any key can live anywhere, and you can move a single tenant from shard 3 to shard 7 without touching anything else. The cost is a network hop on every operation, plus the work of running and replicating the directory itself.
Vitess (the sharding layer underneath YouTube's and Slack's MySQL) takes this approach with its VSchema and topology service. Twitter's Manhattan uses a similar directory for tenant-to-cluster mapping. Many game backends pin players to shards via a directory so a player and all their save state move together when a shard fills up.
Hierarchical sharding
Shard first by tenant ID, then by entity within the tenant. F1 and Spanner's
interleaved tables let you declare Orders INTERLEAVE IN PARENT Customers —
a customer and all their orders share a key prefix and live in the same shard. See the
F1 deep dive for
how this composes with Spanner's TrueTime.
The payoff is that the queries that matter most — "give me everything for this customer" — never cross shards. The risk is the same as range partitioning at a coarser grain: a whale tenant that doesn't fit on a single shard breaks the model and has to be split out by hand.
Resharding, the hardest operations problem
The naive approach (stop the world, dump every shard, repartition, restore) works for a side project and falls over for anything taking live traffic. Production systems reshard online, with reads and writes flowing the whole time. Four patterns dominate:
- Live resharding with proxy redirection (Vitess). A proxy in front
of the database dual-writes to both old and new shards during cutover, then reads
flip once the new shard has caught up.
VReplicationstreams the change log in the background until the new shard is consistent. - Logical and physical split (CockroachDB). Ranges split automatically when they reach 512 MB or get too hot. The split is a metadata operation first; the data follows lazily. Replicas can then be rebalanced across nodes by the allocator.
- Token-range migration with streaming (Cassandra). When a node joins, it claims a slice of the ring and streams data from its neighbours. Old owners keep serving reads until the stream completes, then drop the data.
- Read-from-old / write-to-new with backfill (custom microservices). The application writes to both stores, reads from the old store, and a backfill job copies historical data into the new one. Once they match, reads flip. Painful to get right but very general.
Hot shards and hot keys
Some keys are just hotter than others: Justin Bieber's follower list, the Reddit front-page post, the TikTok For You feed's top video. Spotting this in production usually means per-key heat metrics (top-N requests by key, sampled) and distributed tracing that tags hot ranges. Once you know which key is hot, your options:
- Key suffix randomisation. Split
post:42intopost:42:0throughpost:42:15, write to a random suffix, and read by fanning out. Trades a 16x read fanout for a 16x write spread. - Vertical splitting. A single hot row — say, a counter — becomes 64 rows that get summed on read. Cassandra's counter columns and DynamoDB's write-sharded counters work this way.
- Read-through cache. A hot read key is served from Redis or a CDN before touching the database. Works only for read-hot, not write-hot.
- Queueing in front of the hot shard. An admission queue rate-limits writes to the hot key so the shard never falls over, at the cost of write latency for the unlucky users.
Cross-shard transactions
Any transaction touching more than one shard has three choices, none of them free:
- Two-phase commit. A coordinator asks every shard to prepare, then to commit. Correct but slow (two round-trips, plus blocking on coordinator failure) and brittle under network partitions.
- Global ordering (Spanner-style). TrueTime gives every transaction a timestamp that's externally consistent across shards. Expensive infrastructure (GPS + atomic clocks) but the cleanest semantics.
- Eventual consistency between shards. Each shard commits locally, background reconciliation handles the cross-shard invariant. Cheap and fast, but the application has to tolerate windows where the two shards disagree.
Most production systems avoid the choice by co-locating related data. F1's interleaved tables, Vitess's keyspace IDs, DynamoDB's partition-key-plus-sort-key are all schema decisions that keep the transactions you actually run on one shard. The schema is the shard-routing decision.
Sharding schemes compared
| Scheme | Rebalance cost | Range scans | Hot-key resistance | Cross-shard queries |
|---|---|---|---|---|
| Range | Low (move ranges) | Cheap | Poor | Few if keys cluster by query pattern |
| Hash | High (K keys move on resize) | Catastrophic | Good | Frequent for any non-key query |
| Consistent hash | K/N keys move | Catastrophic | Good (with vnodes) | Same as hash |
| Rendezvous | K/N keys move, lower variance | Catastrophic | Good | Same as hash |
| Directory | Per-key, arbitrary | Depends on layout | Excellent (move hot keys) | One extra hop per op |
| Hierarchical | Per-tenant | Cheap within tenant | Whale-tenant risk | Rare if scoped to tenant |
Further reading
- Karger et al. (1997) — Consistent Hashing and Random Trees — the original paper. Aimed at distributed caching for the web, generalised everywhere since.
- DeCandia et al. (2007) — Dynamo: Amazon's Highly Available Key-value Store — consistent hashing in anger, plus sloppy quorum, hinted handoff, and vector clocks.
- Corbett et al. (2012) — Spanner: Google's Globally-Distributed Database — Paxos per tablet, TrueTime for cross-shard ordering, externally consistent transactions.
- Shute et al. (2013) — F1: A Distributed SQL Database That Scales — interleaved tables and hierarchical sharding in production at Google Ads.
- Pedreira et al. (2020) — Vitess: A Database Clustering System for Horizontal Scaling of MySQL — the directory-based MySQL sharding system behind YouTube and Slack.
- CockroachDB — Scaling Raft and online range splits — how range splits, replica rebalancing, and Raft compose at petabyte scale.