15 / 20
Topics / 15

Sharding and partitioning

One machine's RAM, disk, and CPU run out long before your dataset does. Sharding splits the data across N machines so each holds roughly 1/N. The price: every query that crosses shards pays a network round-trip and coordination cost, and one shard always seems to be hotter than the others.


Why shard at all

A single machine has a ceiling. Modern boxes top out around 1–2 TB of RAM, a few hundred thousand IOPS on NVMe, and tens of cores. Datasets routinely outgrow all three: Facebook's social graph, Stripe's ledger, Discord's message store. When vertical scaling stops being affordable or possible, the data has to live on more than one box.

Sharding (also called partitioning) is the answer: split the keyspace across N machines so each holds roughly 1/N of the data. You pay for it on every operation. A single-shard query stays as fast as before. A cross-shard query pays a network round-trip plus the cost of merging results. The whole game is picking a scheme where the queries you actually run stay single-shard most of the time.

Range partitioning

Data is sorted by key, and contiguous ranges are assigned to shards. Shard 1 holds keys a..f, shard 2 holds g..m, and so on. This is what CockroachDB, HBase, BigTable, and Spanner do underneath.

Range scans are cheap because they're sequential. A query for "all orders between yesterday and today" hits one shard and reads contiguous disk blocks. The downside is hot keys. A single trending row, or a write pattern that always appends to the end of the keyspace (timestamps, auto-increment IDs), can saturate one shard while every other shard sits idle. Most range-partitioned systems handle this by splitting ranges when they get too large or too hot. CockroachDB splits when a range reaches 512 MB or its QPS crosses a threshold.

Hash partitioning

The key is hashed (typically MD5 or MurmurHash), and the hash space is split across shards. Cassandra, DynamoDB, and Postgres-partman in hash mode all use this. Distribution is uniform almost by construction: even sequential writes scatter across the cluster because their hashes are pseudo-random.

Range scans pay the cost. "All orders between yesterday and today" now has to fan out to every shard, because adjacent keys land on unrelated machines. With 64 shards that's 64 round-trips for a query that range partitioning would have served from one. Hash partitioning is the right call when access is mostly by primary key and range scans are rare.

Range vs hash, at a glance. Range wins for time-series, log-style data, and any workload where you scan adjacent keys. Hash wins for point lookups by primary key, when write-skew is the worry, and for caches. Most production systems use both. Cassandra lets you compose a partition key (hash) with a clustering key (range within the partition).

Consistent hashing

Naive hash partitioning has a problem: change N from 8 to 9 and almost every key re-maps. Karger et al., 1997 fixed this with a ring. Hash both keys and nodes onto a circular space (0 to 232). Each key is owned by the next node clockwise on the ring. Adding a node steals only the slice between itself and its predecessor; removing a node gives its slice to its successor. Roughly K/N keys move, not K keys, which is the whole point.

In practice every physical node owns many "virtual nodes" on the ring so the slices come out more evenly. Cassandra defaults to 256 vnodes per node; Riak and memcached use the same trick. The Dynamo paper popularised the pattern — see the Dynamo deep dive for the production-grade variant with sloppy quorum and hinted handoff.

# <a href="/how-it-works/load-balancing">Consistent hashing</a> in eight lines
ring = sorted([(hash(node + ":" + str(v)), node)
               for node in nodes
               for v in range(VNODES)])

def lookup(key):
    h = hash(key)
    # first ring entry with hash >= h, wrap around if needed
    i = bisect_left(ring, (h, ""))
    return ring[i % len(ring)][1]

Rendezvous (highest-random-weight) hashing

Thaler & Ravishankar, 1996. For each key, compute a score hash(key, node) against every node; the highest-scoring node wins. No ring, no virtual nodes, no sorted structure to maintain. Add or remove a node and only the keys whose winner changed have to move. That still works out to K/N on average, but with less variance than vanilla consistent hashing.

The cost is O(N) per lookup instead of O(log N), which is fine when N is in the tens or low hundreds. Akamai's edge CDN uses rendezvous to map URLs to cache servers; some software load balancers (HAProxy with the hash-type consistent variant, Envoy's ring-hash policy) offer it as an option alongside consistent hashing.

Directory-based sharding

A lookup service holds an explicit map from key (or key range) to shard. This is the most flexible option: any key can live anywhere, and you can move a single tenant from shard 3 to shard 7 without touching anything else. The cost is a network hop on every operation, plus the work of running and replicating the directory itself.

Vitess (the sharding layer underneath YouTube's and Slack's MySQL) takes this approach with its VSchema and topology service. Twitter's Manhattan uses a similar directory for tenant-to-cluster mapping. Many game backends pin players to shards via a directory so a player and all their save state move together when a shard fills up.

Hierarchical sharding

Shard first by tenant ID, then by entity within the tenant. F1 and Spanner's interleaved tables let you declare Orders INTERLEAVE IN PARENT Customers — a customer and all their orders share a key prefix and live in the same shard. See the F1 deep dive for how this composes with Spanner's TrueTime.

The payoff is that the queries that matter most — "give me everything for this customer" — never cross shards. The risk is the same as range partitioning at a coarser grain: a whale tenant that doesn't fit on a single shard breaks the model and has to be split out by hand.

Resharding, the hardest operations problem

The naive approach (stop the world, dump every shard, repartition, restore) works for a side project and falls over for anything taking live traffic. Production systems reshard online, with reads and writes flowing the whole time. Four patterns dominate:

  • Live resharding with proxy redirection (Vitess). A proxy in front of the database dual-writes to both old and new shards during cutover, then reads flip once the new shard has caught up. VReplication streams the change log in the background until the new shard is consistent.
  • Logical and physical split (CockroachDB). Ranges split automatically when they reach 512 MB or get too hot. The split is a metadata operation first; the data follows lazily. Replicas can then be rebalanced across nodes by the allocator.
  • Token-range migration with streaming (Cassandra). When a node joins, it claims a slice of the ring and streams data from its neighbours. Old owners keep serving reads until the stream completes, then drop the data.
  • Read-from-old / write-to-new with backfill (custom microservices). The application writes to both stores, reads from the old store, and a backfill job copies historical data into the new one. Once they match, reads flip. Painful to get right but very general.
The reason K/N matters. Vanilla hash partitioning moves nearly every key when N changes — at petabyte scale that's months of network bandwidth and disk I/O. Consistent hashing and rendezvous cut this to K/N (roughly 1/9th of keys when going from 8 to 9 nodes), which is the difference between "reshard in an afternoon" and "reshard never".

Hot shards and hot keys

Some keys are just hotter than others: Justin Bieber's follower list, the Reddit front-page post, the TikTok For You feed's top video. Spotting this in production usually means per-key heat metrics (top-N requests by key, sampled) and distributed tracing that tags hot ranges. Once you know which key is hot, your options:

  • Key suffix randomisation. Split post:42 into post:42:0 through post:42:15, write to a random suffix, and read by fanning out. Trades a 16x read fanout for a 16x write spread.
  • Vertical splitting. A single hot row — say, a counter — becomes 64 rows that get summed on read. Cassandra's counter columns and DynamoDB's write-sharded counters work this way.
  • Read-through cache. A hot read key is served from Redis or a CDN before touching the database. Works only for read-hot, not write-hot.
  • Queueing in front of the hot shard. An admission queue rate-limits writes to the hot key so the shard never falls over, at the cost of write latency for the unlucky users.

Cross-shard transactions

Any transaction touching more than one shard has three choices, none of them free:

  • Two-phase commit. A coordinator asks every shard to prepare, then to commit. Correct but slow (two round-trips, plus blocking on coordinator failure) and brittle under network partitions.
  • Global ordering (Spanner-style). TrueTime gives every transaction a timestamp that's externally consistent across shards. Expensive infrastructure (GPS + atomic clocks) but the cleanest semantics.
  • Eventual consistency between shards. Each shard commits locally, background reconciliation handles the cross-shard invariant. Cheap and fast, but the application has to tolerate windows where the two shards disagree.

Most production systems avoid the choice by co-locating related data. F1's interleaved tables, Vitess's keyspace IDs, DynamoDB's partition-key-plus-sort-key are all schema decisions that keep the transactions you actually run on one shard. The schema is the shard-routing decision.

Sharding schemes compared

SchemeRebalance costRange scansHot-key resistanceCross-shard queries
RangeLow (move ranges)CheapPoorFew if keys cluster by query pattern
HashHigh (K keys move on resize)CatastrophicGoodFrequent for any non-key query
Consistent hashK/N keys moveCatastrophicGood (with vnodes)Same as hash
RendezvousK/N keys move, lower varianceCatastrophicGoodSame as hash
DirectoryPer-key, arbitraryDepends on layoutExcellent (move hot keys)One extra hop per op
HierarchicalPer-tenantCheap within tenantWhale-tenant riskRare if scoped to tenant

Further reading

Found this useful?