Consistent Hashing Simulator
Consistent hashing maps both keys and nodes onto a ring so that adding or removing a node moves only about K/N of the keys, not all of them. Modulo hashing reshuffles every key when N changes; this reshuffles only the keys that fall in the new node's arc.
Consistent hashing is a technique for distributing keys across a changing set of nodes such that adding or removing a node only requires moving a small fraction (roughly K/N) of keys, rather than rehashing everything. David Karger and colleagues introduced it in a 1997 STOC paper for distributed web caches. It is the placement scheme behind Amazon DynamoDB, Apache Cassandra, Riak, the memcached ketama client, and most production CDN edge maps. The simulator below lets you push on the ring; the long-form reference covers virtual nodes, jump hash, rendezvous hashing, and the production trade-offs.
Updated · 25 min read · 4,200 words.
The circle is the hash space wrapped end to end. Coloured ticks are virtual nodes — each physical server scattered to many ring positions — and the black dashed spoke points to your lookup key. To find its owner you walk clockwise to the first tick; the faint arc behind the ring is exactly that owned segment, and the centre label names the node. The line at the bottom shows what plain hash-mod-N would have picked instead, and the bars below sample 5,000 keys to show how evenly the load actually splits.
Set virtuals to 1 and add a few nodes: the distribution bars come out lopsided, some node hogging twice its share. Crank virtuals to 100 and they flatten toward the ideal. Then watch the "Last move" counter when you add a node — it lands near 1/(N+1) of all keys, not the near-total reshuffle modulo would cause. That gap is the whole point: compare the ring owner against the modulo owner in the readout and you can see most keys keep their home while only the new node's arc changes hands.
Why hash-mod-N rebalancing fails when N changes
Hash mod N falls apart the first time a node joins.
Consistent hashing is a technique for distributing keys across a changing set of nodes such that adding or removing one node only requires moving roughly K/N keys, rather than rehashing everything. David Karger and colleagues introduced it in a 1997 STOC paper; it is the placement scheme behind Amazon DynamoDB, Apache Cassandra, Riak, the memcached ketama client, Maglev, and most production CDN edge maps. The narrative below works through why the technique exists, starting with the dumb baseline that was the industry default beforehand.
Imagine you run a cache fleet. You have three cache servers and a million keys. The natural way to decide which server holds which key is to hash the key and take the result modulo three. hash("user:42") % 3 returns 0, so the value lives on server zero. Simple, fast, and uniform: each server holds roughly a third of the keys.
Now you add a fourth server. The formula becomes hash(key) % 4. The same hash("user:42") returns 2 instead of 0. The key has moved. Worse — almost every key has moved, because the modulo operation maps the entire hash space onto the new range. With three servers becoming four, roughly three quarters of your million keys end up on a different server than they were five seconds ago.
If your cache backs a database, every one of those displaced keys becomes a cache miss. Your database, which had been comfortably handling 5 % of incoming traffic, suddenly receives ~75 %. It buckles. The fleet falls over. You spend your evening explaining to a roomful of nervous colleagues why a routine capacity addition took the site down.
This is the original problem the consistent-hashing paper solves. The MIT group called it “Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web” — STOC 1997, Karger, Lehman, Leighton, Levine, Lewin, Panigrahy. Two of the authors went on to found Akamai. The technique has aged remarkably well: every distributed cache, every shared-nothing database, and most modern load balancers use a variant of it. The simulator above is the canonical version, with one improvement (virtual nodes) bolted on.
Concretely: the worst-case fraction of keys that change owner under modulo hashing when N becomes N+1 is N/(N+1), very close to 1. With consistent hashing it is 1/(N+1), very close to 0. For a fleet of 100 servers expanding to 101, modulo hashing rehashes ~99 % of the keyspace — a hundred-million-entry cache becomes ninety-nine million misses. Consistent hashing rehashes ~1 %. That difference is what made the technique ship in production rather than stay in a textbook.
How consistent hashing works — keys and nodes on a ring
Hash both keys and nodes onto the same circle.
The trick is one of perspective. Instead of computing a server number from the key, you map every server to a position on an imaginary circle, and every key to a position on the same circle. To find the server for a key, you start at the key's position and walk clockwise; the first server you encounter owns the key.
The circle is the hash output space — usually 0 .. 2³²−1 or 0 .. 2¹²⁸−1 — wrapped end to end so that 0 and the maximum sit next to each other. A 32-bit hash produces over four billion distinct points around the circle, so collisions are essentially impossible. To map server n1 to a position you simply hash its name: hash("n1") is the position. To map a key, the same: hash("user:42") is its position. Because the hash is uniform, both servers and keys are sprinkled around the ring without any obvious pattern.
Now the key insight: when you add a new server, it shows up as a new point on the ring. Only the keys whose position lies between the previous neighbour and the new server's position need to move. Every other key still walks clockwise to the same server it always did. Add one node out of three and you displace roughly one third of the keys. Add one out of a hundred and you displace roughly one percent. The disruption shrinks linearly with fleet size.
Look at the simulator. The faint coloured arc behind the ring is the segment owned by the current lookup key — that is, the arc clockwise of the previous tick, ending at the owning tick. Drag the key by editing the input. Add a node. Notice that most keys keep their owner; only the ones whose arc the new node bisects change hands.
The lookup itself is fast: O(log V) via binary search on the sorted ring, where V is the total number of (real or virtual) positions. With a thousand virtual entries that's ten comparisons; with a million it's twenty. The data structure to support this is a sorted array, a balanced tree, or in some implementations a skip list — all worst-case logarithmic, all fitting easily in cache for any reasonable cluster size. The constant-time variants (jump consistent hash, Maglev) trade flexibility for that last factor of speed.
Virtual nodes — why one ring position per server isn't enough
One position per server is enough in theory, painful in practice.
The plain ring has a load-balance problem. With three servers and three random points on the circle, the gaps between points are uneven — one server might end up owning a 50 % arc, another 30 %, the third 20 %. This is what statisticians call the gaps of a Poisson process: random uniform points cluster more than they spread, and you get long arcs and short arcs by luck. With ten servers the worst-case ratio is still about 5×; with a hundred, around 2×. That's not good enough for production.
The fix is to give each real server many positions on the ring instead of one. Every server is hashed under several names — n1:0, n1:1, n1:2, … n1:99 — and each hash drops a virtual point on the ring. With 100 virtual points per real server, three servers contribute 300 ticks. Any single server's share of the circle is now the sum of 100 random arcs, which thanks to the law of large numbers concentrates tightly around the average.
The simulator's "Virtuals" buttons let you set this number from 1 to 200. At 1 the distribution is wild — try it. At 5 it's still bumpy. At 50 the bars are visibly even. At 200 they're indistinguishable from a perfect 33/33/33 split. Production systems pick somewhere between 100 and 4096 virtual nodes per real node, depending on the hash table size they're willing to pay for. Cassandra defaults to 256. Dynamo's original paper used 128. Akamai used a different number and never published it.
The cost of more virtuals is mostly memory: the lookup table grows linearly. A million-entry table is fine for a per-machine cache; it can be too much for an embedded client like a memcached SDK that has to ship the whole ring with the binary. The right number is whatever keeps the load imbalance below your tolerance — usually 5 % — for the cluster size you expect to run at.
One subtle property of virtual nodes: the K/N bound applies to each individual virtual node, not to the real node as a whole. Adding a real node creates V new virtual entries, each of which steals roughly K/(V·N) keys from its clockwise neighbour. The total is the same K/N — but it is drawn from many different existing servers, not just one. That spreads the rebalancing IO across the fleet rather than concentrating it on whoever happened to be next to the new joiner. In a system that mass-streams data during rebalance (Cassandra, Riak), this is the difference between a rolling addition that takes minutes and one that pages the on-call.
The K/N rebalancing bound, proved
Why exactly K/N keys move when you add a node.
The headline claim — “adding a node displaces only K/N keys” — falls out of one observation. Each key occupies one point on the ring; each server's territory is the arc clockwise of its position up to the next server's position. If servers are uniformly distributed, then so are arc lengths in expectation, and the average arc covers 1/N of the ring. Adding one server splits exactly one existing arc into two. The keys in the half taken by the newcomer are the ones that move; the keys in the half kept by the previous owner stay. With K keys total, the expected count of movers is K · 1/N = K/N.
Removing a server is the mirror operation. Its arc disappears and the keys it owned all walk clockwise to the next server. Same expected count: K/N. Crucially the destination is deterministic — the next clockwise neighbour — so a system that tracks "who has my data" can recompute the answer locally, without coordination. Cassandra's gossip protocol and Riak's hand-off mechanism both depend on this property.
Compare the alternatives. Plain modulo hashing displaces about K(N−1)/N keys per change — for any N greater than two, that's essentially every key. Range partitioning (split the keyspace into ordered chunks) keeps neighbours together but forces a global rebalancing every time you split a hot range. Random partitioning (just pick a server and remember the choice) is fine until the metadata table itself becomes the bottleneck; this is what eventually happens with file systems built on flat-name spaces.
The K/N bound is also a defence against operator error. If somebody pages in at 3 a.m. and tries to bring a struggling node back, the worst case is one Nth of the keys briefly migrating to and from. The cluster degrades gracefully rather than catastrophically. This is much harder to engineer with stateful range partitioning, where reshuffling tends to require carefully scripted choreography.
One more piece of math you'll see referenced: the variance of the load. Even at K/N expected keys per node, the actual count fluctuates around the mean. With V virtual nodes per real, the standard deviation of any node's load is roughly σ ≈ √(K/(N·V)) / avg — meaning higher V tightens the distribution. Cassandra's choice of 256 vnodes was tuned so a 100-node cluster sees about a 6 % spread between the heaviest and lightest tokens. Smaller clusters benefit disproportionately from raising V; large ones can lower it without much risk.
Variants of consistent hashing — ring, jump hash, rendezvous, Maglev, multi-probe
Five recipes that all earn the name.
Classic ring (Karger 1997). The version above. Hash names to ring positions, walk clockwise, use virtual nodes. Used by Cassandra, Riak, memcached clients (ketama), Envoy's ring_hash load balancer.
Jump consistent hash (Lamping & Veach 2014). A seven-line algorithm that maps a key to a bucket in [0, N) using only a small amount of arithmetic — no ring data structure at all. The catch: it can only add or remove buckets at the highest-numbered end. It is the approach of choice when servers are anonymous and indistinguishable — used inside Google for sharded caches and lookup tables — but unsuitable when individual servers must be addressable by name.
Maglev hashing (Eisenbud et al 2016). Google's network load balancer assigns connections to backends by precomputing a lookup table with a permutation per backend. Each backend "claims" entries in the table in order of preference; the pattern is deterministic so two boxes with the same backend list build identical tables. Lookup is O(1) — one table read — and disruption when a backend leaves is bounded by the same K/N argument. Maglev now ships in Google Cloud's L4 load balancer; cilium's eBPF datapath implements it for in-kernel use.
Rendezvous (HRW) hashing (Thaler & Ravishankar 1996). For each key, compute hash(key ‖ node) against every node, and pick the node with the highest result. No ring at all. Easier to implement and more flexible than the ring (you can weight nodes simply by scaling their score), and the disruption per node change is still K/N. Slower than the ring on large fleets — O(N) per lookup vs O(log N) — but very popular for in-process sharding where N is small.
Bounded-load consistent hashing (Mirrokni et al 2018, used by Cloudflare). A modification to the classic ring that adds a per-node load cap: when a key is about to land on a node already at capacity, it spills over to the next clockwise node. Guarantees no node sees more than (1 + ε) times the average load, with the price that some keys are intentionally placed off-canonical. Cloudflare deployed this in 2016 and wrote about the experience.
Production systems using consistent hashing (DynamoDB, Cassandra, Memcached, Maglev)
Where the technique actually runs today.
Akamai (1998–). The original commercial deployment. Akamai's edge fleet — at one point over 350,000 servers across 4,000 networks — uses consistent hashing to route every request for a given URL to the same edge cache, regardless of where in the world the user is. Tom Leighton and Daniel Lewin took the algorithm directly from the STOC paper. Akamai still serves something on the order of 30 % of web traffic.
Amazon Dynamo (2007) and its descendants. Dynamo's SOSP 2007 paper popularised consistent hashing for sharding key-value stores. Dynamo gave rise to Cassandra (Apache, originally Facebook), Riak (Basho), Voldemort (LinkedIn), and Amazon's own DynamoDB service. Cassandra calls each virtual position a "token"; the default of 256 vnodes per node has been the recommended setting for over a decade.
Google's network stack. Maglev runs Google Cloud's L4 load balancer. Jump consistent hash is used internally for many sharded indexes. Both papers are open and have been widely re-implemented — eBPF projects like Cilium implement Maglev in-kernel for service-mesh sidecars.
Discord. Channel sharding to 4,000 Elixir nodes uses a consistent-hashed ring keyed on channel ID. The 2017 engineering post titled “How Discord Stores Billions of Messages” describes the migration from a single Cassandra to ScyllaDB; the partitioning scheme stayed the same.
Envoy and the service mesh. Envoy's ring_hash and maglev load-balancer policies expose the choice to operators. Both are used in production at Lyft, Stripe, and basically every Istio install in the wild.
Hot keys, sticky-key cardinality, and other consistent-hashing gotchas
The places it still surprises you.
Hot keys. Consistent hashing distributes keys evenly. It does not distribute load evenly. If 90 % of your traffic is for one Justin-Bieber-just-tweeted key, every request for that key lands on the single node that owns it. That node will die, and so will whoever is paged. The standard mitigations are tiered caches (a small in-memory cache in front of the ring), key salting (write the key under several derived names), or detecting the hot key with sketches like Count-Min and replicating only the offenders.
Hash function quality. The math assumes a uniform hash. CRC32 has known biases on short strings; MD5 is uniform but expensive; modern systems use MurmurHash3, xxHash, or FarmHash. Using a hash with a small output (say, 16 bits) caps the ring's resolution and silently introduces collisions. Always use ≥ 64 bits.
Sticky-key cardinality. The K/N bound assumes K is large. If you're sharding 50 keys across 100 nodes, individual nodes will own zero keys most of the time and the entire premise breaks. Consistent hashing earns its keep at scale — millions of keys, hundreds of nodes — and is overkill below.
Replication and the next-N rule. Most production systems replicate each key to the next R clockwise nodes, not just the first. This complicates the K/N argument: a node failure now reshuffles R·K/N keys (the lost replicas of every key the failed node held), not just K/N. Cassandra's read repair and Dynamo's hinted handoff are ways to amortise this.
Adversarial keys. If the keys come from an attacker (URL paths, public identifiers), a determined adversary can choose inputs that all hash into the same arc, overloading one node. The defence is either a keyed hash (HMAC with a server-side secret) or a hash function known to be resistant to such collisions. SHA-256 is the conservative choice; SipHash-2-4 is the fast option, used inside Linux's hash tables for exactly this reason.
When NOT to use consistent hashing
The cases where the ring is the wrong tool.
Range queries. Consistent hashing destroys ordering. SELECT * WHERE id BETWEEN 100 AND 200 requires touching every node, because the keys 100..200 are randomly scattered. Time-series databases, analytics warehouses, and anything with time-based scans should use range partitioning instead. CockroachDB and Spanner pair range partitioning with smart rebalancing for exactly this reason.
Strong cross-key transactions. If your operation must atomically touch multiple keys, those keys had better land on the same node. Consistent hashing offers no control over co-location. Systems like Spanner offer "interleaved tables" and "directories" precisely so the operator can declare what should stay together; the default ring offers no such hook.
Strict ordering. Append-only logs, event streams, financial ledgers — these need a single primary, often replicated by Raft or Paxos. A ring is the wrong abstraction. Kafka partitions by hash of the key (which is consistent-hashing-shaped), but order is preserved only within a partition. Global ordering across partitions cannot use consistent hashing at all.
Tiny clusters. Below about ten nodes, the load imbalance from random arc lengths is large enough that the bound “K/N keys move” is a coarse promise. Round-robin or rendezvous hashing is simpler at that scale. Consistent hashing's real wins kick in at hundreds of nodes and millions of keys.
Sticky session affinity. When you need a request to land on the same backend that handled the last one (long polling, WebSocket, anything stateful), the right tool is a session-affinity table indexed by cookie or client IP — not consistent hashing of the client's identity. The K/N rebalancing of consistent hashing means session continuity breaks the moment a node leaves; explicit session stores avoid that surprise.
Further reading on consistent hashing
Primary sources, in order.
- Karger et al · 1997Consistent Hashing and Random TreesThe STOC paper. Worth reading just for the theorem on the load-balanced expected case.
- DeCandia et al · 2007Dynamo: Amazon's Highly Available Key-value StoreThe SOSP paper that made consistent hashing the default sharding scheme in industry.
- Lamping & Veach · 2014A Fast, Minimal Memory, Consistent Hash AlgorithmJump consistent hash. Seven lines of code. Used inside Google for sharded caches.
- Eisenbud et al · 2016Maglev: A Fast and Reliable Software Network Load BalancerNSDI 2016. Google Cloud's L4 LB; Maglev hashing in production at planetary scale.
- Mirrokni et al · 2018Consistent Hashing with Bounded LoadsThe bounded-load variant. Guarantees no node gets more than (1 + ε) the average load.
- Cloudflare blogImproving load balancing with a new consistent-hashing algorithmCloudflare's production deployment of bounded-load hashing across their edge fleet.
- Discord engineering · 2017How Discord Stores Billions of MessagesChannel-sharded Cassandra (later ScyllaDB) with consistent hashing as the partitioner.
- Semicolony guideLoad balancingWhere consistent hashing fits — round-robin, least-conn, IP-hash, ring-hash.
- Semicolony guideCDN edgesAkamai's original use case — routing requests to the right edge cache.