Skip List Simulator: coin flips instead of rotations.

Linked lists, stacked in lanes, with promotions decided by random coin flips. Expected O(log n) for everything — and dramatically simpler to make concurrent than a balanced tree.

Nodes
0
Height
0
E[height]
0
Last cost
0

Probability p
Operate
Presets
Expected memory (at this p)
n / (1 − p) ≈ 0 total node slots
Lanes
lane 0 — empty — insert keys above —
node on search path found
Total nodes
0
expected slots ≈ n/(1−p) = 0
Actual height
0
expected = log_(1/p) n ≈ 0
Last search cost
0
steps right across all lanes
Trace
— quiet —

What you're looking at

Each column is one key, and how tall the column climbs is that key's level — the number of lanes it sits on. Lane 0 at the bottom holds every key in sorted order; higher lanes hold the random subset that won their coin flips, shown under each column as the H/H/T sequence. The stat panel compares the actual height against the expected log height, and tracks the cost of your last search in steps taken to the right.

Seed the 15 keys, then search a value that lives high up, like 17, and watch the highlighted cells trace a staircase: walk right along a sparse top lane until the next hop would overshoot, drop a lane, walk again, drop again, down to the hit at lane 0. Re-seed a few times and the tower shapes change every run, because levels come from coin flips, not balancing. The surprise is that this randomness still lands search cost near log n every time, with none of a balanced tree's rotations.


What is a skip list?

A probabilistic alternative to balanced trees.

A skip list is an ordered data structure built from a stack of linked lists. The bottom list holds every key in sorted order. Each higher list holds a random subset of the keys below — typically half. Search walks the top list until the next pointer would overshoot, drops to the next list down, walks again, drops again, all the way to the bottom. Expected search cost is O(log n); insert and delete are the same shape.

The trick is that the levelling is decided at insertion time, by a coin flip, not by global rebalancing. A new node flips a coin: heads, promote it to lane 1; flip again, heads again, promote to lane 2; the first tails stops the promotion. Independent flips, geometric distribution, no global structure to maintain. There is no rotation, no recolouring, no rebalancing cascade. Just splice the node into each of its lanes and you're done.

That simplicity is the whole point. The author of the original paper, William Pugh, wrote in 1990: "Skip lists are a data structure that can be used in place of balanced trees. Skip lists use probabilistic balancing rather than strictly enforced balancing and as a result the algorithms for insertion and deletion in skip lists are much simpler and significantly faster than equivalent algorithms for balanced trees." Three decades later that pitch still holds.

The simulator above shows a skip list with up to six lanes. Insert keys and watch which lanes each one occupies — the column under the key shows its level. Search animates the staircase descent: every cell highlighted is a step the algorithm took. Toggle the coin-flip display to see exactly which sequence (H H T, for example) gave each node its level.

SEARCH 22 — DESCEND LANE BY LANE lane 3 lane 2 lane 1 lane 0 3 7 12 17 19 22 25 31 41 53EXPECTED HEIGHT log_2 10 ≈ 4 · EXPECTED COST log_2 10 ≈ 4 STEPS

Origins — William Pugh, 1990

Most workloads don't need worst-case guarantees; they need simplicity.

The skip list was introduced by William Pugh in Skip Lists: A Probabilistic Alternative to Balanced Trees, published in Communications of the ACM in June 1990. Pugh was a young assistant professor at Maryland; the paper was eight pages, the algorithms were maybe forty lines of code, and the analysis fit in a handful of paragraphs. The contrast with the previous decade's parade of balanced-tree variants — red-black, AVL, weight-balanced, BB[α], 2-3-4, splay — was deliberate.

Pugh's central insight was sociological as much as algorithmic: most production workloads don't actually need a worst-case guarantee, they need a structure that's easy to implement correctly and easy to reason about. A red-black tree has fifteen cases for insertion, give or take, depending on which textbook you read. A skip list has three lines: flip coins to pick a level, splice into each lane's linked list, done. The same goes for delete. The savings in code complexity translate directly to fewer bugs in production.

The paper also made the case for skip lists on theoretical grounds. The variance of search cost is tightly bounded: the probability that a search takes more than c log n steps decays exponentially in c. In practice this means a billion-key skip list has search costs that hover within a few steps of the expected value, with the tail probability of seeing something significantly worse being astronomically small. Worst case isn't bounded, but for any workload that runs more than a handful of queries, the empirical worst case is indistinguishable from the expected case.

Pugh's paper also included a deterministic variant (with Munro, Papadakis, and Sedgewick) which trades some of the simplicity to achieve a worst-case bound. That variant exists in textbooks but rarely in production code; the probabilistic version's simplicity beat its deterministic cousin in every implementation that mattered.

The level distribution — coin flips and geometric series

Geometric distribution, expected 2n nodes at p=1/2.

Every inserted key gets a level drawn from a geometric distribution: keep flipping a biased coin (heads with probability p, tails with probability 1−p) and the level is the number of heads before the first tails, plus one. With p = 1/2 this means half of all keys live only on lane 0, a quarter live on lanes 0 and 1, an eighth on lanes 0 through 2, and so on. The expected number of nodes on lane k is n · p^k.

The total memory footprint is the sum of a geometric series: n + np + np² + np³ + ... = n / (1 − p). At p = 1/2 that's 2n total node slots across all lanes. At p = 1/4 it's 4n/3 ≈ 1.33n. Each "slot" is just a pointer plus the bookkeeping shared with the bottom-lane node.

The expected maximum level is log_(1/p) n. For a million keys at p = 1/2 that's roughly 20 levels; at p = 1/4 it's roughly 10. The maximum is rarely much higher than the expected — the tail of the geometric distribution drops off exponentially fast.

Implementations usually cap the maximum level at some constant (32 in LevelDB and RocksDB, 32 in Redis) to avoid pathological tails. The cap is high enough that exceeding it is vanishingly unlikely for any realistic dataset, and low enough that the per-node level-pointer array fits in a fixed allocation.

pE[height] for n=10⁶E[total nodes]E[search cost]
1/2≈ 202n~ 2·log₂ n ≈ 40
1/4≈ 10~1.33n~ 4·log₄ n ≈ 40
1/e ≈ 0.37≈ 14~1.58n~ e·ln n ≈ 38
1/8≈ 7~1.14n~ 8·log₈ n ≈ 56

1/e is the theoretical optimum for minimising total expected search cost. In practice nobody implements it because (a) 1/2 is easier to compute (one bit per flip), and (b) the difference is small enough to be irrelevant against constant-factor effects like cache behaviour.

Why probability 1/2 — and when 1/4 wins

Memory vs search cost trade-off.

The default choice in Pugh's paper is p = 1/2. It's the simplest to implement — one bit per flip — and it gives a clean total memory of 2n nodes. Search cost is roughly 2 log₂ n comparisons. Most implementations that don't care about constant-factor tuning use 1/2 and never look back.

Redis uses p = 1/4 in its sorted-set implementation (see src/t_zset.c in the Redis source). The reasoning is in the inline comments: at 1/4, the per-key memory overhead is only 4n/3 ≈ 1.33n node slots — a 33% saving over 1/2. Search cost goes up slightly (each lane covers more ground, so each individual lane-walk takes more steps on average) but the total step count is asymptotically the same. For Redis, where every sorted set is in RAM and memory is the binding constraint, the trade is correct.

LevelDB and RocksDB use p = 1/4 in their memtable implementations (db/skiplist.h). Same reasoning: memtables live in DRAM, and the memtable's memory footprint determines how often it has to flush to L0. Smaller per-key overhead means more keys per memtable means fewer flush events.

What about p = 1/e ≈ 0.37? Theoretically optimal for minimising expected total work, but nobody implements it because (a) computing it requires more than one coin flip per decision, (b) the savings are small enough to be lost in the noise of cache behaviour and pointer chasing, and (c) it's harder to reason about than nice rational numbers like 1/2 or 1/4.

A practical knob, not a theoretical one

Most implementations expose p only as a compile-time constant. If you're writing a skip list for production, default to 1/2 unless you have a measured reason to change it. If memory pressure is a top constraint and you've benchmarked, drop to 1/4. Don't tune below that without numbers.

Search, insert, delete — all the same staircase

Three operations, one descent.

Search starts at the head node on the highest occupied lane. Walk right: while the next pointer's key is less than (or equal to, depending on convention) the search target, advance. When the next pointer would overshoot, drop down one lane. Repeat. When you reach lane 0 and stop advancing, you're either at the target or the target isn't present. Total work: expected O(log n).

Insert is search plus a splice. While descending, remember the last node visited on each lane — these are the "update" pointers. Flip coins to choose the new node's level. Splice the new node into each of those lanes' linked lists by adjusting update[lane].next[lane]. The whole operation is constant work per level plus the descent: expected O(log n) total.

Delete is identical in shape. Search and record update pointers. If the target exists, unlink it from each of its lanes by re-routing the update pointers. The lanes' linked-list invariant is local — no cascading rebalance, no rotation, nothing.

The contrast with a balanced-tree implementation is stark. Red-black trees have around fifteen distinct cases for insertion (five basic, more for symmetric variants and re-colourings); delete is worse. Skip list insert is forty lines of clean code. That's the whole pitch — the implementation fits in your head, which means you can read it, debug it, and modify it without losing the thread.

# Skip list insert — the whole thing
def insert(skiplist, key):
 update = [None] * MAX_LEVEL
 node = skiplist.head
 for lvl in range(skiplist.level - 1, -1, -1):
  while node.next[lvl] and node.next[lvl].key < key:
   node = node.next[lvl]
  update[lvl] = node
 new_level = random_level()  # geometric coin flip
 new_node = Node(key, new_level)
 for lvl in range(new_level):
  new_node.next[lvl] = update[lvl].next[lvl]
  update[lvl].next[lvl] = new_node
 skiplist.level = max(skiplist.level, new_level)

Redis sorted sets — the canonical skip-list workload

O(log n) range queries, plus rank-by-score in one structure.

The Redis sorted set (ZSET) is one of the two non-trivial data structures Redis ships (the other being its hash table). It stores members ranked by a floating-point score, and supports ZADD, ZRANGE, ZRANGEBYSCORE, ZRANK, and the inverse-rank queries — all in expected O(log n). The implementation is in src/t_zset.c: a skip list ordered by score, paired with a hash table from member name to score for O(1) membership.

Why a skip list rather than a red-black tree? Salvatore Sanfilippo's original 2009 commit message gives the reasons: easier to implement correctly, easier to make safe under Redis' single-threaded event loop with copy-on-write fork for persistence, and the data structure ports cleanly to the on-disk RDB format. A skip list's lanes are independent linked lists; serialising and deserialising one is a flat traversal. A red-black tree's parent-pointer invariants are harder to round-trip through disk.

Sanfilippo also added a rank field to each node — the number of nodes in front of it on lane 0 — to support ZRANK in O(log n) without scanning the bottom lane. Each forward pointer carries a "span" indicating how many nodes it skips at lane 0; the rank is the sum of spans on the descent path. This was a Redis-specific extension not in Pugh's paper, but it slides into the skip-list framework with three extra lines of code per insert and delete.

The performance numbers are dependable. Redis sustains tens of thousands of ZADD operations per second on commodity hardware, with ZRANGE returning the top-N elements in microseconds. The leaderboard pattern — sorted set keyed by user ID with score = points — is the most cited Redis use case, and it works because skip lists give you O(log n) update and O(log n + k) range-by-rank in the same structure.

The hybrid trick

For small sorted sets (under 128 elements by default), Redis avoids the skip list entirely and uses a packed list (listpack). The skip list is only constructed when the set grows past the threshold or any member's score string exceeds 64 bytes. The crossover point is tunable via zset-max-listpack-entries and zset-max-listpack-value. Cache locality wins below the threshold; log n wins above it.

LevelDB and RocksDB memtables — concurrent inserts with no rebalancing

The memtable connection.

LSM tree storage engines stage all writes in an in-memory sorted structure — the memtable — before flushing it to disk as an immutable SSTable. The memtable needs three things: ordered iteration (for the flush, which writes keys in sorted order), point lookups (to serve reads for recently written keys), and concurrent inserts (writers from multiple threads).

LevelDB's memtable (db/skiplist.h) is a skip list because it satisfies all three with the simplest possible implementation. The insert protocol uses atomic CAS on the next-pointer at each level; readers never block, even under heavy write contention. There's no rebalancing — every new key just splices in. The whole file is 380 lines including comments.

RocksDB inherited the same code and added knobs: the default memtable is still a skip list (SkipListFactory), with memtable_prefix_bloom_size_ratio adding an optional bloom filter on top to short-circuit lookups for absent keys, and VectorMemTableRep as an alternative for write-only workloads where the order doesn't matter until flush time. The skip-list memtable is RocksDB's default for good reason: lock-free concurrent inserts plus ordered iteration in one structure.

Compare this to what a B-tree memtable would require: per-page locks, split cascades on concurrent inserts, and a much harder concurrency proof. The skip list's level independence is precisely what makes it concurrency-friendly. A writer on lane 0 doesn't interfere with a writer on lane 2; CAS at each lane handles all the coordination needed.

Lock-free skip lists — ConcurrentSkipListMap and the Harris linked list

The structure that takes best to lock-free concurrency.

Java's java.util.concurrent.ConcurrentSkipListMap (introduced in Java 6, 2006) is the canonical lock-free skip list in mainstream code. Written by Doug Lea, it's used everywhere from Kafka's offset index to Cassandra's memtables to countless ad-hoc concurrent ordered maps in JVM apps. The whole class is about 3000 lines, but the core algorithm — search, insert, delete — is small.

The foundation is Harris and Michael's lock-free linked list (Harris, 2001; Michael, 2002). The key idea: each next-pointer can be in three states — pointing to a real node, marked as "logically deleted" (the low bit of the pointer is set), or null. Inserts and deletes use compare-and-swap on these pointers. A reader walking the list and encountering a marked node knows to splice it out as it passes; deletion is therefore a two-step process (mark, then unlink) that doesn't require coordination between deleters and other operations.

Doug Lea's contribution was to lift the Harris-Michael protocol from a single linked list to a multi-lane skip list. Each lane is independently lock-free; coordination between lanes uses additional CAS steps but no locks. The result is an ordered concurrent map with linearisability guarantees and no thread ever blocking — under contention, threads may retry their CAS, but progress is always made somewhere in the system.

The same protocol shows up in other production code: Microsoft's CLR uses lock-free skip lists internally; Folly (Facebook's C++ library) ships a ConcurrentSkipList; Cassandra's pre-3.0 memtable was a Java ConcurrentSkipListMap (later replaced with a custom off-heap implementation for GC reasons).

Skip list vs red-black tree — same big-O, very different code

Pick the data structure that fits your concurrency model.

Both red-black trees and skip lists offer expected O(log n) for search, insert, and delete. Red-black trees give a worst-case bound; skip lists give a probabilistic bound with exponentially decaying tail. For any workload running more than a handful of operations, the empirical difference is negligible.

Where they diverge is in code complexity and concurrency. A red-black tree's insert has ~5 cases plus their mirror images for re-colourings and rotations; delete has more. Getting the implementation correct is a rite of passage in CS textbooks because it's hard. A skip list's insert is a descent plus a splice — perhaps 40 lines of code. Delete is the same shape.

Cache behaviour favours red-black trees slightly: a tree node has constant size and the path is shallow. Skip-list nodes are variable-size (the level array) and a single traversal may touch many cache lines. For tight single-threaded read-heavy workloads, a red-black tree is usually a hair faster.

Concurrency strongly favours skip lists. Red-black trees with rotations are hard to make concurrent without coarse locking; the rebalancing cascade can touch arbitrary parts of the tree. A skip list's per-lane CAS scales to many threads with no global coordination. This is why Java picked ConcurrentSkipListMap rather than ConcurrentRedBlackTree — the latter was tried and was much harder to implement correctly.

For in-memory data structures with high read concurrency and low write contention, either works. For high write contention, prefer the skip list. For embedded use (kernel rbtree, glibc std::map, intrusive lists), where every byte matters and concurrency is handled externally, red-black trees are the standard.

PropertyRed-black treeSkip list
Search / insert / deleteO(log n) worst caseO(log n) expected
Variance of search cost0 (deterministic)O(log n) std-dev
Implementation lines~300+ lines~40–100 lines
Cache behaviourbetter (compact nodes)worse (variable-size nodes)
Concurrent insertshard (rotations)easy (per-lane CAS)
Used byLinux kernel, glibc, STLRedis, LevelDB, RocksDB, Java J.U.C

What breaks — three failure modes worth remembering

The probabilistic guarantee assumes a real random source.

Bad RNG. A skip list with a deterministic or predictable RNG is vulnerable to adversarial inputs. An attacker who knows the seed can craft inserts that all land on the same lane, degrading the structure to a linked list with O(n) operations. Real implementations use either a per-instance random seed or a hash-based level (some implementations level each key by hashing the key and counting leading zeros). Redis uses random() from libc, seeded at server startup; LevelDB uses an XORshift PRNG with a per-thread seed.

Level cap too low. If the maximum level is capped too aggressively (say, 8) and the dataset grows past 2⁸ = 256 keys, search cost no longer scales logarithmically — every operation degrades toward the linked-list baseline at the top lane. Production implementations use caps of 32 (LevelDB, RocksDB, Redis) which is safe up to about 2³² = 4 billion keys; anything smaller starts to bite when the dataset grows.

Deletion with duplicates. A naive delete that removes "the first occurrence of key k" can leave dangling pointers at higher lanes pointing to a node that's been unlinked from lane 0. Production implementations either disallow duplicates outright (Redis sorted sets) or store an explicit count per node (multimap variants). Getting this wrong is a classic source of memory-corruption bugs in custom implementations.

Concurrent delete races. Lock-free skip-list deletes are subtle. The Harris-Michael protocol uses a two-step mark-then-unlink to avoid concurrent readers losing track of where they are. Naive implementations that try a single CAS to unlink can produce ABA bugs where a reader sees a stale next-pointer that's been recycled into an unrelated node. Doug Lea's ConcurrentSkipListMap source has extensive comments explaining the exact ordering of the marks.

Variants — deterministic skip lists, skip graphs, biased levelling

The family extends past the original paper.

Deterministic skip list (Munro, Papadakis, Sedgewick, 1992) eliminates the coin flip by maintaining 1-2-3 gaps between consecutive elements at each level. The trade is a worst-case O(log n) bound for somewhat more complex insertion logic. Rarely shipped in production because the probabilistic version's variance is already so tight.

Skip graph (Aspnes and Shah, 2003) generalises skip lists to peer-to-peer networks. Each node maintains pointers to neighbours at multiple levels of an overlay; search routes through the overlay in O(log n) hops. Used in early P2P research; never quite became the default DHT (Chord, Kademlia, Pastry took that role) but the structural ideas show up in DHT analysis.

Biased skip list (Bagchi, Buchsbaum, Goodrich, 2002) lets high-weight elements have higher promotion probabilities, giving access-frequency-weighted optimality similar to splay trees but without the rebalancing on read.

Bw-skip list and cache-conscious skip list are variants that lay out nodes for cache friendliness, packing multiple keys per lane node to amortise cache-line transfers. Used in some HPC contexts; rarely in databases.

Indexable skip list (the Redis rank trick) attaches a span count to each forward pointer so rank-by-position becomes O(log n). Standard outside Redis too, though most CS textbooks omit it.

Further reading on skip lists

Primary sources, in order.

Found this useful?