13 · 16 steps
Visualize / 13

A B-tree, insert by insert.

Every database index you\'ve ever queried sits on top of a B-tree (or its close cousin, the B+ tree). Watch one grow from a single empty leaf to a three-level tree by inserting keys one at a time. When a leaf fills, it splits and shoves its median key up to the parent. When the root fills and splits, the tree gains a level — the only way a B-tree ever gets taller.


step 1 / 16 · 0 keys · 1 nodes · height 1
building order-4 B-tree · insert 10, 20, 30, 40, 50, 25, 35, 5, 60, 70, 15, 22, 24, 23 max 4 keys/node · split at median · root grows only when forced
LEAF
INTERNAL
FULL
JUST INSERTED
Empty tree · just a single empty leaf

A B-tree with order 4 holds up to 4 keys per node and up to 5 children per internal node. Start with one empty node, which is both the root and a leaf. We'll insert keys one at a time and watch what happens when nodes overflow.

Order
The fan-out of a B-tree. Order-4 means 4 keys per node max (some sources count differently — Knuth defines order as max children, so this would be order-5 his way).
Leaf vs internal node
Leaves hold actual records (or pointers to records). Internal nodes only hold separator keys and child pointers — used to navigate down.
Why B-trees
Designed for disks. Wide fanout = few levels = few disk seeks. A B-tree of fanout 100 holds a billion keys in 5 levels. Hence "B" = "balanced" or "Bayer" or just because B was free.

Why splits propagate upward, not down

A B-tree grows from the bottom. Splits move keys UP a level, never down. That\'s why the tree stays balanced — every leaf is always the same distance from the root. Other tree variants (AVL, red-black) need rotations to stay balanced. B-trees skip rotations entirely; the splitting invariant guarantees balance for free. The cost is that splits write more bytes than a rotation would, but on disk you\'re writing whole pages anyway, so it\'s a wash.

B-tree vs B+ tree — what databases actually use

Most databases use B+ trees, not classic B-trees. The difference: in a B+ tree, all values live in the leaves. Internal nodes hold only the separator keys used for routing. The leaves are also linked together in a doubly-linked list. Why: range scans (WHERE id BETWEEN 100 AND 200) become a single leaf-walk after finding the start. Postgres, MySQL InnoDB, SQLite, Oracle, and most modern key-value engines all use this variant.

LSM trees (LevelDB, RocksDB, Cassandra) take the opposite approach: write everything to a memtable, periodically flush to immutable on-disk files, merge them in the background. Writes are dramatically faster; reads can be slower if the value isn\'t in the memtable. Choose-your-fighter: B-trees for read-heavy workloads, LSMs for write-heavy.

What "order 250" buys you

A real on-disk B-tree node fills an entire disk page — 4 KB or 16 KB typically. With 8-byte keys and 8-byte child pointers, an 8 KB node fits ~500 keys. Call it 250 conservatively. A tree of that fanout holds:

  • 2 levels: ~62,500 keys
  • 3 levels: ~16 million keys
  • 4 levels: ~4 billion keys
  • 5 levels: ~1 trillion keys

Each level is one disk seek (~10 ms on spinning disk, ~100 µs on SSD). Lookup in a trillion-key index: 5 seeks. The whole point of B-trees is reducing the I/O count, and the wide fanout is what does it. The top 1-2 levels almost always fit in RAM cache, so the typical lookup is more like 3 seeks even in pathological cases.

What happens on delete

The mirror of insertion: when a node falls below half-full (the underflow threshold), either borrow a key from a sibling or merge two siblings into one. Merges can propagate upward the same way splits do. In practice many databases skip aggressive rebalancing and let nodes get sparse — it\'s cheaper to leave a half-empty node than to merge eagerly. Periodic vacuum / rebuild processes clean up the sparseness over time.

Common gotchas

  • Hot leaf splits. Inserting always-increasing keys (like timestamps or auto-incrementing IDs) means only the rightmost leaf ever splits. Most of the tree is half-empty. Sequential-insert optimization helps; some engines re-balance proactively.
  • Long keys. Variable-length string keys reduce fanout. A 50-byte key in a 4 KB node = ~80 keys per node instead of 500. Trees get taller. Postgres truncates index keys past a limit for this reason.
  • Write amplification. Every split rewrites at least two pages plus the parent. High-churn workloads can amplify a single logical write into 5-10 page writes. SSDs particularly hate this — affects their wear.
  • Lock contention. Splits hold locks up the path while they propagate. High-concurrency workloads can stall on this. Latch-coupling and optimistic descent are the standard fixes.

What this simplifies

  • Order 4 instead of order 250. Real fanout is much wider; pictures would be unreadable. The split logic is identical.
  • No values shown. In a B+ tree the leaves hold actual records or pointers. We only show keys.
  • No leaf linking. B+ tree leaves are connected in a list for range scans. We omitted the back-pointers.
  • No concurrency. Real B-trees have to deal with multiple readers and writers at once — latches, optimistic protocols, split-pointer-update ordering. Big topic on its own.
Go deeper

Databases Codex →

Index internals, B-tree vs LSM trade-offs, query planners, WAL + page cache + B-tree interaction, why Postgres splits pages the way it does.

Open the Codex →
Found this useful?