A B-tree, insert by insert.
Every database index you\'ve ever queried sits on top of a B-tree (or its close cousin, the B+ tree). Watch one grow from a single empty leaf to a three-level tree by inserting keys one at a time. When a leaf fills, it splits and shoves its median key up to the parent. When the root fills and splits, the tree gains a level — the only way a B-tree ever gets taller.
order-4 B-tree · insert 10, 20, 30, 40, 50, 25, 35, 5, 60, 70, 15, 22, 24, 23 max 4 keys/node · split at median · root grows only when forcedA B-tree with order 4 holds up to 4 keys per node and up to 5 children per internal node. Start with one empty node, which is both the root and a leaf. We'll insert keys one at a time and watch what happens when nodes overflow.
- Order
- The fan-out of a B-tree. Order-4 means 4 keys per node max (some sources count differently — Knuth defines order as max children, so this would be order-5 his way).
- Leaf vs internal node
- Leaves hold actual records (or pointers to records). Internal nodes only hold separator keys and child pointers — used to navigate down.
- Why B-trees
- Designed for disks. Wide fanout = few levels = few disk seeks. A B-tree of fanout 100 holds a billion keys in 5 levels. Hence "B" = "balanced" or "Bayer" or just because B was free.
Why splits propagate upward, not down
A B-tree grows from the bottom. Splits move keys UP a level, never down. That\'s why the tree stays balanced — every leaf is always the same distance from the root. Other tree variants (AVL, red-black) need rotations to stay balanced. B-trees skip rotations entirely; the splitting invariant guarantees balance for free. The cost is that splits write more bytes than a rotation would, but on disk you\'re writing whole pages anyway, so it\'s a wash.
B-tree vs B+ tree — what databases actually use
Most databases use B+ trees, not classic B-trees. The difference: in a B+ tree, all values live in the leaves. Internal nodes hold only the separator keys used for routing. The leaves are also linked together in a doubly-linked list. Why: range scans (WHERE id BETWEEN 100 AND 200) become a single leaf-walk after finding the start. Postgres, MySQL InnoDB, SQLite, Oracle, and most modern key-value engines all use this variant.
LSM trees (LevelDB, RocksDB, Cassandra) take the opposite approach: write everything to a memtable, periodically flush to immutable on-disk files, merge them in the background. Writes are dramatically faster; reads can be slower if the value isn\'t in the memtable. Choose-your-fighter: B-trees for read-heavy workloads, LSMs for write-heavy.
What "order 250" buys you
A real on-disk B-tree node fills an entire disk page — 4 KB or 16 KB typically. With 8-byte keys and 8-byte child pointers, an 8 KB node fits ~500 keys. Call it 250 conservatively. A tree of that fanout holds:
- 2 levels: ~62,500 keys
- 3 levels: ~16 million keys
- 4 levels: ~4 billion keys
- 5 levels: ~1 trillion keys
Each level is one disk seek (~10 ms on spinning disk, ~100 µs on SSD). Lookup in a trillion-key index: 5 seeks. The whole point of B-trees is reducing the I/O count, and the wide fanout is what does it. The top 1-2 levels almost always fit in RAM cache, so the typical lookup is more like 3 seeks even in pathological cases.
What happens on delete
The mirror of insertion: when a node falls below half-full (the underflow threshold), either borrow a key from a sibling or merge two siblings into one. Merges can propagate upward the same way splits do. In practice many databases skip aggressive rebalancing and let nodes get sparse — it\'s cheaper to leave a half-empty node than to merge eagerly. Periodic vacuum / rebuild processes clean up the sparseness over time.
Common gotchas
- Hot leaf splits. Inserting always-increasing keys (like timestamps or auto-incrementing IDs) means only the rightmost leaf ever splits. Most of the tree is half-empty. Sequential-insert optimization helps; some engines re-balance proactively.
- Long keys. Variable-length string keys reduce fanout. A 50-byte key in a 4 KB node = ~80 keys per node instead of 500. Trees get taller. Postgres truncates index keys past a limit for this reason.
- Write amplification. Every split rewrites at least two pages plus the parent. High-churn workloads can amplify a single logical write into 5-10 page writes. SSDs particularly hate this — affects their wear.
- Lock contention. Splits hold locks up the path while they propagate. High-concurrency workloads can stall on this. Latch-coupling and optimistic descent are the standard fixes.
What this simplifies
- Order 4 instead of order 250. Real fanout is much wider; pictures would be unreadable. The split logic is identical.
- No values shown. In a B+ tree the leaves hold actual records or pointers. We only show keys.
- No leaf linking. B+ tree leaves are connected in a list for range scans. We omitted the back-pointers.
- No concurrency. Real B-trees have to deal with multiple readers and writers at once — latches, optimistic protocols, split-pointer-update ordering. Big topic on its own.
Databases Codex →
Index internals, B-tree vs LSM trade-offs, query planners, WAL + page cache + B-tree interaction, why Postgres splits pages the way it does.
Open the Codex →