B-trees
Bayer & McCreight, 1972. Five decades on, every mainstream relational database (Postgres, MySQL, SQLite, SQL Server, Oracle) still ships a B-tree as its primary index. Wide fan-out, shallow trees, predictable cache behaviour, in-place updates. Here is what the structure actually looks like, why splits work, and what every storage engine quietly tunes.
B-tree, B+ tree, B* tree —
three variants, one idea.
The original 1972 paper described what we now call the B-tree: keys and values stored in every node, including internal nodes. Disk seeks were the bottleneck, and storing values close to the search path saved seeks.
A few years later, Knuth's variant, the B+ tree, moved all values to the leaves. Leaves form a doubly-linked list. Internal nodes hold only routing keys. This trades a little space for two big wins: range scans become a leaf-walk instead of a tree-walk, and internal nodes hold more keys per page so the tree is shorter.
The B* tree tweaks split-at-fill rules to push branches to ⅔ full instead of ½, at the cost of more complex split logic. It's largely historical now. Modern engines pick B+.
A page is the unit of everything.
A B-tree node is one disk page. Most engines use 8 KB or 16 KB pages. Postgres defaults to 8, InnoDB to 16. The choice trades scan throughput against write-amplification under random updates.
Inside the page, the layout is roughly:
+----------------------------------+ offset 0
| page header (level, flags, lsn) |
+----------------------------------+
| slot array (offsets into payload)| ← grows down
| ↓ |
| ↓ |
| |
| ↑ |
| ↑ |
| key/value payload | ← grows up
+----------------------------------+ offset PAGESIZEThe slot array gives binary-searchable, fixed-size pointers to variable-length keys. The payload grows the other way. The page is "full" when the gap between them closes.
Fan-out and tree height.
A 16 KB internal page holding 16-byte keys + 6-byte child pointers fits roughly 700 children. So a three-level tree indexes ~700³ ≈ 340 million rows. A four-level tree indexes hundreds of billions.
Real engines tune for "no internal page is ever evicted from RAM." With a 16 GB buffer pool that's about 10⁶ pages, easily enough to keep all internal pages of a B+ tree covering tens of billions of rows resident. Every point-lookup becomes one disk seek (the leaf), or zero if the leaf is also cached.
| Tree height | Rows indexed (16KB pages, 16B keys) | Disk seeks per lookup |
|---|---|---|
| 2 (root + leaves) | ~700,000 | 0–1 |
| 3 | ~500 million | 0–1 (internal cached) |
| 4 | ~340 billion | 0–1 (internal cached) |
| 5 | ~240 trillion | 1–2 |
Splits — the whole game.
An insert into a leaf that has space writes the key in place. An insert into a full leaf triggers a split: allocate a new sibling page, redistribute roughly half the keys to it, install a new separator key in the parent. If the parent is also full, it splits too. In the worst case the split propagates to the root and the tree grows by one level.
Two split policies matter:
- 50/50 split. Clean math, even fill, but for monotonically-increasing keys (timestamps, autoincrement IDs) it leaves every left page half-empty forever.
- Asymmetric / right-most split. When the inserted key is the largest
in the page, leave the existing page full and start the new one with just the new key.
InnoDB calls this "page-split optimization for sequential inserts." Postgres has
similar logic in
nbtree.
Latching — concurrent access.
The classic B-tree algorithm assumes single-threaded access. To let many transactions read and write at once, engines layer "latches" (lightweight in-memory locks) on top of pages. The naive scheme, lock every page on the path from root to leaf, kills concurrency: every reader and writer contends on the root.
Three improvements ship in production:
- Lock coupling (crabbing). Acquire the child latch before releasing the parent. Walks proceed top-down without blocking concurrent walks for long.
- Optimistic descent. Walk to the leaf with read-latches only. If the leaf has space and the insert doesn't trigger a split, commit. Otherwise restart pessimistically.
- Lehman-Yao B-link tree. Each node has a "right-link" to its right
sibling. A reader that finds the key it wants past the high-key of the current page
follows the right-link instead of restarting. Lets concurrent splits coexist with
readers without blocking. Postgres's
nbtreeuses this.
Copy-on-write B-trees.
LMDB, BoltDB, and the etcd backend use a copy-on-write variant. Instead of mutating a page in place, allocate a new page with the change, allocate a new parent that points to it, recurse to the root. Atomic commit becomes "swap the root pointer."
Trade-offs vs in-place B-trees:
| Property | In-place B+ tree | CoW B-tree |
|---|---|---|
| Crash recovery | WAL replay | None — old root is consistent |
| Write amplification | 1–3× (page + WAL) | ~logfanout(N) |
| Concurrent writers | Many | Single writer |
| Read snapshot isolation | Via MVCC layer | Free — old roots persist |
| Long-running readers | Cheap | Hold pages from being freed |
When B-trees hurt.
- Random-key heavy write workloads. Every insert may trigger a split, and every split is a random page write. LSM-trees absorb this with sequential writes. See the LSM deep dive.
- Write amplification. A 100-byte row update may dirty an 8 KB page (80×) plus a WAL record (a few hundred bytes). On flash, write-amp shortens device life and consumes IOPS budget.
- Index maintenance after bulk delete. Pages don't merge eagerly when
rows leave. Postgres's
VACUUMexists in part to reclaim B-tree pages. Long-running transactions block this.
Further reading.
- Bayer & McCreight (1972) — Organization & Maintenance of Large Ordered Indexes — the original paper.
- Lehman & Yao (1981) — Efficient Locking for Concurrent Operations on B-Trees — the right-link trick that ships in Postgres.
- postgres/nbtree — a clean, heavily-commented production implementation of a B-link tree.
- LevelDB design — for comparison: how an LSM differs.
- Petrov — Database Internals, ch. 2–3 — the most readable modern walk-through.
- CMU 15-445 — Tree Indexes lectures — Andy Pavlo, recorded annually.
- Semicolony — How CPU caches work — B-tree page reads are dominated by L2/L3 hit rate. Why an 8 KB page sits in two cache lines and why hot index pages stay resident in L3 across queries.
- Semicolony — How SSDs work — B-tree's in-place update pattern produces higher write amplification on flash than LSM does. Why MySQL/InnoDB has a "doublewrite buffer" and what that costs the FTL.