01 / 14
Internals / 01

B-trees

Bayer & McCreight, 1972. Five decades on, every mainstream relational database (Postgres, MySQL, SQLite, SQL Server, Oracle) still ships a B-tree as its primary index. Wide fan-out, shallow trees, predictable cache behaviour, in-place updates. Here is what the structure actually looks like, why splits work, and what every storage engine quietly tunes.


B-tree, B+ tree, B* tree —
three variants, one idea.

The original 1972 paper described what we now call the B-tree: keys and values stored in every node, including internal nodes. Disk seeks were the bottleneck, and storing values close to the search path saved seeks.

A few years later, Knuth's variant, the B+ tree, moved all values to the leaves. Leaves form a doubly-linked list. Internal nodes hold only routing keys. This trades a little space for two big wins: range scans become a leaf-walk instead of a tree-walk, and internal nodes hold more keys per page so the tree is shorter.

The B* tree tweaks split-at-fill rules to push branches to ⅔ full instead of ½, at the cost of more complex split logic. It's largely historical now. Modern engines pick B+.

Which one ships? Postgres heap tables use B+ for indexes (the heap itself is a separate structure). InnoDB stores rows clustered in a B+ tree keyed by primary key. SQLite is B+. SQL Server is B+. Pretty much everyone is B+.

A page is the unit of everything.

A B-tree node is one disk page. Most engines use 8 KB or 16 KB pages. Postgres defaults to 8, InnoDB to 16. The choice trades scan throughput against write-amplification under random updates.

Inside the page, the layout is roughly:

+----------------------------------+ offset 0
| page header (level, flags, lsn)  |
+----------------------------------+
| slot array (offsets into payload)|   ← grows down
|        ↓                         |
|        ↓                         |
|                                  |
|        ↑                         |
|        ↑                         |
| key/value payload                |   ← grows up
+----------------------------------+ offset PAGESIZE

The slot array gives binary-searchable, fixed-size pointers to variable-length keys. The payload grows the other way. The page is "full" when the gap between them closes.

Fan-out and tree height.

A 16 KB internal page holding 16-byte keys + 6-byte child pointers fits roughly 700 children. So a three-level tree indexes ~700³ ≈ 340 million rows. A four-level tree indexes hundreds of billions.

Real engines tune for "no internal page is ever evicted from RAM." With a 16 GB buffer pool that's about 10⁶ pages, easily enough to keep all internal pages of a B+ tree covering tens of billions of rows resident. Every point-lookup becomes one disk seek (the leaf), or zero if the leaf is also cached.

Tree heightRows indexed (16KB pages, 16B keys)Disk seeks per lookup
2 (root + leaves)~700,0000–1
3~500 million0–1 (internal cached)
4~340 billion0–1 (internal cached)
5~240 trillion1–2

Splits — the whole game.

An insert into a leaf that has space writes the key in place. An insert into a full leaf triggers a split: allocate a new sibling page, redistribute roughly half the keys to it, install a new separator key in the parent. If the parent is also full, it splits too. In the worst case the split propagates to the root and the tree grows by one level.

Two split policies matter:

  • 50/50 split. Clean math, even fill, but for monotonically-increasing keys (timestamps, autoincrement IDs) it leaves every left page half-empty forever.
  • Asymmetric / right-most split. When the inserted key is the largest in the page, leave the existing page full and start the new one with just the new key. InnoDB calls this "page-split optimization for sequential inserts." Postgres has similar logic in nbtree.
Why this matters. A schema with a UUIDv4 primary key fragments the B+ tree (every insert is random and triggers split-the-middle). Switching to UUIDv7 or autoincrement gives you the asymmetric-split optimization and 2–3× faster bulk inserts.

Latching — concurrent access.

The classic B-tree algorithm assumes single-threaded access. To let many transactions read and write at once, engines layer "latches" (lightweight in-memory locks) on top of pages. The naive scheme, lock every page on the path from root to leaf, kills concurrency: every reader and writer contends on the root.

Three improvements ship in production:

  • Lock coupling (crabbing). Acquire the child latch before releasing the parent. Walks proceed top-down without blocking concurrent walks for long.
  • Optimistic descent. Walk to the leaf with read-latches only. If the leaf has space and the insert doesn't trigger a split, commit. Otherwise restart pessimistically.
  • Lehman-Yao B-link tree. Each node has a "right-link" to its right sibling. A reader that finds the key it wants past the high-key of the current page follows the right-link instead of restarting. Lets concurrent splits coexist with readers without blocking. Postgres's nbtree uses this.

Copy-on-write B-trees.

LMDB, BoltDB, and the etcd backend use a copy-on-write variant. Instead of mutating a page in place, allocate a new page with the change, allocate a new parent that points to it, recurse to the root. Atomic commit becomes "swap the root pointer."

Trade-offs vs in-place B-trees:

PropertyIn-place B+ treeCoW B-tree
Crash recoveryWAL replayNone — old root is consistent
Write amplification1–3× (page + WAL)~logfanout(N)
Concurrent writersManySingle writer
Read snapshot isolationVia MVCC layerFree — old roots persist
Long-running readersCheapHold pages from being freed

When B-trees hurt.

  • Random-key heavy write workloads. Every insert may trigger a split, and every split is a random page write. LSM-trees absorb this with sequential writes. See the LSM deep dive.
  • Write amplification. A 100-byte row update may dirty an 8 KB page (80×) plus a WAL record (a few hundred bytes). On flash, write-amp shortens device life and consumes IOPS budget.
  • Index maintenance after bulk delete. Pages don't merge eagerly when rows leave. Postgres's VACUUM exists in part to reclaim B-tree pages. Long-running transactions block this.

Further reading.

Found this useful?