B-Tree Simulator: where databases live.

A B-tree keeps data sorted so search, insert, and delete stay O(log n) even at billions of rows: many keys per node, high fan-out, shallow height. It's the structure every relational database reaches for once an index outgrows memory.

Order
4
Height
1
Nodes
1
Keys
0

Order (max keys per node)
Operate
Presets
Range scan
Tree
leaf internal last touched
Trace
— quiet —

What you're looking at

Each box is a node holding up to order keys in sorted order; green-edged boxes are leaves, dark-edged ones are internal routing nodes. Insert a key and it lands in a leaf. When that leaf is already full, it splits: the median key climbs into the parent and the rest divides into two siblings. The trace below the tree narrates every search path, and the counters up top track height, node count, and key count as the tree reshapes itself.

Set order to 3 and insert 10, 20, 30, 40, 50, 60, 70 in that exact order. A binary search tree fed sorted input degenerates into a linked list; here the tree splits at 30, splits again at 50, and stays perfectly balanced with no special handling. Watch where the growth happens: a B-tree never gets taller at the leaves. It grows by splitting the root, so every leaf stays at the same depth forever. Then delete a few keys and watch the reverse moves — borrowing from a sibling, or merging two nodes back into one.


What is a B-tree?

When the index is bigger than memory.

A B-tree is a self-balancing tree data structure that keeps data sorted and supports search, insertion, and deletion in O(log n) time, optimised for systems that read and write large blocks. Rudolf Bayer and Edward McCreight introduced it in a 1970 Boeing Research technical report (published 1972). Five decades later, every mainstream relational database — PostgreSQL, MySQL, SQLite, SQL Server, Oracle — still uses a B-tree (or its B+ tree variant) as its primary index structure. Wide fan-out, low height, page-aligned reads.

Imagine you are running the database behind a billion-row table — orders, users, sensor readings, take your pick — and you want to answer a single innocent-sounding query: find the row with id = 42,857,193 in under a millisecond, even when the same query has never been seen before. The data is on disk. The index is on disk. Your buffer cache is large but cannot fit the whole index. Each random read from disk takes 100 microseconds on a fast SSD; each cache miss is a real penalty.

If your index is a sorted array on disk, you can binary-search it. That is log₂ 10⁹ ≈ 30 comparisons, which means up to thirty random reads — three milliseconds, missed budget. If your index is an in-memory red-black tree, the structure is fine but it does not fit on disk in any natural way and rebuilding it after a crash is impossible. If your index is an unbalanced binary search tree, sorted insertions degenerate it into a linked list and a single lookup costs a billion reads. None of the obvious candidates work.

The B-tree was designed for exactly this problem. Its central trick is to make each node large enough to fill one filesystem page — typically 8 KB on Postgres, 16 KB on InnoDB — and pack hundreds of keys into each page. A single disk read brings back enough information to make hundreds of branching decisions in cache; the tree's height collapses from log₂ n to log₂₀₀ n. A billion-row index that would be 30 levels deep as a binary tree becomes four levels deep as a B-tree. Four random reads is 400 microseconds, comfortably inside any latency budget.

The structure also keeps itself balanced under any insertion order. Every leaf sits at the same depth from the root; splits propagate upward until the root itself splits and the tree gains a level. There is no degenerate case — sorted inserts, random inserts, adversarial inserts all produce trees of the same shape. That guarantee is what makes B-trees the default index for every relational database written since 1972: Postgres, MySQL InnoDB, Oracle, SQL Server, SQLite, Berkeley DB, LMDB, BoltDB, FoundationDB, etcd's bbolt backend, MongoDB WiredTiger, and SQL Server's columnstore directory all use a B-tree variant. The data structure is older than UNIX, older than the C language, and shows no sign of being displaced.

The simulator above lets you insert keys, watch nodes split when they reach capacity, and observe how the median key migrates upward into the parent. Try inserting 10, 20, 30, 40, 50, 60, 70 in order with order-3 capacity: the first split happens at 30, the second at 50; the tree quietly grows from one level to two without ever becoming a stick. Concrete numbers worth carrying: a Postgres B-tree page (8 KB) holds about 200–400 routing keys; InnoDB's 16 KB page holds 100–300; the tree's height for a billion entries is therefore three or four; a single point lookup costs three or four random reads regardless of how the data was inserted.

DISK PAGE = ONE READ · ONE READ FETCHES MANY KEYS8 KB PAGE 10 28 41 60 78 95 112 130 147 168 189 205 222 240200–400 ROUTING KEYS PER PAGE · BINARY SEARCH IN-CACHEpagepagepagepageBILLION ROWS · LOG_400(10⁹) ≈ 4 LEVELS · 4 RANDOM READS PER LOOKUP

B-tree vs B+ tree — the variant your database actually uses

The variant your database actually uses.

Pure B-trees store payloads in every node, internal and leaf. B+trees store routing keys only in the internal nodes; payloads live only in leaves; leaves are linked left-to-right in a doubly-linked list. The shape is otherwise identical, but the consequences are dramatic.

Why every relational engine uses B+trees instead of plain B-trees: range scans. Once you find the first leaf matching WHERE id > 1000, you can walk leaf-to-leaf without re-traversing the tree. The internal nodes also pack denser, since they hold only keys — meaning higher fan-out, shallower trees, fewer disk reads per lookup. A typical Postgres index node carries 200–400 routing keys; a typical InnoDB index node, 100–300. The internal layer becomes a coarse-grained directory pointing to the data layer.

The trade-off compared to a pure B-tree is that successful point lookups always traverse to a leaf — there's no opportunity to terminate early at an internal node that happens to hold the value. For random-access workloads this costs one extra page read on average. For range scans and ordered iteration the savings dominate.

2560INTERNAL · routing keys only10:rowA22:rowB25:rowC48:rowD60:rowE82:rowFLEAVES · payloads + sibling links → enable cheap range scans

Origins — Bayer and McCreight, 1972

Bayer and McCreight, 1972.

The B-tree was published by Rudolf Bayer and Edward McCreight in Organization and Maintenance of Large Ordered Indexes at Boeing Scientific Research Labs in 1972. The motivating problem was concrete: indexes had to live on rotating disk because main memory was tiny by today's standards, and disk was about a million times slower per access than RAM. Any data structure that fetched one node per pointer dereference was unusable at scale, because each fetch could be ten milliseconds. The B-tree's contribution was the realisation that nodes could be made arbitrarily large — and should be — so a single disk read brought back enough information to make many comparison decisions. With high fan-out, even a billion-record index becomes only four or five levels deep.

Bayer himself later said the "B" was deliberately ambiguous. It might stand for Boeing, for balanced, for Bayer, or for none of the above. The paper itself never spelled it out. What matters is the structural invariant: every leaf sits at the same depth from the root. The tree never becomes long-and-stringy the way an unbalanced binary search tree does after sorted insertions. Splits push the median key up to the parent, which in turn might split, and the only way for the tree to grow taller is for the root itself to split — guaranteeing perfect balance at every level.

That guarantee is what makes B-trees the durable choice for ordered indexes, more than half a century after the paper. Postgres, MySQL InnoDB, Oracle, SQL Server, SQLite, Berkeley DB, LMDB, BoltDB, FoundationDB's storage engine, etcd's mvcc backend (bbolt), Cassandra's secondary indexes, MongoDB WiredTiger, and SQL Server columnstore's row-group directory all use a B-tree variant for their primary or secondary indexes. The data structure is older than UNIX, older than the C language, and shows no sign of being displaced.

B-tree operations — search, insert, delete, all in O(log n)

Three operations, all logarithmic.

Search walks down from the root, performing a binary search inside each node to choose the next child to follow, until it lands on a leaf. The traversal visits one node per level. With order m and n keys total, height is O(log_⌈m/2⌉ n). With order 200 and a billion keys, that is fewer than four levels — meaning under four random page reads to locate any row, regardless of insertion history.

Insert is search plus a possible split. The CLRS textbook variant — used by this simulator — performs a preemptive split: while descending, if the next child is already at maximum capacity, split it before descending into it. This guarantees that when we finally land at the leaf, the leaf has room. The advantage is that the algorithm only ever touches the path from root to leaf — no second traversal back up to propagate splits. The disadvantage is that it splits some nodes unnecessarily; an insert into a barely-full subtree might split it even though the new key would have landed elsewhere. In practice the difference is invisible.

Delete is the algorithm everyone glosses over in the textbook and gets wrong in production. The CLRS recipe descends recursively, and at each step ensures that the child it's about to descend into has at least ⌈m/2⌉ keys — borrowing from a sibling if possible, merging two siblings if not. By the time the recursion lands at a leaf, the leaf is guaranteed to have a key it can safely remove. This "fix on the way down" approach mirrors the preemptive split for insert, and like split it avoids a second pass.

The borrow-or-merge alternation is what keeps the tree balanced. Borrowing rotates one key from a richer sibling through the parent into the lean child; the parent's separator key shifts to maintain order. Merging combines two skinny siblings plus their separator into one node; the parent loses a key and a child pointer, which can in turn force the parent to rebalance. If merging cascades all the way up to the root, the root may lose its last key, in which case its sole remaining child becomes the new root and the tree shrinks by one level — the only way a B-tree gets shorter.

OperationTimeI/O costNotes
Search (point lookup)O(log n)1 page per levelbinary search inside each node is in-cache once page is in
Search (range scan)O(log n + k)1 page per level + k/B leaf pagesk = matches; B+tree leaf links make this fast
InsertO(log n)1 page per level (read) + dirty pages (write)amortised; splits are O(log n) worst-case
DeleteO(log n)1 page per level (read) + dirty pagesborrow / merge handled on descent
Bulk loadO(n log n)linear if input is sortedbuild leaves left-to-right, fanout up
Min / maxO(log n)1 page per levelwalk leftmost or rightmost

Page size and disk I/O — why the order isn't arbitrary

The order isn't arbitrary.

Real databases pick the B-tree order to make each node fit one filesystem page — typically 8 KB on Postgres, 16 KB on InnoDB, 4 KB on SQLite. With keys around 16 bytes, pointers around 8 bytes, and per-entry overhead, that gives fan-out of 200–500 routing keys per page. A billion-row table fits in just four levels: log_500(10⁹) ≈ 3.3. Three or four random reads to find any row.

Page sizing matters because the unit of disk I/O is a page, not a byte. Reading half a page costs the same as reading a whole one; spillage to a second page doubles the cost. The fanout you can fit in one page is the single biggest knob on read latency in a relational engine. Modern NVMe SSDs do millions of small random reads per second, but the page-aligned read still pays the full block transfer cost — and if your hot working set doesn't fit in the database's buffer pool, you pay it on every miss.

This is why MySQL InnoDB's default 16 KB page exists despite many filesystems using 4 KB blocks. The doubling lets internal nodes pack roughly twice the routing keys, halving the height for billion-row tables. The tradeoff is read amplification on small lookups: a 16-byte row read pulls 16 KB from disk, plus a partial page from the WAL, plus possibly a page from a secondary index. For OLTP workloads with aligned hot data this is fine; for analytical scans where every row is unique, columnar engines absorb the same workload more cheaply.

# Postgres internals — a B-tree page header
struct ItemIdData {
 uint32_t lp_off : 15; // offset to tuple in page
 uint32_t lp_flags: 2;
 uint32_t lp_len : 15;
};
# Pages contain: header + ItemId array growing down,
# tuples growing up, free space in between.
# Standard Postgres pages: 8192 bytes, configurable via --with-blocksize.
# InnoDB pages: 16384 bytes, fixed in mainline MySQL.

B-tree variants — B+ tree, B* tree, fractal trees

A family of tree-shaped indexes.

The textbook B-tree is a starting point. Production systems run variants tuned for specific access patterns, hardware constraints, and concurrency demands. Knowing which variant a system uses changes how you reason about its performance.

B+tree separates keys (in internal nodes) from values (in leaves). Adopted by every relational engine; the canonical choice for ordered indexes on disk. Postgres, MySQL InnoDB, Oracle, SQL Server, SQLite, Berkeley DB, LMDB. Search always reaches a leaf; range scans walk leaf-to-leaf via sibling links.

B*tree tightens the minimum-keys invariant from ⌈m/2⌉ to ⌈2m/3⌉, achieving higher density. When a node would split, it first tries to redistribute with a sibling; only when both are full does a 2-into-3 split occur. NTFS and HFS+ used B*trees historically. The complexity gain is roughly 10% better space utilisation at the cost of more complex insertion logic.

B-link tree, introduced by Lehman and Yao in 1981, adds a right-sibling pointer to every node and a "high key" indicating the upper bound of keys in that node. The combination lets readers descend without taking locks: if a search arrives at a node and its target key exceeds the high key, the reader walks right to the next sibling. This is the foundation of high-concurrency B-tree implementations; Postgres' nbtree code is essentially a B-link tree, citing Lehman and Yao directly in the source.

Copy-on-write B-tree never mutates a page in place. Updates clone the path from root to leaf; the new root atomically replaces the old. LMDB (Symas), BoltDB, etcd's bbolt, and FoundationDB's storage engine use this approach. The advantages: trivial crash safety (the old tree remains valid until the new root commits), trivial snapshots (any old root is a consistent snapshot), and no need for a separate write-ahead log. The disadvantages: every write touches O(log n) pages even for a single-key update, doubling space usage during long-running readers.

Bw-tree, designed at Microsoft Research for Hekaton, replaces in-place page updates with a delta chain. Updates append a small delta record to a per-page log; periodically the system consolidates the deltas back into a fresh base page. Latch-free reads, optimistic CAS-based writes, friendly to NVRAM. Used in SQL Server's in-memory OLTP engine and in CMU's NoisePage research system.

Fractal tree (Tokutek, now part of Percona) uses message buffers at internal nodes. Inserts append to the buffer at the root; messages flush down when buffers fill. Drastically reduces write amplification at the cost of more complex compaction. TokuDB and TokuMX shipped commercially before LSM trees won the write-heavy market.

Prefix B-tree compresses repeated key prefixes within a node. Useful when keys share long common prefixes — URL paths, namespaced identifiers, multi-column composite indexes. InnoDB and Postgres both use prefix compression in the leaf level under specific conditions; it can double the effective fan-out on text-heavy keys.

VariantUsed byKey property
B-tree (classic)textbook, this simulatorvalues in every node
B+treePostgres, InnoDB, Oracle, SQLite, LMDBvalues only in leaves; sibling links for scans
B*treeNTFS, HFS+, some IBM DB22-into-3 splits; 67% min density
B-link treePostgres nbtreeright-sibling pointer; latch-free reads
Copy-on-write B-treeLMDB, BoltDB, FoundationDBimmutable pages; root swap on commit
Bw-treeSQL Server Hekaton, NoisePagedelta chains; latch-free
Fractal treeTokuDB, TokuMX (Percona)buffered inserts; lower write amplification
Prefix B-treeInnoDB, Postgres (leaf level)shared prefix compression per node

B-tree concurrency — many readers, many writers, one tree

Many readers, many writers, one tree.

A textbook B-tree assumes a single thread. A production B-tree services hundreds of concurrent transactions, each potentially descending or modifying the same path. Three families of concurrency control dominate: lock coupling, B-link trees, and optimistic / lock-free schemes.

Lock coupling (also called crabbing) takes a latch on the parent before the child, and releases the parent latch only when it has confirmed that no modification at the child will need to propagate upward. For reads this is straightforward: a shared latch on the next page, then release the previous shared latch. For writes it's more delicate: take an exclusive latch on the path until you can prove no further split or merge is needed. The classical hazard is that a worst-case insertion holds latches all the way to the root, throttling other writers.

B-link trees, the Lehman-Yao approach, eliminate the upward latching by giving every node a right-sibling pointer and a high key. Readers never take locks; if a reader descends to a node and finds its target key exceeds the high key, it walks right via the sibling pointer. Writers still take latches but only briefly, on the specific page being modified. The result is a B-tree that scales nearly linearly with cores for read-heavy workloads. Postgres adopted this design in the 1990s and the src/backend/access/nbtree source still references the 1981 paper.

Optimistic schemes go further: writers don't take latches at all. They read a version, compute the modification, and CAS the result into place. If the CAS fails because someone else modified the page in the meantime, the writer retries. Bw-tree exemplifies this approach. The trade-off is that contended writes retry repeatedly, while uncontended ones are much faster than any latch-based scheme.

Buffer pool latches vs row locks

Don't conflate the latches on B-tree pages with the row-level locks visible to your queries. Page latches are very brief (microseconds) and exist solely to keep the in-memory representation of a page consistent during reads and writes. Row locks are longer-lived (transaction scope) and enforce isolation. A query can hold a row lock for seconds without blocking other readers of the same page.

B-tree write amplification — when random writes break locality

Random writes break locality.

B-trees mutate in place. A million random updates touch a million different leaf pages — every write is a random disk write. On rotating disks this was painful; on SSDs it remains expensive (write amplification, sometimes 5–20× the user's data on flash, because the SSD's flash translation layer relocates pages internally to wear-level the medium).

This is why LSM trees (Cassandra, RocksDB, LevelDB, ScyllaDB, Pebble, Bigtable's Tablet) win for write-heavy workloads — they batch writes into sorted in-memory tables (memtables) and flush them sequentially to immutable on-disk segments (SSTables). Reads pay the cost: check memtable, then several SSTables in order, possibly merging values across levels. Bloom filters at each level cut the average number of SSTables consulted per read. Compaction periodically merges SSTables to keep read cost bounded.

The trade-off — read amplification for write amplification — is the central tension between OLTP storage engines. For OLTP (mixed read/write, typically more reads), B-trees still win. For logs, time series, queue stores, and write-heavy workloads, LSM dominates. The boundary keeps moving as flash gets cheaper and faster, but the framing remains useful.

A hybrid approach: WiredTiger (MongoDB's default storage engine since 4.0) is fundamentally a B+tree but uses techniques from the LSM playbook for log writes and supports both row and column storage. SQL Server's columnstore indexes use a delta store (effectively an LSM frontend) feeding into compressed column segments. The pure-B-tree-vs-pure-LSM divide is increasingly an oversimplification of what production storage engines actually do.

PropertyB+treeLSM tree
Write amplification1× (page-aligned writes only)5–30× (compaction)
Read amplification1× (one path per lookup)1–10× (multiple SSTables consulted)
Space amplification~1.5× (fragmentation, free space)1.1–2× (depending on compaction policy)
Range scanfast (sibling links)slower (merge across SSTables)
Best atOLTP, mixed read-writewrites, time series, append-only
Used byPostgres, InnoDB, Oracle, LMDBRocksDB, Cassandra, ScyllaDB, Pebble

Bulk loading — building a B-tree from scratch

Building a B-tree from scratch.

A normal B-tree insertion is O(log n), but many small-page splits along the way leave the tree about 70% full on average — that's the steady-state load factor when keys arrive in random order. When you're bulk-loading a freshly created index against an existing table, this is wasteful. Bulk-load algorithms can do dramatically better.

The standard technique is sort-then-build-bottom-up. First, sort all the keys (external sort if they don't fit in memory; CREATE INDEX in Postgres does this with the maintenance_work_mem buffer). Then walk the sorted stream, packing each leaf to capacity (or near-capacity, leaving some headroom for future inserts), and emit the leaves in order. The leaves themselves form a sorted stream of "highest key per leaf"; a second pass packs those into internal nodes, then a third pass packs those into the next level up, and so on until you produce a single root. The result is a tree at 100% leaf density (or whatever fill factor you specified) with a guaranteed minimum height.

Postgres implements this for CREATE INDEX with the fillfactor storage parameter controlling the leaf density (default 90%, configurable to 10–100). MySQL InnoDB does similar work for offline ALTER TABLE ADD INDEX. SQLite's PRAGMA incremental_vacuum rebuilds indexes in place.

The catch with bulk-load is that it produces a tightly packed tree which has minimal headroom for future inserts. If the workload is append-only or read-mostly, that's exactly what you want. If the workload is heavy random insertion, that tightly-packed tree splits aggressively for the first several million inserts as it reorganises into the steady-state 70% density. The fillfactor knob lets you tune for your expected workload.

Tuning B-trees in production — five knobs that matter

Five knobs that actually move the needle.

The first knob is page size. Postgres lets you compile with a non-default block size (default 8 KB; rebuilds with 4, 16, 32 KB are possible). InnoDB exposes innodb_page_size as 4, 8, 16, 32, 64 KB. Larger pages mean higher fan-out and lower height, but more bytes per random read on a miss and more contention per page latch. For OLTP at typical row sizes, 16 KB is the sweet spot most engines settled on.

The second is fillfactor. After a bulk load or REINDEX, leaves are nearly full. If you expect heavy insert traffic into the index, set fillfactor lower (Postgres default is 90 for B-trees) so each leaf has room before it splits. For read-only or append-only indexes, set fillfactor to 100 — denser pages, fewer levels, faster lookups.

The third is buffer pool size. The hottest pages live in memory; cold pages live on disk. Postgres' shared_buffers (default 128 MB, often increased to 25% of RAM) and InnoDB's innodb_buffer_pool_size (often 70–80% of RAM on dedicated database servers) determine how much of your indexes can be hot. The single most useful database knob, full stop.

The fourth is index ordering. Composite indexes have a leading column; B-tree access is fast for any prefix of the leading columns. CREATE INDEX ... ON t (a, b, c) serves WHERE a = ?, WHERE a = ? AND b = ?, and WHERE a = ? AND b = ? AND c = ? efficiently — but WHERE b = ? alone cannot use it. Order columns from most to least selective, or in the order most queries filter.

The fifth is vacuum and reorg. Deleted rows in Postgres leave dead tuples; the index still points to them until vacuum cleans up. Long-running transactions block vacuum, leading to bloated indexes that grow without bound. REINDEX CONCURRENTLY rebuilds the index in place. MySQL InnoDB doesn't suffer the same problem (because it uses MVCC differently) but does fragment over time; OPTIMIZE TABLE is the equivalent reorg.

A useful diagnostic

If a B-tree index is dramatically larger than a back-of-envelope estimate (rows × per-key overhead × 1.5 for fragmentation), you have bloat. Postgres exposes pgstattuple for diagnosis; MySQL InnoDB exposes information_schema.INNODB_INDEXES. A 40% bloat factor doubles your hot working set; treat it as an incident, not a quirk.

Further reading on B-trees

Primary sources, in order.

Found this useful?