Multi-page · for storage engineers

Database internals

B-trees, LSM, MVCC —
one structure at a time.

B-trees, LSM, WAL, MVCC, isolation levels, query planning, replication, sharding, distributed SQL, the buffer pool. The structures that every storage engine assembles differently. Each one explained at the level of "what's the data structure", "what's the algorithm", and "what does this mean for the SQL you write".

All fourteen sub-pages are live. Each is a long-form walk-through with engine source links, papers, and references to the canon (Hellerstein, Stonebraker, Petrov, the Postgres source).

Live deep dives

Start here.

01 Live

B-trees

The on-disk index every relational database has shipped since 1972. Page layout, splits, fan-out, latching, copy-on-write variants. Why Postgres, MySQL, and SQLite all picked the same shape.

B+ tree ·page splits ·latching ·fan-out ·CoW B-trees

Read

02 Live

LSM-trees

RocksDB, LevelDB, Cassandra, ScyllaDB — all the same shape. Memtable, immutable SSTables, leveled compaction, tombstones, bloom filters. Why writes are sequential and reads are amplified.

memtable ·SSTable ·compaction ·bloom filters ·write amplification

Read

03 Live

Write-ahead log

The first thing every durable database does. fsync, group commit, log shipping, ARIES recovery. Why the WAL is the truth and the data files are the cache.

fsync ·group commit ·ARIES ·checkpoints ·log shipping

Read

04 Live

MVCC

Multi-version concurrency control — how Postgres, Oracle, and MySQL/InnoDB give every transaction its own snapshot of the database without locking readers behind writers.

snapshot isolation ·xmin/xmax ·undo log ·vacuum ·visibility map

Read

05 Live

Isolation levels

Read uncommitted to serializable, plus the anomalies they each forbid. Dirty reads, non-repeatable reads, phantoms, write skew. What your database actually gives you when you say SERIALIZABLE.

ANSI levels ·snapshot isolation ·serializable ·write skew ·SSI

Read

06 Live

Query planner

Parser, rewriter, planner, executor. Cost estimation, statistics, join orders, hash vs merge vs nested loop, EXPLAIN ANALYZE. Why your query is slow and what the planner thought it was doing.

cost model ·join algorithms ·EXPLAIN ·statistics ·plan caching

Read

07 Live

Replication

Statement vs row vs logical, sync vs async, leader-follower vs multi-leader. The Postgres streaming replication, MySQL binlog, MongoDB oplog, Aurora's "log is the database".

logical replication ·physical replication ·replication lag ·failover ·Aurora

Read

08 Live

Sharding

Range, hash, geo, directory. Vitess, Citus, MongoDB sharding. The hot-shard problem, the resharding problem, the cross-shard transaction problem.

range sharding ·hash sharding ·Vitess ·Citus ·cross-shard txn

Read

09 Live

Distributed SQL

Spanner, CockroachDB, YugabyteDB, TiDB. How you keep ACID semantics when the data is on a hundred machines — TrueTime, hybrid logical clocks, Raft groups per range.

Spanner ·TrueTime ·CockroachDB ·Raft per range ·distributed txn

Read

10 Live

Buffer pool & page cache

Where rows actually live in RAM. The OS page cache vs the database's own buffer pool, dirty page write-back, the double-write buffer, why O_DIRECT changes the math.

buffer pool ·page cache ·dirty pages ·O_DIRECT ·double write

Read

11 Live

Vacuum & compaction

What MVCC and LSM systems do with the garbage they accumulate. Postgres autovacuum, transaction-ID wraparound, RocksDB leveled vs universal compaction, write-amplification trade-offs.

autovacuum ·XID wraparound ·compaction ·tombstones ·write amplification

Read

12 Live

Time-series databases

Append-only, monotonically timestamped, mostly numeric. Time-partitioned chunks, Gorilla compression, downsampling, retention. How Prometheus, InfluxDB, TimescaleDB, Druid, and ClickHouse compare.

Gorilla ·chunks ·downsampling ·retention ·high cardinality

Read

13 Live

Choosing a database

The decision tree. OLTP, OLAP, KV, document, time-series, vector, graph, search — each access pattern with the named systems and the rewrite cost when the choice is wrong. Postgres is the right default until you can name the constraint that forces you off it.

decision tree ·OLTP vs OLAP ·polyglot persistence ·CDC ·rewrite cost

Read

14 Live

Index types

Every index type a working database ships — B-tree, hash, GIN, GiST, SP-GiST, BRIN, inverted, vector, bitmap. What each stores, what queries it answers, and when it makes the table slower than it was before.

B-tree ·GIN / GiST ·BRIN ·inverted ·vector (HNSW / IVF)

Read

B-trees, LSM, MVCC —one structure at a time.

Start here.

B-trees

LSM-trees

Write-ahead log

MVCC

Isolation levels

Query planner

Replication

Sharding

Distributed SQL

Buffer pool & page cache

Vacuum & compaction

Time-series databases

Choosing a database

Index types

The database curriculum.

B-trees, LSM, MVCC —
one structure at a time.