Multi-page · for storage engineers
Database internals

B-trees, LSM, MVCC —
one structure at a time.

B-trees, LSM, WAL, MVCC, isolation levels, query planning, replication, sharding, distributed SQL, the buffer pool. The structures that every storage engine assembles differently. Each one explained at the level of "what's the data structure", "what's the algorithm", and "what does this mean for the SQL you write".

All fourteen sub-pages are live. Each is a long-form walk-through with engine source links, papers, and references to the canon (Hellerstein, Stonebraker, Petrov, the Postgres source).


Live deep dives

Start here.

01 Live

B-trees

The on-disk index every relational database has shipped since 1972. Page layout, splits, fan-out, latching, copy-on-write variants. Why Postgres, MySQL, and SQLite all picked the same shape.

B+ tree ·page splits ·latching ·fan-out ·CoW B-trees
Read
02 Live

LSM-trees

RocksDB, LevelDB, Cassandra, ScyllaDB — all the same shape. Memtable, immutable SSTables, leveled compaction, tombstones, bloom filters. Why writes are sequential and reads are amplified.

memtable ·SSTable ·compaction ·bloom filters ·write amplification
Read
03 Live

Write-ahead log

The first thing every durable database does. fsync, group commit, log shipping, ARIES recovery. Why the WAL is the truth and the data files are the cache.

fsync ·group commit ·ARIES ·checkpoints ·log shipping
Read
04 Live

MVCC

Multi-version concurrency control — how Postgres, Oracle, and MySQL/InnoDB give every transaction its own snapshot of the database without locking readers behind writers.

snapshot isolation ·xmin/xmax ·undo log ·vacuum ·visibility map
Read
05 Live

Isolation levels

Read uncommitted to serializable, plus the anomalies they each forbid. Dirty reads, non-repeatable reads, phantoms, write skew. What your database actually gives you when you say SERIALIZABLE.

ANSI levels ·snapshot isolation ·serializable ·write skew ·SSI
Read
06 Live

Query planner

Parser, rewriter, planner, executor. Cost estimation, statistics, join orders, hash vs merge vs nested loop, EXPLAIN ANALYZE. Why your query is slow and what the planner thought it was doing.

cost model ·join algorithms ·EXPLAIN ·statistics ·plan caching
Read
07 Live

Replication

Statement vs row vs logical, sync vs async, leader-follower vs multi-leader. The Postgres streaming replication, MySQL binlog, MongoDB oplog, Aurora's "log is the database".

logical replication ·physical replication ·replication lag ·failover ·Aurora
Read
08 Live

Sharding

Range, hash, geo, directory. Vitess, Citus, MongoDB sharding. The hot-shard problem, the resharding problem, the cross-shard transaction problem.

range sharding ·hash sharding ·Vitess ·Citus ·cross-shard txn
Read
09 Live

Distributed SQL

Spanner, CockroachDB, YugabyteDB, TiDB. How you keep ACID semantics when the data is on a hundred machines — TrueTime, hybrid logical clocks, Raft groups per range.

Spanner ·TrueTime ·CockroachDB ·Raft per range ·distributed txn
Read
10 Live

Buffer pool & page cache

Where rows actually live in RAM. The OS page cache vs the database's own buffer pool, dirty page write-back, the double-write buffer, why O_DIRECT changes the math.

buffer pool ·page cache ·dirty pages ·O_DIRECT ·double write
Read
11 Live

Vacuum & compaction

What MVCC and LSM systems do with the garbage they accumulate. Postgres autovacuum, transaction-ID wraparound, RocksDB leveled vs universal compaction, write-amplification trade-offs.

autovacuum ·XID wraparound ·compaction ·tombstones ·write amplification
Read
12 Live

Time-series databases

Append-only, monotonically timestamped, mostly numeric. Time-partitioned chunks, Gorilla compression, downsampling, retention. How Prometheus, InfluxDB, TimescaleDB, Druid, and ClickHouse compare.

Gorilla ·chunks ·downsampling ·retention ·high cardinality
Read
13 Live

Choosing a database

The decision tree. OLTP, OLAP, KV, document, time-series, vector, graph, search — each access pattern with the named systems and the rewrite cost when the choice is wrong. Postgres is the right default until you can name the constraint that forces you off it.

decision tree ·OLTP vs OLAP ·polyglot persistence ·CDC ·rewrite cost
Read
14 Live

Index types

Every index type a working database ships — B-tree, hash, GIN, GiST, SP-GiST, BRIN, inverted, vector, bitmap. What each stores, what queries it answers, and when it makes the table slower than it was before.

B-tree ·GIN / GiST ·BRIN ·inverted ·vector (HNSW / IVF)
Read