How a database writes a row.
Between "the client sends INSERT" and "the database says OK" there are five steps and one moment that costs more than all the others combined. The WAL append is fast. The buffer-pool modify is fast. The fsync to disk that makes the change durable is slow — and that's the one the database can't skip. Watch the row travel through the engine, see exactly where the time goes, and then watch the WAL save the row when the server crashes mid-transaction.
INSERT INTO users (id, name) VALUES (42, 'Alice'); what actually happens between INSERT and "OK" · plus what survives a crashEmpty state. The client is about to send an INSERT. The database has an empty buffer pool (in-memory page cache), an empty WAL buffer, and one page on disk holding the existing users table.
- Buffer pool
- In-memory cache of database pages. Reads and writes go here first; disk is only touched when needed.
- WAL
- Write-ahead log. An append-only record of every change. Always written to disk before the change is considered durable.
- Page
- The unit of disk I/O for a database. Usually 8 KB or 16 KB. Contains many rows.
The "write-ahead" trick, in one paragraph
Writing to the data page is expensive — it's a random write into the middle of a multi-GB file. Writing to the WAL is cheap — it's an append to the end of a sequential file. The trick: do the cheap write first, sync it to disk, then ACK the client. The data page gets updated in memory; the slow random write to disk happens later, in batches, asynchronously. If the server crashes between the ACK and the page write, recovery replays the WAL to rebuild the page. The client never knows.
Why fsync is the whole story
Every "how do I make my database faster?" question eventually arrives here. The fsync cost is the wall — it's why a transaction commits in milliseconds on SSD and tens of milliseconds on HDD. Group commit batches many transactions' WAL writes into one fsync; that's how Postgres does 10K commits/sec on a single SSD. Synchronous replication adds another fsync's worth (and a network round-trip); that's why HA setups always run slower than single-node.
The flip side: tools that lie about fsync (some consumer SSDs, some misconfigured RAID controllers) are very fast and also lose your data on power-cut. The classic Jepsen-style finding is "vendor X claims durability, here's the test that shows X loses writes." fsync honesty is the foundation of every claim about a database's safety.
What this visualization simplifies
- One page, one record. Real INSERTs often touch index pages too, each needing its own WAL records.
- No MVCC. Postgres and most modern DBs write a new row version; the old version stays around until VACUUM. The WAL covers both.
- No replication. In an HA setup, the commit also waits for the WAL to reach standby replicas.
- Single-threaded. Real engines have hundreds of concurrent transactions sharing the buffer pool and WAL.
- Synchronous commits only. Async commit (Postgres
synchronous_commit = off) skips waiting for fsync, gaining throughput at the cost of recent writes on a crash.
WAL — the full chapter →
Group commit, checkpointing strategies, ARIES recovery, the way Postgres / MySQL / SQLite each implement it.
Open the Codex