Write-ahead logging, why every durable store writes the log first.
Every durable store, from Postgres to SQLite to Kafka, writes the log first. The pages can wait. The log cannot.
What is a write-ahead log (WAL)?
Durability, at the speed of disk.
Write-ahead logging (WAL) is the technique of writing every database modification to a sequential log before applying it to the in-place data pages. Mohan, Haderle, Lindsay et al. formalised the modern approach (ARIES) at IBM in 1992. Every durable database — Postgres, MySQL, SQLite, RocksDB, ZFS, Kafka — implements a WAL. Sequential log writes are 10-100× faster than random page writes.
A database's job is to remember things. Tell it to write a row, ack the write, then survive a power cut — the row must still be there when the lights come on. The naive approach: write the row to its page, fsync, ack. Correct, but disastrously slow — random page writes are an order of magnitude slower than sequential ones.
Write-ahead logging is the trick that keeps it fast. Every change is appended to a sequential log file before the page is touched. Once the log entry is durable, the write is durable — even if the page itself is still in memory. Pages get flushed at leisure (via checkpoints), but the log is the source of truth.
Crash recovery in action: write, crash, recover
Write, crash, recover.
Below: a tiny store. Type a key/value, write, watch the log grow and the dirty page-cache change. Hit Crash to wipe the page cache (the in-memory copy). Hit Recover to replay the log into the durable pages.
The one rule of WAL: log first, page later
Log first, page later.
Two artifacts on disk. The log is sequential, append-only, and gets fsync-ed on every commit. The data pages are the actual rows organized in B-trees or heaps; they live in a buffer pool and get flushed asynchronously.
The protocol: a transaction's log records hit disk before the corresponding page modifications can be flushed. That's the rule. If the page hits disk first and we crash before the log catches up, recovery has no record of the change but the data already moved — silent corruption. WAL never lets that happen.
Why sequential writes beat random writes
Sequential beats random.
Spinning disks: a random seek is a 5–10 ms head movement. Sequential writes don't move the head at all — bytes flow at the platter's linear speed (typically 100+ MB/s). SSDs reduce the gap, but page-write amplification (writing a 16 KB page to flip 100 bytes) still tips the balance heavily toward sequential.
WAL turns random table updates into sequential log appends. Hundreds of small commits per second on a single disk become possible because all of them land in one stream. The pages catch up later, in big batches, when the buffer pool's cache decides to flush.
Checkpoints keep the log from growing forever
Flush dirty pages, then trim the log that is no longer needed.
If the log grew forever, recovery would take forever. Checkpoints are the database's solution: periodically, flush every dirty page in the buffer pool, then mark the log up to that point as no-longer-needed-for-recovery. Older log segments can be archived or deleted.
Checkpoints are an I/O storm — every dirty page hits disk at once. To avoid the spike, modern databases run them gradually: spread the page flushes over the checkpoint window, smooth the I/O. Postgres calls this checkpoint_completion_target; MySQL has its own variant.
Group commit: many transactions, one fsync
Many transactions, one fsync.
fsync is expensive — typically a few milliseconds even on SSD. If every transaction did its own fsync, throughput tops out around hundreds per second. Group commit batches concurrent transactions: a transaction that wants to commit is held briefly (a small ring buffer of pending commits) until other transactions are also ready, then a single fsync seals all of them at once.
The cost is latency — your transaction may wait a millisecond or two for friends. The benefit is dramatic: thousands of commits per second per disk. Almost every database does this; some let you tune the wait window. Critical for write-heavy workloads.
How the same log also drives replication
The log streams sideways.
WAL is also the mechanism for replication. The primary streams its log to a replica; the replica replays it; the replica's state mirrors the primary, lagged by a few milliseconds. This is exactly how Postgres streaming replication works, and it's why "logical decoding" tools (Debezium, etc.) tap straight into the WAL stream to broadcast database changes to other systems.
Synchronous replication holds the commit until at least one replica acks the log entry — durable across machines, not just disks. Asynchronous replication acks first and lets the replica catch up; faster, but a primary crash can lose the in-flight tail. Most production systems pick async for performance and accept the small data-loss window during failover — see transactions for what "committed" means precisely.
The WAL pattern repeats everywhere: Postgres, MySQL, ZFS, Kafka
The pattern repeats everywhere.
Journals.
ext4, XFS, NTFS — every modern filesystem journals metadata operations before applying them to the disk structure. Same pattern, different name.
A topic is a WAL.
Each Kafka partition is an append-only log; the consumer offset is the recovery pointer. The whole product is a WAL exposed as the primary API. See the Kafka guide.
Memtable + compaction.
RocksDB, Cassandra, ScyllaDB. The memtable is the in-memory page cache; the WAL recovers it on restart; SSTables are the eventual flush. Every read merges across layers — see the storage engine simulator.
How Postgres, MySQL, RocksDB, ZFS, and Kafka implement WAL
Same idea, different file formats.
- Postgres WAL
- Per-cluster WAL files of 16 MB (configurable). Streaming replication tails the WAL to standbys. Logical replication decodes the WAL into row-level events (since 9.4). The pg_basebackup + WAL-shipping pattern is the bedrock of every Postgres HA setup.
- MySQL InnoDB redo log
- Two redo log files (typically 48 MB each, configurable to GB). MySQL also has a separate binlog for replication and audit — historically two logs, two fsync points. MySQL 8.0+ optimised this with the innodb_redo_log_capacity dynamic sizing.
- RocksDB WAL
- Per-column-family WAL. Disabled by default for caches; enabled for durability. The classic tunable: WAL fsync per write (slow, durable) vs batched per N writes (fast, may lose recent writes on crash).
- ZFS ZIL
- The ZFS Intent Log. Separate from the main pool; can live on a fast SSD (SLOG) for sync-write performance. Database workloads on ZFS with ZIL on NVMe routinely hit 10k IOPS that the underlying spinning disks could not.
- Kafka topic log
- Kafka is a distributed WAL — every topic partition is an append-only log. The same primitive that gives databases durability gives Kafka its replay semantics.
The fsync question. Every WAL implementation has to decide when to fsync(). Per-transaction (Postgres default with synchronous_commit=on, MySQL with innodb_flush_log_at_trx_commit=1) gives full durability at ~1-2 ms per transaction overhead. Batched (synchronous_commit=off, group commit) trades a small loss-window for 10× throughput. The choice matters more than any other database tuning knob.
WAL is one of those ideas that's so foundational it disappears into the background. Append, fsync, apply, checkpoint, recover. Every durable system you trust runs this loop. The next time someone tells you a database is "fast," check whether they have group commit on, what their checkpoint cadence is, and how their replicas are wired. The answers will tell you why.
Further reading on write-ahead logging
Primary sources, in order.
- Postgres docsWrite-Ahead LoggingThe clearest authoritative explanation. Read it before tuning Postgres.
- Mohan et al · 1992ARIES — the WAL recovery algorithmThe seminal paper. Most production WAL implementations are ARIES variants.
- Semicolony guideDatabase indexingThe B-tree pages WAL is protecting. Pairs naturally with this.
- Semicolony guideKafka, as a riverA WAL exposed as the primary API. The same pattern at the system level.
- Semicolony simulatorStorage engineLSM vs B-tree side by side, both backed by a WAL.
- Semicolony deep diveHow SSDs work — FTL and write amplificationWhy an enterprise SSD with a supercap honors fsync at ~30,000/sec while a consumer SSD lies. The hardware below the WAL durability contract.