10 min read · Guide · Storage
How it works · Storage · Durability

Write-ahead logging, why every durable store writes the log first.

Every durable store, from Postgres to SQLite to Kafka, writes the log first. The pages can wait. The log cannot.

Parts01–08 InteractiveCrash + recover PrereqPages / fsync

What is a write-ahead log (WAL)?

Durability, at the speed of disk.

Write-ahead logging (WAL) is the technique of writing every database modification to a sequential log before applying it to the in-place data pages. Mohan, Haderle, Lindsay et al. formalised the modern approach (ARIES) at IBM in 1992. Every durable database — Postgres, MySQL, SQLite, RocksDB, ZFS, Kafka — implements a WAL. Sequential log writes are 10-100× faster than random page writes.

A database's job is to remember things. Tell it to write a row, ack the write, then survive a power cut — the row must still be there when the lights come on. The naive approach: write the row to its page, fsync, ack. Correct, but disastrously slow — random page writes are an order of magnitude slower than sequential ones.

Write-ahead logging is the trick that keeps it fast. Every change is appended to a sequential log file before the page is touched. Once the log entry is durable, the write is durable — even if the page itself is still in memory. Pages get flushed at leisure (via checkpoints), but the log is the source of truth.


Crash recovery in action: write, crash, recover

Write, crash, recover.

Below: a tiny store. Type a key/value, write, watch the log grow and the dirty page-cache change. Hit Crash to wipe the page cache (the in-memory copy). Hit Recover to replay the log into the durable pages.

Sequential log · 0 entries
— empty —
In-memory page cache · 0 dirty
— clean —
Durable pages
— none yet —
Visible after recovery: 0 keys · empty

The one rule of WAL: log first, page later

Log first, page later.

Two artifacts on disk. The log is sequential, append-only, and gets fsync-ed on every commit. The data pages are the actual rows organized in B-trees or heaps; they live in a buffer pool and get flushed asynchronously.

The protocol: a transaction's log records hit disk before the corresponding page modifications can be flushed. That's the rule. If the page hits disk first and we crash before the log catches up, recovery has no record of the change but the data already moved — silent corruption. WAL never lets that happen.


Why sequential writes beat random writes

Sequential beats random.

Spinning disks: a random seek is a 5–10 ms head movement. Sequential writes don't move the head at all — bytes flow at the platter's linear speed (typically 100+ MB/s). SSDs reduce the gap, but page-write amplification (writing a 16 KB page to flip 100 bytes) still tips the balance heavily toward sequential.

WAL turns random table updates into sequential log appends. Hundreds of small commits per second on a single disk become possible because all of them land in one stream. The pages catch up later, in big batches, when the buffer pool's cache decides to flush.


Checkpoints keep the log from growing forever

Flush dirty pages, then trim the log that is no longer needed.

If the log grew forever, recovery would take forever. Checkpoints are the database's solution: periodically, flush every dirty page in the buffer pool, then mark the log up to that point as no-longer-needed-for-recovery. Older log segments can be archived or deleted.

Checkpoints are an I/O storm — every dirty page hits disk at once. To avoid the spike, modern databases run them gradually: spread the page flushes over the checkpoint window, smooth the I/O. Postgres calls this checkpoint_completion_target; MySQL has its own variant.


Group commit: many transactions, one fsync

Many transactions, one fsync.

fsync is expensive — typically a few milliseconds even on SSD. If every transaction did its own fsync, throughput tops out around hundreds per second. Group commit batches concurrent transactions: a transaction that wants to commit is held briefly (a small ring buffer of pending commits) until other transactions are also ready, then a single fsync seals all of them at once.

The cost is latency — your transaction may wait a millisecond or two for friends. The benefit is dramatic: thousands of commits per second per disk. Almost every database does this; some let you tune the wait window. Critical for write-heavy workloads.


How the same log also drives replication

The log streams sideways.

WAL is also the mechanism for replication. The primary streams its log to a replica; the replica replays it; the replica's state mirrors the primary, lagged by a few milliseconds. This is exactly how Postgres streaming replication works, and it's why "logical decoding" tools (Debezium, etc.) tap straight into the WAL stream to broadcast database changes to other systems.

Synchronous replication holds the commit until at least one replica acks the log entry — durable across machines, not just disks. Asynchronous replication acks first and lets the replica catch up; faster, but a primary crash can lose the in-flight tail. Most production systems pick async for performance and accept the small data-loss window during failover — see transactions for what "committed" means precisely.


The WAL pattern repeats everywhere: Postgres, MySQL, ZFS, Kafka

The pattern repeats everywhere.

Filesystems

Journals.

ext4, XFS, NTFS — every modern filesystem journals metadata operations before applying them to the disk structure. Same pattern, different name.

Kafka

A topic is a WAL.

Each Kafka partition is an append-only log; the consumer offset is the recovery pointer. The whole product is a WAL exposed as the primary API. See the Kafka guide.

LSM trees

Memtable + compaction.

RocksDB, Cassandra, ScyllaDB. The memtable is the in-memory page cache; the WAL recovers it on restart; SSTables are the eventual flush. Every read merges across layers — see the storage engine simulator.

How Postgres, MySQL, RocksDB, ZFS, and Kafka implement WAL

Same idea, different file formats.

Postgres WAL
Per-cluster WAL files of 16 MB (configurable). Streaming replication tails the WAL to standbys. Logical replication decodes the WAL into row-level events (since 9.4). The pg_basebackup + WAL-shipping pattern is the bedrock of every Postgres HA setup.
MySQL InnoDB redo log
Two redo log files (typically 48 MB each, configurable to GB). MySQL also has a separate binlog for replication and audit — historically two logs, two fsync points. MySQL 8.0+ optimised this with the innodb_redo_log_capacity dynamic sizing.
RocksDB WAL
Per-column-family WAL. Disabled by default for caches; enabled for durability. The classic tunable: WAL fsync per write (slow, durable) vs batched per N writes (fast, may lose recent writes on crash).
ZFS ZIL
The ZFS Intent Log. Separate from the main pool; can live on a fast SSD (SLOG) for sync-write performance. Database workloads on ZFS with ZIL on NVMe routinely hit 10k IOPS that the underlying spinning disks could not.
Kafka topic log
Kafka is a distributed WAL — every topic partition is an append-only log. The same primitive that gives databases durability gives Kafka its replay semantics.

The fsync question. Every WAL implementation has to decide when to fsync(). Per-transaction (Postgres default with synchronous_commit=on, MySQL with innodb_flush_log_at_trx_commit=1) gives full durability at ~1-2 ms per transaction overhead. Batched (synchronous_commit=off, group commit) trades a small loss-window for 10× throughput. The choice matters more than any other database tuning knob.



A closing note

WAL is one of those ideas that's so foundational it disappears into the background. Append, fsync, apply, checkpoint, recover. Every durable system you trust runs this loop. The next time someone tells you a database is "fast," check whether they have group commit on, what their checkpoint cadence is, and how their replicas are wired. The answers will tell you why.

Related Database scaling WAL
Found this useful?