03 / 14
Internals / 03

Write-ahead logging

The write-ahead log is the foundation every transactional database stands on. It turns the impossible promise, survive a crash without losing committed data, into a single rule: log the change, fsync the log, then touch the page. ARIES, group commit, logical decoding, streaming replication, and every modern CDC pipeline all flow from that one ordering.


The setup

A database has to survive crashes without losing committed data. The naive way is to write every modified page to durable storage before acknowledging the commit. On a B-tree with 8 KB pages, a single one-row update can dirty several pages: index leaf, heap, a free-space map entry. Forcing all of them to disk per transaction is six or seven random fsyncs and tens of milliseconds of latency. No real OLTP database does this.

The trick is to write only a tiny description of the change, a log record, to a single append-only file (the WAL), fsync that, and let the modified pages drift back to disk lazily. The log is sequential, so the only seek the disk does per commit is the head moving forward by a few hundred bytes. Pages get flushed later, in big batched I/O, when the buffer pool decides or a checkpoint forces it. Durability comes from the log; performance comes from deferring the page writes.

The protocol

Every change to a page is described by a log record: which page, which byte range, the before-image (for undo) and the after-image (for redo), the transaction id, and a monotonic LSN (Log Sequence Number). The protocol has two hard rules:

  • WAL rule. The log record describing a change must be on durable storage before the modified page is allowed to be written back to disk. The page header carries a pageLSN; the buffer manager refuses to evict a page whose pageLSN exceeds the WAL's flushed LSN.
  • Commit rule. A transaction is not durable until the commit record is on disk. The client sees a successful commit only after the fsync of the log returns.

Those two rules are the whole "write-ahead" contract. Everything else, group commit, recovery, replication, is detail built on top of them.

ARIES (1992)

ARIES (Algorithms for Recovery and Isolation Exploiting Semantics, Mohan et al., IBM Almaden 1992) is the canonical write-ahead logging algorithm. It's what Postgres, InnoDB, DB2, and SQL Server all implement variations of. See the paper walk-through for the original. The pieces that matter:

  • LSN. Every log record has a unique monotonic id. Pages carry the LSN of the most recent change applied to them.
  • page-LSN comparison. During redo, a record at LSN L is applied to a page only if pageLSN < L. This makes redo idempotent.
  • Undo records. The before-image needed to roll back an uncommitted transaction.
  • Redo records. The after-image needed to replay a committed change whose page never made it to disk.
  • CLR. Compensation log records, written during undo so that a crash during recovery itself doesn't lose the undo work.

Three-pass recovery

After a crash, ARIES walks the log three times:

  1. Analysis. Scan forward from the last checkpoint. Rebuild the transaction table (which transactions were active at the crash) and the dirty page table (which pages may have un-flushed changes). The output is "where to start redoing" and "what to undo".
  2. Redo. Scan forward from the earliest dirty-page LSN, applying every log record whose target page has pageLSN < recordLSN. This brings the database back to the exact state it was in at the moment of the crash, including the uncommitted work of doomed transactions.
  3. Undo. For every transaction in the transaction table that didn't commit, walk its log records backward and apply the undo images. Each undo write produces a CLR so that a crash mid-undo can resume.

Recovery time is bounded by the log distance since the last checkpoint, which is why checkpoints matter (more on this below).

Group commit

An fsync on a modern NVMe device costs ~50–200 microseconds; on a cloud network block device (EBS gp3, Azure managed disk), 1–3 milliseconds. At 1 ms per fsync, a single thread can do 1,000 commits/second. A database with 200 concurrent connections that each commit once a second wants 200 fsyncs/second of throughput from one disk head. The fix is group commit: batch the fsyncs.

When a transaction reaches its commit point, it appends its commit record to the WAL buffer and waits a few microseconds for other transactions to do the same. The first thread to win the race issues one fsync covering everyone's records; all of them return to the client. The cost of the fsync is paid once and amortised across the batch. Postgres exposes this as commit_delay and commit_siblings; MySQL InnoDB uses innodb_flush_log_at_trx_commit (1 = fsync per commit, 2 = write to the OS per commit but fsync ~once per second, 0 = write and fsync ~once per second).

Why WAL fsync dominates write latency. The WAL fsync is the only synchronous disk operation on a commit path. Page writes are async, index updates are in-memory, undo records are buffered. If your p99 commit latency is 4 ms on EBS, it's the fsync. Group commit pays for itself the moment your commit rate exceeds 1 / fsync-time.

Synchronous vs asynchronous replication

Once you have a WAL, replication is "ship the log". The question is when the primary considers a transaction committed.

  • Synchronous. The primary waits for at least one replica to ack receipt (or apply) of the log record before returning commit. Postgres' synchronous_commit = on with synchronous_standby_names set; MySQL's semi-sync replication. Throughput drops to the RTT-bounded ceiling (at 1 ms RTT, ~1,000 commits/sec per session). The win is zero data loss on primary crash.
  • Asynchronous. The primary acks the client immediately and ships log records best-effort. Throughput is the same as a standalone primary. The loss is whatever was in flight when the primary died, usually tens to hundreds of milliseconds of writes.

Most production setups split: one synchronous replica in the same AZ (low RTT, durable ack) plus N asynchronous replicas across regions (for read scale and DR). Postgres supports this directly via synchronous_standby_names = 'ANY 1 (replica_a, replica_b)'.

Checkpoints

Without checkpoints, the log grows forever and recovery has to replay everything since the database was created. A checkpoint does two things: it flushes all dirty pages in the buffer pool to disk, and it writes a checkpoint marker into the log that says "every page on disk is at least this LSN". Recovery only needs to consider log records past the last completed checkpoint.

A naive (sharp) checkpoint stops the world while it flushes, which is unacceptable on a busy database. ARIES introduces fuzzy checkpoints: the checkpoint record is written immediately, dirty pages are flushed in the background, and recovery just has to start from the earliest dirty-page LSN at the time the checkpoint began. Postgres' checkpoint_timeout and max_wal_size bound how often this happens; the trade-off is recovery time vs steady-state I/O.

Log shipping vs logical replication

There are two ways to consume a WAL on a replica.

  • Physical (log shipping). Ship the raw WAL bytes; the replica applies them block-for-block to identical on-disk pages. Postgres streaming replication and MySQL's row-based binlog in physical mode both do this. The replica must be byte-for-byte identical to the primary: same schema, same indexes, same Postgres version, same architecture. Replay is cheap because it's just memcpy into pages.
  • Logical. Parse the WAL into row-level changes: INSERT, UPDATE, DELETE with primary key and column values, then ship those. The downstream doesn't have to be the same database. Postgres logical decoding with pgoutput or wal2json, MySQL binlog in row format, Debezium. Replay is more expensive because the consumer rebuilds row state, but the output is portable.

WAL as a stream

Every modern change-data-capture pipeline reads the WAL. Debezium parses Postgres logical decoding output and MySQL binlogs into Kafka topics. Maxwell does the MySQL binlog directly. AWS DMS, Fivetran, Airbyte's CDC connectors are all WAL readers. Stripe's data infrastructure famously runs the entire warehouse off binlog streams; LinkedIn's Databus was an early production version of the same idea.

The WAL is the source of truth. Once a database exposes its WAL as a stream, that stream becomes the canonical change log for the rest of the system: caches, search indexes, analytics warehouses, downstream microservices, audit logs. The database stops being just storage and starts being the head of a data pipeline. This is the architectural shift behind "log-oriented architectures" (Kreps, 2013) and the entire modern CDC tooling ecosystem.

Storage layout

Every engine lays its WAL out a little differently:

EngineFilesConfig + behaviour
Postgres pg_wal/ (16 MB segments) synchronous_commit, wal_level = replica|logical, max_wal_size, archive_command. Logical decoding via replication slots.
MySQL InnoDB ib_logfile0, ib_logfile1 (circular) innodb_flush_log_at_trx_commit (0/1/2), innodb_log_file_size. Redo log is for crash recovery; binlog is a separate log for replication and CDC.
SQLite WAL mode main.db-wal, main.db-shm PRAGMA journal_mode=WAL, PRAGMA synchronous=NORMAL|FULL. Readers don't block writers; the WAL is checkpointed back into the main file periodically.
RocksDB *.log in the DB directory WriteOptions::sync, WAL_ttl_seconds. Per-CF or shared WAL; recovery replays into the memtable on open.

A sample of what's actually inside a Postgres WAL segment, as shown by pg_waldump:

rmgr: Heap    len (rec/tot): 54/  54, tx: 1024, lsn: 0/01A4F1B8,
  prev 0/01A4F178, desc: INSERT off 12 flags 0x00, blkref #0: rel 1663/16384/24576 blk 8

rmgr: Btree   len (rec/tot): 72/  72, tx: 1024, lsn: 0/01A4F1F0,
  prev 0/01A4F1B8, desc: INSERT_LEAF off 6, blkref #0: rel 1663/16384/24577 blk 2

rmgr: Transaction len: 34/ 34, tx: 1024, lsn: 0/01A4F238,
  prev 0/01A4F1F0, desc: COMMIT 2026-05-19 14:22:08.119 UTC

Three records, heap insert, index leaf insert, commit, and the commit record at LSN 0/01A4F238 is what fsync had to reach before the client got its OK.

Further reading

Found this useful?