04 / 14
Internals / 04

MVCC

Multi-version concurrency control. Every write makes a new version of the row; every read sees the version that was committed when it started. Readers never block writers, writers never block readers. Postgres, InnoDB, Oracle, and SQL Server all ship some flavour of it. Here is what the metadata looks like, where the costs hide, and why snapshot isolation is not the same as serializable.


The problem locks created

Before MVCC, a database protected each row with a read/write lock. A reader took a shared lock; a writer took an exclusive one. The semantics were clean (every history was equivalent to some serial schedule) but the latency under contention was awful.

On a hot row, every reader queued behind any in-flight writer, and every writer queued behind every reader. A 5 ms write in a single-row counter table could push concurrent read p99 from 1 ms to hundreds of milliseconds the moment the workload became mixed. "Long select blocks every update" was a familiar shape of outage for anyone who ran MySQL MyISAM or older Oracle. The standard mitigations (shorter transactions, READ UNCOMMITTED, NOLOCK hints in SQL Server) all traded correctness for throughput.

The MVCC idea

Multi-version concurrency control changes the contract. A write does not overwrite the row; it appends a new version tagged with the writing transaction's id. A read does not look at the latest version; it looks at the version that was committed at the moment the reading transaction started.

Two simultaneous invariants fall out. Readers never block writers. A SELECT picks up the version it is allowed to see and walks away. Writers never block readers. An UPDATE installs a new version while the old one stays addressable for any transaction whose snapshot still needs it. The only readers and writers that ever contend are two writers on the same row, and that contention is unavoidable.

Visibility rules

The informal rule turns into something a storage engine can check on every tuple. A transaction T with snapshot id S can see version V when all of these hold:

visible(V, T) ≡
    V.committed_xid  <  T.snapshot_xid
  ∧ V.committed_xid  ∉  T.in_flight_at_snapshot
  ∧ (V.deleted_xid   =  NULL
      ∨ V.deleted_xid  >  T.snapshot_xid
      ∨ V.deleted_xid  ∈  T.in_flight_at_snapshot)

Every MVCC database implements this predicate. They differ in where the metadata lives (in the row itself, in an undo log, in a separate version store), how snapshots are represented (a single xid plus an in-flight set, or a list of active xids), and how aggressively old versions are reclaimed. The shape of the check is the same everywhere.

Postgres — xmin, xmax, and tuples on the heap

Postgres takes the maximalist approach: every row on the heap carries its own version metadata. The per-tuple header is 24 bytes, about as much as a typical narrow row's payload, and includes xmin (the inserting xid) and xmax (the deleting or updating xid). An UPDATE writes a new tuple version somewhere on the heap and sets the old tuple's xmax to point at the writer.

semicolony=# SELECT xmin, xmax, ctid, balance FROM accounts WHERE id = 42;
  xmin   |  xmax   |  ctid   | balance
---------+---------+---------+---------
 1004210 |       0 | (12,7)  |     500   -- live version

semicolony=# UPDATE accounts SET balance = 420 WHERE id = 42;
semicolony=# SELECT xmin, xmax, ctid, balance FROM accounts WHERE id = 42;
  xmin   |  xmax   |  ctid   | balance
---------+---------+---------+---------
 1004317 |       0 | (12,9)  |     420   -- new version
 -- old tuple at (12,7) still on the page with xmax=1004317

Old versions pile up on the same pages as live data. VACUUM walks the heap, finds tuples whose xmax is committed and older than every active snapshot, and reclaims their slots. When VACUUM cannot keep up, usually because a long-running transaction holds back the global xmin horizon, pages fill with dead tuples, indexes bloat, and the working set blows out of RAM. The "Postgres bloat" stories almost always trace back to one query that ran for two hours.

Vacuum is not optional. Autovacuum runs by default but is throttled. A reporting query against a high-churn table can hold back vacuum cluster-wide and double table size in a few hours. Monitor pg_stat_activity for transactions older than your churn budget, and consider old_snapshot_threshold on Postgres 14+ to let vacuum proceed at the cost of "snapshot too old" errors for offending readers.

MySQL InnoDB — undo logs and read views

InnoDB keeps only the latest version of each row in the clustered B+ tree. Older versions live in the rollback segment as a chain of undo records, each one a delta that says "to get the previous version, apply this change in reverse." A reader with an older snapshot that lands on the live row walks the undo chain backwards until it finds a version visible to its read view.

This costs CPU on every read, typically 5–15% added latency for short undo chains and more when chains run deep, but it keeps the primary B+ tree compact. The constraint moves to undo-log space: a long-running transaction that holds the purge horizon can blow up the undo tablespace until you run out of disk. InnoDB 8.0 added online undo truncation, but the failure mode still bites.

mysql> SHOW VARIABLES LIKE 'innodb_undo_log_truncate';
+--------------------------+-------+
| Variable_name            | Value |
+--------------------------+-------+
| innodb_undo_log_truncate | ON    |
+--------------------------+-------+

mysql> SELECT name, file_size/1024/1024 AS size_mb
       FROM information_schema.innodb_tablespaces
       WHERE name LIKE 'innodb_undo%';
+-----------------+---------+
| name            | size_mb |
+-----------------+---------+
| innodb_undo_001 |  1024.0 |
| innodb_undo_002 |  1024.0 |
+-----------------+---------+

Oracle — rollback segments, since 1992

Oracle shipped production MVCC in version 7.0, in 1992, earlier than any other mainstream RDBMS by a wide margin. The design is the ancestor of InnoDB's: rollback segments (later renamed undo tablespaces) hold pre-images, and a reader that lands on a row newer than its snapshot walks back through the rollback segment to reconstruct what the row looked like at SCN S.

The famous error from this design is ORA-01555: snapshot too old, raised when a long-running query asks for a version whose undo extent has been overwritten by newer transactions. Every Oracle DBA learned to size UNDO_RETENTION against their longest expected query and to keep an eye on overcommitted undo. The same problem shows up in every undo-based system; only the names change.

SQL Server — snapshot isolation in tempdb

SQL Server's MVCC is opt-in. ALTER DATABASE … SET READ_COMMITTED_SNAPSHOT ON (RCSI) and ALLOW_SNAPSHOT_ISOLATION ON (SI) turn it on. The version store lives in tempdb as a linked list of row versions, separate from the data pages. Without the option, SQL Server falls back to lock-based read committed and the pre-MVCC blocking behaviour.

Because the version store sits in tempdb, mixing heavy MVCC workloads with other tempdb users (sorts, hash joins, temp tables) creates contention nobody saw coming. The standard operational advice, give tempdb its own fast storage and pre-size it, exists largely because of the version store.

Snapshot isolation is not serializable

What pure MVCC gives you is snapshot isolation: every transaction reads from a consistent snapshot, and writes succeed only if no concurrent transaction wrote the same row first. SI rules out many anomalies (dirty reads, non-repeatable reads, lost updates on the same row) but it does not rule out write skew.

The canonical example: two on-call doctors, each reading "at least one other doctor is on call" and then setting themselves off-call. Under SI both transactions see the same snapshot, both checks pass, both commits succeed, and the hospital ends up with nobody on call. The reads and writes touched different rows so the first-writer-wins rule never fires.

SI permits write skew. If your correctness depends on a predicate over rows that you then update elsewhere, SI is not enough. Postgres 9.1 introduced Serializable Snapshot Isolation (SSI), which tracks rw-antidependencies between concurrent transactions and aborts one of any cycle at commit. See the long-form critique in A Critique of ANSI SQL Isolation Levels for why the SQL standard's definitions never caught this in the first place.

Garbage collection trade-offs

Every MVCC system has to decide when an old version is unreachable and can be reclaimed. Reclaim too early and a long-running snapshot gets a "snapshot too old" error or, worse, a silently wrong result. Reclaim too late and the database fills with versions nobody can see.

Postgres uses VACUUM (and the autovacuum daemon) to scan heap and indexes for tuples whose xmax is older than every active snapshot. InnoDB uses a background purge thread that walks the undo log and frees pages once no read view still needs them. Oracle reuses rollback segment extents in a circular buffer governed by UNDO_RETENTION. SQL Server prunes the tempdb version store as snapshots close.

All four share the same operational hazard: a single long-running transaction — a 30-minute analytics query, a forgotten BEGIN in a psql session, a stuck connection in a leaked pool — pins the GC horizon and lets versions pile up across every table the workload touches. The visible symptom is bloat, slow scans, and missed SLO; the root cause is a transaction that should have been broken into smaller pieces.

How the implementations compare

SystemVersion storageRead path costCleanup mechanismDefault for new DBs?
PostgreSQLIn-heap, per-tuple xmin/xmaxCheap (visibility check)VACUUM / autovacuumYes — read-committed on MVCC
MySQL InnoDBUndo log (rollback segment)Walk undo chain (~5–15% added)Background purge threadYes — REPEATABLE READ on MVCC
OracleUndo tablespaceWalk undo, possible ORA-01555Circular undo, UNDO_RETENTIONYes — since v7.0 (1992)
SQL Servertempdb version store (linked list)Walk version chainBackground cleanup as snapshots closeOpt-in — RCSI / SI must be enabled

The performance story

Read throughput under MVCC scales close to linearly with cores because nothing on the read path takes a contended lock. A 64-core box doing 99% reads on a hot row reaches close to 64× single-core throughput — something a lock-based design simply cannot do. This is the reason every modern OLTP database — Postgres, MySQL/InnoDB, Oracle, SQL Server with RCSI, CockroachDB, Spanner, YugabyteDB — ships MVCC by default.

Write throughput is the more nuanced story. Each write costs roughly what a non-MVCC write would, plus the bookkeeping for the new version, plus the eventual GC. As long as cleanup keeps up with churn the steady-state cost is small. When it falls behind — long transactions, undervized autovacuum, undersized undo — write latency stays fine but read latency degrades as version chains lengthen and the working set inflates. The operational discipline that MVCC asks for is "keep your transactions short, watch your oldest snapshot, let the cleaner run." Done well, it is invisible. Done badly, it is the source of most of the war stories.

Further reading

Found this useful?