MVCC
Multi-version concurrency control. Every write makes a new version of the row; every read sees the version that was committed when it started. Readers never block writers, writers never block readers. Postgres, InnoDB, Oracle, and SQL Server all ship some flavour of it. Here is what the metadata looks like, where the costs hide, and why snapshot isolation is not the same as serializable.
The problem locks created
Before MVCC, a database protected each row with a read/write lock. A reader took a shared lock; a writer took an exclusive one. The semantics were clean (every history was equivalent to some serial schedule) but the latency under contention was awful.
On a hot row, every reader queued behind any in-flight writer, and every writer queued behind every reader. A 5 ms write in a single-row counter table could push concurrent read p99 from 1 ms to hundreds of milliseconds the moment the workload became mixed. "Long select blocks every update" was a familiar shape of outage for anyone who ran MySQL MyISAM or older Oracle. The standard mitigations (shorter transactions, READ UNCOMMITTED, NOLOCK hints in SQL Server) all traded correctness for throughput.
The MVCC idea
Multi-version concurrency control changes the contract. A write does not overwrite the row; it appends a new version tagged with the writing transaction's id. A read does not look at the latest version; it looks at the version that was committed at the moment the reading transaction started.
Two simultaneous invariants fall out. Readers never block writers. A SELECT picks up the version it is allowed to see and walks away. Writers never block readers. An UPDATE installs a new version while the old one stays addressable for any transaction whose snapshot still needs it. The only readers and writers that ever contend are two writers on the same row, and that contention is unavoidable.
Visibility rules
The informal rule turns into something a storage engine can check on every tuple. A transaction T with snapshot id S can see version V when all of these hold:
visible(V, T) ≡
V.committed_xid < T.snapshot_xid
∧ V.committed_xid ∉ T.in_flight_at_snapshot
∧ (V.deleted_xid = NULL
∨ V.deleted_xid > T.snapshot_xid
∨ V.deleted_xid ∈ T.in_flight_at_snapshot)Every MVCC database implements this predicate. They differ in where the metadata lives (in the row itself, in an undo log, in a separate version store), how snapshots are represented (a single xid plus an in-flight set, or a list of active xids), and how aggressively old versions are reclaimed. The shape of the check is the same everywhere.
Postgres — xmin, xmax, and tuples on the heap
Postgres takes the maximalist approach: every row on the heap carries its own version
metadata. The per-tuple header is 24 bytes, about as much as a typical narrow row's
payload, and includes xmin (the inserting xid) and xmax (the
deleting or updating xid). An UPDATE writes a new tuple version somewhere on the heap and
sets the old tuple's xmax to point at the writer.
semicolony=# SELECT xmin, xmax, ctid, balance FROM accounts WHERE id = 42;
xmin | xmax | ctid | balance
---------+---------+---------+---------
1004210 | 0 | (12,7) | 500 -- live version
semicolony=# UPDATE accounts SET balance = 420 WHERE id = 42;
semicolony=# SELECT xmin, xmax, ctid, balance FROM accounts WHERE id = 42;
xmin | xmax | ctid | balance
---------+---------+---------+---------
1004317 | 0 | (12,9) | 420 -- new version
-- old tuple at (12,7) still on the page with xmax=1004317Old versions pile up on the same pages as live data. VACUUM walks the
heap, finds tuples whose xmax is committed and older than every active
snapshot, and reclaims their slots. When VACUUM cannot keep up, usually because a
long-running transaction holds back the global xmin horizon, pages fill with dead tuples,
indexes bloat, and the working set blows out of RAM. The "Postgres bloat" stories almost
always trace back to one query that ran for two hours.
pg_stat_activity for transactions older
than your churn budget, and consider old_snapshot_threshold on Postgres 14+
to let vacuum proceed at the cost of "snapshot too old" errors for offending readers.MySQL InnoDB — undo logs and read views
InnoDB keeps only the latest version of each row in the clustered B+ tree. Older versions live in the rollback segment as a chain of undo records, each one a delta that says "to get the previous version, apply this change in reverse." A reader with an older snapshot that lands on the live row walks the undo chain backwards until it finds a version visible to its read view.
This costs CPU on every read, typically 5–15% added latency for short undo chains and more when chains run deep, but it keeps the primary B+ tree compact. The constraint moves to undo-log space: a long-running transaction that holds the purge horizon can blow up the undo tablespace until you run out of disk. InnoDB 8.0 added online undo truncation, but the failure mode still bites.
mysql> SHOW VARIABLES LIKE 'innodb_undo_log_truncate';
+--------------------------+-------+
| Variable_name | Value |
+--------------------------+-------+
| innodb_undo_log_truncate | ON |
+--------------------------+-------+
mysql> SELECT name, file_size/1024/1024 AS size_mb
FROM information_schema.innodb_tablespaces
WHERE name LIKE 'innodb_undo%';
+-----------------+---------+
| name | size_mb |
+-----------------+---------+
| innodb_undo_001 | 1024.0 |
| innodb_undo_002 | 1024.0 |
+-----------------+---------+Oracle — rollback segments, since 1992
Oracle shipped production MVCC in version 7.0, in 1992, earlier than any other mainstream RDBMS by a wide margin. The design is the ancestor of InnoDB's: rollback segments (later renamed undo tablespaces) hold pre-images, and a reader that lands on a row newer than its snapshot walks back through the rollback segment to reconstruct what the row looked like at SCN S.
The famous error from this design is ORA-01555: snapshot too old, raised
when a long-running query asks for a version whose undo extent has been overwritten by
newer transactions. Every Oracle DBA learned to size UNDO_RETENTION against
their longest expected query and to keep an eye on overcommitted undo. The same problem
shows up in every undo-based system; only the names change.
SQL Server — snapshot isolation in tempdb
SQL Server's MVCC is opt-in. ALTER DATABASE … SET READ_COMMITTED_SNAPSHOT ON
(RCSI) and ALLOW_SNAPSHOT_ISOLATION ON (SI) turn it on. The version store
lives in tempdb as a linked list of row versions, separate from the data
pages. Without the option, SQL Server falls back to lock-based read committed and the
pre-MVCC blocking behaviour.
Because the version store sits in tempdb, mixing heavy MVCC workloads with other tempdb users (sorts, hash joins, temp tables) creates contention nobody saw coming. The standard operational advice, give tempdb its own fast storage and pre-size it, exists largely because of the version store.
Snapshot isolation is not serializable
What pure MVCC gives you is snapshot isolation: every transaction reads from a consistent snapshot, and writes succeed only if no concurrent transaction wrote the same row first. SI rules out many anomalies (dirty reads, non-repeatable reads, lost updates on the same row) but it does not rule out write skew.
The canonical example: two on-call doctors, each reading "at least one other doctor is on call" and then setting themselves off-call. Under SI both transactions see the same snapshot, both checks pass, both commits succeed, and the hospital ends up with nobody on call. The reads and writes touched different rows so the first-writer-wins rule never fires.
Garbage collection trade-offs
Every MVCC system has to decide when an old version is unreachable and can be reclaimed. Reclaim too early and a long-running snapshot gets a "snapshot too old" error or, worse, a silently wrong result. Reclaim too late and the database fills with versions nobody can see.
Postgres uses VACUUM (and the autovacuum daemon) to scan heap and indexes
for tuples whose xmax is older than every active snapshot. InnoDB uses a
background purge thread that walks the undo log and frees pages once no read view still
needs them. Oracle reuses rollback segment extents in a circular buffer governed by
UNDO_RETENTION. SQL Server prunes the tempdb version store as snapshots
close.
All four share the same operational hazard: a single long-running transaction — a 30-minute
analytics query, a forgotten BEGIN in a psql session, a stuck connection in a
leaked pool — pins the GC horizon and lets versions pile up across every table the
workload touches. The visible symptom is bloat, slow scans, and missed SLO; the root cause
is a transaction that should have been broken into smaller pieces.
How the implementations compare
| System | Version storage | Read path cost | Cleanup mechanism | Default for new DBs? |
|---|---|---|---|---|
| PostgreSQL | In-heap, per-tuple xmin/xmax | Cheap (visibility check) | VACUUM / autovacuum | Yes — read-committed on MVCC |
| MySQL InnoDB | Undo log (rollback segment) | Walk undo chain (~5–15% added) | Background purge thread | Yes — REPEATABLE READ on MVCC |
| Oracle | Undo tablespace | Walk undo, possible ORA-01555 | Circular undo, UNDO_RETENTION | Yes — since v7.0 (1992) |
| SQL Server | tempdb version store (linked list) | Walk version chain | Background cleanup as snapshots close | Opt-in — RCSI / SI must be enabled |
The performance story
Read throughput under MVCC scales close to linearly with cores because nothing on the read path takes a contended lock. A 64-core box doing 99% reads on a hot row reaches close to 64× single-core throughput — something a lock-based design simply cannot do. This is the reason every modern OLTP database — Postgres, MySQL/InnoDB, Oracle, SQL Server with RCSI, CockroachDB, Spanner, YugabyteDB — ships MVCC by default.
Write throughput is the more nuanced story. Each write costs roughly what a non-MVCC write would, plus the bookkeeping for the new version, plus the eventual GC. As long as cleanup keeps up with churn the steady-state cost is small. When it falls behind — long transactions, undervized autovacuum, undersized undo — write latency stays fine but read latency degrades as version chains lengthen and the working set inflates. The operational discipline that MVCC asks for is "keep your transactions short, watch your oldest snapshot, let the cleaner run." Done well, it is invisible. Done badly, it is the source of most of the war stories.
Further reading
- Berenson, Bernstein, Gray, Melton, O'Neil, O'Neil (1995) — A Critique of ANSI SQL Isolation Levels — the paper that named snapshot isolation, exposed write skew, and reset how the industry talks about isolation.
- PostgreSQL — Concurrency Control (chapter 13) — the canonical reference on Postgres's visibility rules, snapshot management, and VACUUM.
- Ports & Grittner (2012) — Serializable Snapshot Isolation in PostgreSQL — how Postgres 9.1 added true serializability on top of SI by tracking rw-antidependency cycles.
- Fekete, O'Neil & O'Neil (2004) — A Read-Only Transaction Anomaly Under Snapshot Isolation — a further SI weakness: even a read-only transaction can observe a non-serializable result.
- MySQL — InnoDB Multi-Versioning — the official walkthrough of undo logs, read views, and purge.
- Bernstein, Hadzilacos, Goodman (1987) — Concurrency Control and Recovery in Database Systems — the textbook. Out of print, free PDF on the authors' sites. Chapters 4 and 5 cover multi-version schedulers in full generality.
- Suzuki — The Internals of PostgreSQL, ch. 5 (Concurrency Control) — deep walkthrough of Postgres's t_xmin, t_xmax, t_ctid, and the visibility predicate at code level.
- Semicolony — Isolation levels — how the SI/SSI distinction maps onto the SQL standard's levels and what each database actually delivers under each name.