11 / 14
Internals / 11

Vacuum & compaction

MVCC databases like Postgres keep old row versions for snapshot reads. LSM databases like RocksDB and Cassandra append new values without touching the old ones. Both designs trade write speed for a debt that has to be paid in the background. Vacuum and compaction pay that debt. When they fall behind, production breaks.


Why dead data accumulates

Both MVCC and LSM are no-overwrite designs. They optimise the write path by refusing to modify data in place, and they hand the reclamation problem to a background process.

In Postgres MVCC, an UPDATE never overwrites the old row. It writes a new tuple and marks the old one with xmax, the transaction ID that obsoleted it. The old tuple stays on disk because some long-running snapshot might still need to see it. A DELETE is the same shape: the row is marked, not erased.

In an LSM tree, an update writes a new key/value pair to the memtable, which eventually lands in a fresh SSTable. The older value, sitting in some older SSTable at a lower level, is still on disk. Reads find the newest version first and stop there, but the old bytes are real.

Without reclamation, both engines bloat without bound. Postgres tables grow until the heap is mostly dead tuples; LSM trees grow until every read has to walk through every level. The work to keep them bounded is what this page is about.

Postgres VACUUM

VACUUM is the Postgres process that walks a relation, identifies tuples whose xmax is older than every running transaction, and marks the space they occupy as reusable in the free space map. It does three other things along the way that matter just as much:

  • Updates the visibility map so future queries can skip pages where every tuple is visible to all transactions. This is what makes index-only scans possible.
  • Freezes old transaction IDs by replacing them with a special FrozenXID marker, so the tuple becomes visible to every future transaction without consulting the XID clock.
  • Updates planner statistics if you ran VACUUM ANALYZE.

AUTOVACUUM runs VACUUM in the background. It wakes up every autovacuum_naptime (default 1 minute), looks at pg_stat_all_tables, and launches a worker for any table where dead tuples exceed autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor × n_live_tup.

VACUUM vs VACUUM FULL

Regular VACUUM marks dead-tuple space as reusable inside existing pages. The on-disk file size doesn't change; new inserts fill the holes. This is usually what you want: it's online, it takes a non-blocking SHARE UPDATE EXCLUSIVE lock, and it runs continuously without operator attention.

VACUUM FULL rewrites the entire table into a fresh file with no dead tuples, then swaps the relfilenode. The on-disk size shrinks to the live-tuple size. The cost: an ACCESS EXCLUSIVE lock for the duration, which blocks every read and write. Only run it in a maintenance window, and only when bloat is so bad that the reusable-space approach can't catch up.

The pg_repack and pg_squeeze extensions do the same shrink-on-disk work without the exclusive lock by keeping a shadow table and swapping at the end. They are the standard production answer when bloat needs to actually go away.

The bloat problem

Autovacuum tuning is the single most common Postgres-in-production issue. The default autovacuum_vacuum_scale_factor of 0.2 means a table is only vacuumed once 20% of its rows are dead. For a 1-million-row table that's fine. For a 1-billion-row table that's 200 million dead tuples accumulating before a single vacuum runs, and the vacuum itself takes hours.

The standard fix is per-table overrides on the hot tables:

ALTER TABLE orders SET (
  autovacuum_vacuum_scale_factor = 0.02,
  autovacuum_vacuum_threshold    = 1000,
  autovacuum_analyze_scale_factor = 0.01
);

To see how bloated a table actually is, the pgstattuple extension reports the live-vs-dead ratio exactly (at the cost of a full scan), and pg_stat_user_tables.n_dead_tup gives a free estimate that's good enough for monitoring. A healthy ratio sits below 20%; anything above 50% is a tuning conversation.

XID wraparound

Postgres transaction IDs are 32-bit. The space wraps every ~4 billion transactions, and MVCC uses modular arithmetic on a 2-billion-transaction window to decide which XIDs are in the past and which are in the future. If a tuple's xmin falls more than 2 billion transactions behind the current XID, the database can no longer tell whether the row is visible.

Vacuum's freeze step prevents this: it rewrites old xmin values to the special FrozenXID sentinel, which is always visible. If autovacuum can't keep up, usually because long-running transactions block it, or because vacuum throughput is capped too low, the database approaches wraparound. At autovacuum_freeze_max_age (200M by default) Postgres launches forced anti-wraparound vacuums; at 2B it shuts down writes entirely and demands a single-user vacuum.

Why XID wraparound is the Postgres footgun. Sentry's 2015 outage, Mailchimp's 2019 incident, and several Heroku Postgres events all traced back to anti-wraparound vacuums blocking writes on multi-TB tables. The pathology is always the same: a long-running transaction (a leaked replication slot, an idle-in-transaction session, a stuck logical decoder) pins the xmin horizon, vacuum can't advance, XID age climbs, and by the time alerting fires you have hours of forced vacuum ahead of you. Alert on pg_database.datfrozenxid age, not on vacuum lag.

LSM compaction strategies

LSM compaction is the mirror-image problem. Where MVCC has dead row versions, LSM has obsolete SSTable entries: older keys shadowed by newer writes, tombstones, expired TTL data. Compaction reads N SSTables, merges them by key, drops the obsolete versions, and writes M (usually M < N) replacement files. The strategy you pick decides where on the read/write/space-amp triangle you land:

StrategyWhere it livesRead ampWrite amp
Leveled (LCS) RocksDB default, Cassandra LCS Low; non-overlapping ranges per level High; each level rewritten ~10×
Size-tiered (STCS) Cassandra default, RocksDB universal High; multiple overlapping files per tier Low; merge only when N similar-sized files exist
Time-window (TWCS) Cassandra time-series Low for time-range reads Low; files segregated by write time, dropped whole on TTL

Leveled fits read-heavy OLTP workloads on flash: read latency is bounded, the storage footprint is small. Size-tiered fits write-heavy ingest where read latency matters less. Time-window fits append-only metrics where the cleanup story is "drop the whole file when it ages out".

Write amplification

Every record in a leveled LSM is rewritten on every level it passes through. With 10 levels and a per-level fan-out of 10, a single logical write turns into roughly 10 physical writes to the underlying storage. RocksDB reports this as the write amplification factor (WAF), and in production it's usually between 5 and 30 depending on workload and level count.

This matters because SSDs have finite write endurance, usually rated in drive writes per day (DWPD) over a 3 or 5 year warranty. A consumer TLC drive at 0.3 DWPD with a database doing WAF=20 has effectively 0.015 DWPD of logical write budget. Pick the wrong compaction strategy on the wrong drive class and you burn out the disk before the warranty is up.

** Compaction Stats [default] **
Level   Files   Size     Score  Read(GB)  Write(GB)  W-Amp
  L0     4/0    256 MB    0.8       0.0       12.4    1.0
  L1     5/0    1.0 GB    1.0      12.4       24.8    2.0
  L2    51/0   10.1 GB    1.0      24.8       49.7    2.0
  L3   510/0  100.0 GB    1.0      49.7       99.3    2.0
  L4  5102/0    1.0 TB    1.0      99.3      198.6    2.0
  Sum 5672/0    1.1 TB    -       186.2      384.8    9.4

Sum W-Amp 9.4 reads as: for every 1 GB of new application data, 9.4 GB hit the disk over its lifetime in the tree. That's the number to track.

Compaction stalls

Compaction runs on a thread pool with bounded throughput. If writes arrive faster than compaction can keep up, the level-0 file count grows, the read amplification climbs, and eventually the engine has to push back on writers.

RocksDB exposes two thresholds: level0_slowdown_writes_trigger (default 20) starts artificially delaying writes when L0 has that many files; level0_stop_writes_trigger (default 36) blocks them entirely until compaction catches up. Cassandra has the equivalent in compaction_throughput_mb_per_sec. Set it too low and write latency spikes during ingest bursts.

The diagnostic is always the same: write latency suddenly jumps from microseconds to seconds, and the metrics show pending compactions climbing without bound. The fix is either to throttle ingest, raise compaction throughput (more threads, higher MB/s cap), or, most often, to scale the cluster.

Tombstones

Deletions in an LSM are just writes. A delete inserts a tombstone: a key with a marker that says "this is gone as of timestamp T". The tombstone shadows older values during reads and only gets physically removed when compaction can prove no older SSTable at any level still holds the key.

That proof is expensive, so tombstones live a long time. A workload that deletes heavily (TTL'd queues, soft-deleted user records, expired sessions) can fill SSTables with tombstones that slow scans badly. Cassandra's well-known "tombstone scan" pathology has a partition with millions of expired tombstones taking seconds to read; Cassandra throws a TombstoneOverwhelmingException at 100,000 by default.

Tombstone-heavy workloads need scheduled compaction. The default compaction strategy assumes a roughly even mix of writes and reads. A queue table or a soft-delete table breaks that assumption. Most of the bytes on disk are tombstones, and the engine never gets around to dropping them. The fixes, in order: (1) raise gc_grace_seconds only as high as your repair window requires; (2) switch to TWCS so whole files drop at TTL expiry; (3) run scheduled nodetool compact in off-hours. Don't run major compactions during peak; they pin disk bandwidth for hours.

Tuning levers

The knobs worth knowing, by engine:

EngineKnobWhat it does
Postgresautovacuum_max_workersHow many tables can be vacuumed in parallel (default 3, too few for >100 hot tables).
Postgresautovacuum_vacuum_scale_factorPer-table override; lower for high-churn tables.
Postgresautovacuum_vacuum_cost_limitThrottles vacuum I/O; raise on fast SSDs (default 200 is HDD-era).
RocksDBlevel_compaction_dynamic_level_bytesLets per-level sizes float to keep WAF bounded across data growth.
RocksDBmax_background_compactionsCompaction thread count; match to CPU and disk bandwidth.
RocksDBlevel0_slowdown_writes_triggerWhen to start throttling writes if compaction falls behind.
Cassandracompaction_throughput_mb_per_secPer-node MB/s cap on compaction I/O; 0 means uncapped.
Cassandraconcurrent_compactorsHow many compactions can run at once per node.
Cassandratombstone_threshold / tombstone_compaction_intervalTriggers compaction on SSTables with too many tombstones.

Default values for all of these assume mid-2010s hardware and modest scale. On modern NVMe with multi-TB tables, every default in the table above is wrong by at least one order of magnitude.

Further reading

Found this useful?