Vacuum & compaction
MVCC databases like Postgres keep old row versions for snapshot reads. LSM databases like RocksDB and Cassandra append new values without touching the old ones. Both designs trade write speed for a debt that has to be paid in the background. Vacuum and compaction pay that debt. When they fall behind, production breaks.
Why dead data accumulates
Both MVCC and LSM are no-overwrite designs. They optimise the write path by refusing to modify data in place, and they hand the reclamation problem to a background process.
In Postgres MVCC, an UPDATE never overwrites the old row.
It writes a new tuple and marks the old one with xmax, the transaction ID
that obsoleted it. The old tuple stays on disk because some long-running snapshot might
still need to see it. A DELETE is the same shape: the row is marked, not
erased.
In an LSM tree, an update writes a new key/value pair to the memtable, which eventually lands in a fresh SSTable. The older value, sitting in some older SSTable at a lower level, is still on disk. Reads find the newest version first and stop there, but the old bytes are real.
Without reclamation, both engines bloat without bound. Postgres tables grow until the heap is mostly dead tuples; LSM trees grow until every read has to walk through every level. The work to keep them bounded is what this page is about.
Postgres VACUUM
VACUUM is the Postgres process that walks a relation, identifies tuples
whose xmax is older than every running transaction, and marks the space
they occupy as reusable in the free space map. It does three other things along the way
that matter just as much:
- Updates the visibility map so future queries can skip pages where every tuple is visible to all transactions. This is what makes index-only scans possible.
- Freezes old transaction IDs by replacing them with a special
FrozenXIDmarker, so the tuple becomes visible to every future transaction without consulting the XID clock. - Updates planner statistics if you ran
VACUUM ANALYZE.
AUTOVACUUM runs VACUUM in the background. It wakes up every
autovacuum_naptime (default 1 minute), looks at pg_stat_all_tables,
and launches a worker for any table where dead tuples exceed
autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor × n_live_tup.
VACUUM vs VACUUM FULL
Regular VACUUM marks dead-tuple space as reusable inside existing pages.
The on-disk file size doesn't change; new inserts fill the holes. This is usually what
you want: it's online, it takes a non-blocking SHARE UPDATE EXCLUSIVE lock,
and it runs continuously without operator attention.
VACUUM FULL rewrites the entire table into a fresh file with no dead
tuples, then swaps the relfilenode. The on-disk size shrinks to the live-tuple size.
The cost: an ACCESS EXCLUSIVE lock for the duration, which blocks every
read and write. Only run it in a maintenance window, and only when bloat is so bad
that the reusable-space approach can't catch up.
The pg_repack and pg_squeeze extensions do the same shrink-on-disk work without the exclusive lock by keeping a shadow table and swapping at the end. They are the standard production answer when bloat needs to actually go away.
The bloat problem
Autovacuum tuning is the single most common Postgres-in-production issue. The default
autovacuum_vacuum_scale_factor of 0.2 means a table is
only vacuumed once 20% of its rows are dead. For a 1-million-row table that's fine.
For a 1-billion-row table that's 200 million dead tuples accumulating before a single
vacuum runs, and the vacuum itself takes hours.
The standard fix is per-table overrides on the hot tables:
ALTER TABLE orders SET (
autovacuum_vacuum_scale_factor = 0.02,
autovacuum_vacuum_threshold = 1000,
autovacuum_analyze_scale_factor = 0.01
);To see how bloated a table actually is, the pgstattuple extension reports
the live-vs-dead ratio exactly (at the cost of a full scan), and
pg_stat_user_tables.n_dead_tup gives a free estimate that's good enough
for monitoring. A healthy ratio sits below 20%; anything above 50% is a tuning
conversation.
XID wraparound
Postgres transaction IDs are 32-bit. The space wraps every ~4 billion transactions, and
MVCC uses modular arithmetic on a 2-billion-transaction window to decide which XIDs are
in the past and which are in the future. If a tuple's xmin falls more than
2 billion transactions behind the current XID, the database can no longer tell whether
the row is visible.
Vacuum's freeze step prevents this: it rewrites old xmin values to the
special FrozenXID sentinel, which is always visible. If autovacuum can't
keep up, usually because long-running transactions block it, or because vacuum
throughput is capped too low, the database approaches wraparound. At
autovacuum_freeze_max_age (200M by default) Postgres launches forced
anti-wraparound vacuums; at 2B it shuts down writes entirely and demands a single-user
vacuum.
xmin horizon, vacuum can't
advance, XID age climbs, and by the time alerting fires you have hours of forced
vacuum ahead of you. Alert on pg_database.datfrozenxid age, not on
vacuum lag.LSM compaction strategies
LSM compaction is the mirror-image problem. Where MVCC has dead row versions, LSM has obsolete SSTable entries: older keys shadowed by newer writes, tombstones, expired TTL data. Compaction reads N SSTables, merges them by key, drops the obsolete versions, and writes M (usually M < N) replacement files. The strategy you pick decides where on the read/write/space-amp triangle you land:
| Strategy | Where it lives | Read amp | Write amp |
|---|---|---|---|
| Leveled (LCS) | RocksDB default, Cassandra LCS | Low; non-overlapping ranges per level | High; each level rewritten ~10× |
| Size-tiered (STCS) | Cassandra default, RocksDB universal | High; multiple overlapping files per tier | Low; merge only when N similar-sized files exist |
| Time-window (TWCS) | Cassandra time-series | Low for time-range reads | Low; files segregated by write time, dropped whole on TTL |
Leveled fits read-heavy OLTP workloads on flash: read latency is bounded, the storage footprint is small. Size-tiered fits write-heavy ingest where read latency matters less. Time-window fits append-only metrics where the cleanup story is "drop the whole file when it ages out".
Write amplification
Every record in a leveled LSM is rewritten on every level it passes through. With 10 levels and a per-level fan-out of 10, a single logical write turns into roughly 10 physical writes to the underlying storage. RocksDB reports this as the write amplification factor (WAF), and in production it's usually between 5 and 30 depending on workload and level count.
This matters because SSDs have finite write endurance, usually rated in drive writes per day (DWPD) over a 3 or 5 year warranty. A consumer TLC drive at 0.3 DWPD with a database doing WAF=20 has effectively 0.015 DWPD of logical write budget. Pick the wrong compaction strategy on the wrong drive class and you burn out the disk before the warranty is up.
** Compaction Stats [default] **
Level Files Size Score Read(GB) Write(GB) W-Amp
L0 4/0 256 MB 0.8 0.0 12.4 1.0
L1 5/0 1.0 GB 1.0 12.4 24.8 2.0
L2 51/0 10.1 GB 1.0 24.8 49.7 2.0
L3 510/0 100.0 GB 1.0 49.7 99.3 2.0
L4 5102/0 1.0 TB 1.0 99.3 198.6 2.0
Sum 5672/0 1.1 TB - 186.2 384.8 9.4Sum W-Amp 9.4 reads as: for every 1 GB of new application data, 9.4 GB hit the disk over its lifetime in the tree. That's the number to track.
Compaction stalls
Compaction runs on a thread pool with bounded throughput. If writes arrive faster than compaction can keep up, the level-0 file count grows, the read amplification climbs, and eventually the engine has to push back on writers.
RocksDB exposes two thresholds:
level0_slowdown_writes_trigger (default 20) starts artificially delaying
writes when L0 has that many files;
level0_stop_writes_trigger (default 36) blocks them entirely until
compaction catches up. Cassandra has the equivalent in
compaction_throughput_mb_per_sec. Set it too low and write latency
spikes during ingest bursts.
The diagnostic is always the same: write latency suddenly jumps from microseconds to seconds, and the metrics show pending compactions climbing without bound. The fix is either to throttle ingest, raise compaction throughput (more threads, higher MB/s cap), or, most often, to scale the cluster.
Tombstones
Deletions in an LSM are just writes. A delete inserts a tombstone: a key with a marker that says "this is gone as of timestamp T". The tombstone shadows older values during reads and only gets physically removed when compaction can prove no older SSTable at any level still holds the key.
That proof is expensive, so tombstones live a long time. A workload that deletes
heavily (TTL'd queues, soft-deleted user records, expired sessions) can fill
SSTables with tombstones that slow scans badly. Cassandra's well-known
"tombstone scan" pathology has a partition with millions of expired tombstones taking
seconds to read; Cassandra throws a TombstoneOverwhelmingException at
100,000 by default.
gc_grace_seconds only as high as your repair window requires;
(2) switch to TWCS so whole files drop at TTL expiry; (3) run scheduled
nodetool compact in off-hours. Don't run major compactions during peak;
they pin disk bandwidth for hours.Tuning levers
The knobs worth knowing, by engine:
| Engine | Knob | What it does |
|---|---|---|
| Postgres | autovacuum_max_workers | How many tables can be vacuumed in parallel (default 3, too few for >100 hot tables). |
| Postgres | autovacuum_vacuum_scale_factor | Per-table override; lower for high-churn tables. |
| Postgres | autovacuum_vacuum_cost_limit | Throttles vacuum I/O; raise on fast SSDs (default 200 is HDD-era). |
| RocksDB | level_compaction_dynamic_level_bytes | Lets per-level sizes float to keep WAF bounded across data growth. |
| RocksDB | max_background_compactions | Compaction thread count; match to CPU and disk bandwidth. |
| RocksDB | level0_slowdown_writes_trigger | When to start throttling writes if compaction falls behind. |
| Cassandra | compaction_throughput_mb_per_sec | Per-node MB/s cap on compaction I/O; 0 means uncapped. |
| Cassandra | concurrent_compactors | How many compactions can run at once per node. |
| Cassandra | tombstone_threshold / tombstone_compaction_interval | Triggers compaction on SSTables with too many tombstones. |
Default values for all of these assume mid-2010s hardware and modest scale. On modern NVMe with multi-TB tables, every default in the table above is wrong by at least one order of magnitude.
Further reading
- Postgres — Routine vacuuming The canonical reference for VACUUM, autovacuum, and freeze management.
- RocksDB tuning guide Facebook's production tuning playbook for compaction, write buffers, and cache.
- Cassandra — Compaction LCS, STCS, TWCS, and when to pick which.
- 2ndQuadrant — Managing Freezing in PostgreSQL A deep dive on freeze tuning and the wraparound pathology.
- Heroku — Postgres essentials Operational lessons from a fleet that has been bitten by every vacuum failure mode.
- Semicolony — LSM-trees The structure compaction is operating on. Read this first if the SSTable layout isn't familiar.
- Semicolony — How SSDs work Write amplification on the database side compounds with write amplification on the NAND side. Both matter for drive lifetime.