LSM-Trees Explained — Memtable, SSTables, Compaction, RocksDB

Two parts — memory and disk.

An LSM has a small, mutable, in-memory part (the memtable) and a large, immutable, on-disk part (a stack of SSTables — Sorted String Tables). Writes always land in the memtable. When the memtable fills, it's frozen, written to disk as a new SSTable, and a fresh memtable starts.

Writes  → [ memtable (skip-list, ~64MB) ]   ← active
              ↓ (full → flush)
           [ SSTable: L0/0001.sst    ]   ← immutable
           [ SSTable: L0/0002.sst    ]
           [ SSTable: L1/aaaaa.sst   ]   ← compacted
           [ SSTable: L1/bbbbb.sst   ]
           [ SSTable: L2/...         ]
                       ⋮

Two things make this work in production: (1) every write also goes to a write-ahead log so a crash before flush doesn't lose data, and (2) the memtable is usually a skip-list: sorted, lock-friendly, constant-time concurrent insert.

An SSTable — on disk.

An SSTable is exactly what it says: keys and values, sorted by key, written sequentially. Once written, it's never modified. The file layout is roughly:

+------------------------+
|  data block 0          |  ← sorted key/value pairs
|  data block 1          |
|  data block N          |
+------------------------+
|  index block           |  ← key → block offset
+------------------------+
|  <a class="il" href="/simulators/bloom-filter">bloom filter</a>          |  ← probabilistic "key in file?"
+------------------------+
|  metadata block        |  ← min/max key, timestamps
+------------------------+
|  footer                |  ← magic + offsets to above
+------------------------+

A point lookup needs at most one disk read for the data block, after the bloom filter and the index block (both small enough to keep in RAM). A range scan walks data blocks sequentially, which is the access pattern flash and HDD both reward.

Reads — visit every level.

A point lookup for key K proceeds:

Check the memtable. Hit? Return.
Check each immutable memtable (the ones that haven't flushed yet). Hit? Return.
For each SSTable, in newest-to-oldest order:
- Check the bloom filter. Definitely not present? Skip.
- Look up K in the index block.
- Read the data block; binary search for K.

This is read amplification: a single logical read may touch many SSTables. Bloom filters cut wasted I/O on misses. Compaction cuts the number of SSTables. The two are the only knobs that matter for read latency.

Compaction — the whole game.

Without compaction, SSTables accumulate forever and read amplification grows linearly. Compaction picks SSTables, merges them by key, drops obsolete versions and tombstones, and writes one (or several) replacement SSTables. The two strategies that ship in production:

Strategy	How it works	Trade-off
Leveled (RocksDB default, LevelDB)	Each level is N× larger than the previous. Each non-L0 level has non-overlapping key ranges. Compaction picks one file from level i and overlapping files from level i+1.	Low space amp, high write amp (each level rewritten ~10×).
Tiered (Cassandra default, RocksDB universal)	Each level holds N similarly-sized files. When N files exist at level i, they merge into one file at level i+1.	Low write amp, high space amp (multiple full copies coexist mid-compaction).

The RUM conjecture. You can pick at most two of read amplification, update (write) amplification, memory amplification. Leveled trades write-amp for read-amp. Tiered trades read-amp for write-amp. Hybrid policies try to straddle. There is no compaction strategy that wins on all three.

Tombstones — deletion is just a write.

Deletes in an LSM are writes too. A tombstone entry that says "this key is deleted as of timestamp T". The tombstone shadows older values during reads.

A tombstone can only be physically removed when the LSM is sure no older SSTable containing the same key still exists at any level. So tombstones often live for a long time, and a workload that produces many short-lived keys can pile up tombstones that slow scans badly. Cassandra has the famous "tombstone scan" pathology where a partition full of expired tombstones takes seconds to scan. The fix: run major compactions (or use the right TTL/grace settings).

Bloom filters — cheap, lossy "no".

A bloom filter is a small bitmap and a few hash functions. add(key) sets K bits; contains(key) checks them. False negatives are impossible; false positives have a tunable rate (typically 1%).

Every SSTable carries one. A point lookup that the filter rejects skips the file entirely. With a 1% false-positive rate, you needlessly read < 1% of files. With 0.1% (more bits), you waste 0.1%. Memory cost is roughly 10 bits per key for 1%, so a billion-key SSTable needs ~1.2 GB of bloom, too much to keep in RAM, so most engines tier their bloom filters too.

LSM vs B+ tree — when each wins.

Workload	Winner	Why
Random-key heavy writes	LSM	Sequential disk writes, no random page splits
Read-mostly with hot working set	B+ tree	One disk seek per lookup; LSM may visit many SSTables
Range scans over stable data	B+ tree	Linked-leaf walk vs merge across levels
Time-series append + range scan	LSM	Sequential writes; ranges land mostly in newest SSTables
Heavy delete + reinsert workload	Neither happily	B+ leaks index pages; LSM accumulates tombstones
Tiny embedded database	B+ tree (LMDB)	Single-file, no compaction overhead, small footprint

LSM-trees

Two parts — memory and disk.

An SSTable — on disk.

Reads — visit every level.

Compaction — the whole game.

Tombstones — deletion is just a write.

Bloom filters — cheap, lossy "no".

LSM vs B+ tree — when each wins.

Further reading.

01 — B-trees