LSM-trees
The Log-Structured Merge tree, O'Neil et al. 1996, is the storage shape underneath every major NoSQL database: RocksDB, LevelDB, Cassandra, ScyllaDB, HBase, Pebble. The pitch is to turn random writes into sequential writes by buffering in memory and merging on disk. The trade: amortise read cost across multiple files. The whole structure earns its keep on flash.
Two parts — memory and disk.
An LSM has a small, mutable, in-memory part (the memtable) and a large, immutable, on-disk part (a stack of SSTables — Sorted String Tables). Writes always land in the memtable. When the memtable fills, it's frozen, written to disk as a new SSTable, and a fresh memtable starts.
Writes → [ memtable (skip-list, ~64MB) ] ← active
↓ (full → flush)
[ SSTable: L0/0001.sst ] ← immutable
[ SSTable: L0/0002.sst ]
[ SSTable: L1/aaaaa.sst ] ← compacted
[ SSTable: L1/bbbbb.sst ]
[ SSTable: L2/... ]
⋮Two things make this work in production: (1) every write also goes to a write-ahead log so a crash before flush doesn't lose data, and (2) the memtable is usually a skip-list: sorted, lock-friendly, constant-time concurrent insert.
An SSTable — on disk.
An SSTable is exactly what it says: keys and values, sorted by key, written sequentially. Once written, it's never modified. The file layout is roughly:
+------------------------+
| data block 0 | ← sorted key/value pairs
| data block 1 |
| data block N |
+------------------------+
| index block | ← key → block offset
+------------------------+
| <a class="il" href="/simulators/bloom-filter">bloom filter</a> | ← probabilistic "key in file?"
+------------------------+
| metadata block | ← min/max key, timestamps
+------------------------+
| footer | ← magic + offsets to above
+------------------------+A point lookup needs at most one disk read for the data block, after the bloom filter and the index block (both small enough to keep in RAM). A range scan walks data blocks sequentially, which is the access pattern flash and HDD both reward.
Reads — visit every level.
A point lookup for key K proceeds:
- Check the memtable. Hit? Return.
- Check each immutable memtable (the ones that haven't flushed yet). Hit? Return.
- For each SSTable, in newest-to-oldest order:
- Check the bloom filter. Definitely not present? Skip.
- Look up
Kin the index block. - Read the data block; binary search for
K.
This is read amplification: a single logical read may touch many SSTables. Bloom filters cut wasted I/O on misses. Compaction cuts the number of SSTables. The two are the only knobs that matter for read latency.
Compaction — the whole game.
Without compaction, SSTables accumulate forever and read amplification grows linearly. Compaction picks SSTables, merges them by key, drops obsolete versions and tombstones, and writes one (or several) replacement SSTables. The two strategies that ship in production:
| Strategy | How it works | Trade-off |
|---|---|---|
| Leveled (RocksDB default, LevelDB) | Each level is N× larger than the previous. Each non-L0 level has non-overlapping key ranges. Compaction picks one file from level i and overlapping files from level i+1. | Low space amp, high write amp (each level rewritten ~10×). |
| Tiered (Cassandra default, RocksDB universal) | Each level holds N similarly-sized files. When N files exist at level i, they merge into one file at level i+1. | Low write amp, high space amp (multiple full copies coexist mid-compaction). |
Tombstones — deletion is just a write.
Deletes in an LSM are writes too. A tombstone entry that says "this key is deleted as of timestamp T". The tombstone shadows older values during reads.
A tombstone can only be physically removed when the LSM is sure no older SSTable containing the same key still exists at any level. So tombstones often live for a long time, and a workload that produces many short-lived keys can pile up tombstones that slow scans badly. Cassandra has the famous "tombstone scan" pathology where a partition full of expired tombstones takes seconds to scan. The fix: run major compactions (or use the right TTL/grace settings).
Bloom filters — cheap, lossy "no".
A bloom filter is a small bitmap and a few hash functions. add(key) sets
K bits; contains(key) checks them. False negatives are impossible; false
positives have a tunable rate (typically 1%).
Every SSTable carries one. A point lookup that the filter rejects skips the file entirely. With a 1% false-positive rate, you needlessly read < 1% of files. With 0.1% (more bits), you waste 0.1%. Memory cost is roughly 10 bits per key for 1%, so a billion-key SSTable needs ~1.2 GB of bloom, too much to keep in RAM, so most engines tier their bloom filters too.
LSM vs B+ tree — when each wins.
| Workload | Winner | Why |
|---|---|---|
| Random-key heavy writes | LSM | Sequential disk writes, no random page splits |
| Read-mostly with hot working set | B+ tree | One disk seek per lookup; LSM may visit many SSTables |
| Range scans over stable data | B+ tree | Linked-leaf walk vs merge across levels |
| Time-series append + range scan | LSM | Sequential writes; ranges land mostly in newest SSTables |
| Heavy delete + reinsert workload | Neither happily | B+ leaks index pages; LSM accumulates tombstones |
| Tiny embedded database | B+ tree (LMDB) | Single-file, no compaction overhead, small footprint |
Further reading.
- O'Neil et al. (1996) — The Log-Structured Merge-tree — the original paper, annotated on this site.
- LevelDB design — Sanjay Ghemawat & Jeff Dean's small reference implementation, the design every modern LSM borrows from.
- RocksDB wiki — the production-grade descendant; comprehensive docs on every compaction strategy.
- Stoica & Ailamaki — Improving Flash Write Performance by Using Update Frequency — the modern follow-up on flash-aware compaction.
- Petrov — Database Internals, ch. 7 — the most readable LSM walk-through outside the source code.
- Semicolony — How SSDs work — FTL, NAND cell types, write amplification. Why LSM trees evolved on flash and not on disk: their append-mostly write pattern keeps WAF near 1.0 on the underlying NAND.
- Semicolony — How CPU caches work — hot SSTable index pages live in L2/L3; cold ones miss to DRAM. The 80 ns DRAM tax shows up directly in your tail latency.