An LSM tree (log-structured merge-tree) is a write-optimised storage data structure that buffers writes in an in-memory sorted table (the memtable), flushes full memtables to immutable on-disk segments (SSTables), and periodically merges those segments (compaction) to bound read cost. It is used by RocksDB, LevelDB, Cassandra, ScyllaDB, HBase, Bigtable, and the storage layer of DynamoDB.

What is the difference between an LSM tree and a B-tree?

B-trees update in place — each write seeks and modifies a page, giving low read amplification but high write amplification. LSM trees only ever append — writes are sequential into the memtable and then flushed sequentially as SSTables — giving low write amplification but higher read amplification because a single key may live in several SSTables. Bloom filters on each SSTable cut the read cost dramatically. B-trees win for mixed read-write OLTP; LSM trees win for write-heavy, time-series, and append-mostly workloads.

An SSTable (Sorted String Table) is an immutable on-disk file containing key-value pairs sorted by key, plus an index block and (typically) a bloom filter. The format originated in Google's Bigtable paper and was carried into LevelDB, RocksDB, and Cassandra. Because SSTables are immutable, writers never lock readers, and the file format is friendly to compression and checksumming.

What is write amplification in an LSM tree?

Write amplification is the ratio of bytes physically written to disk over bytes inserted by the user. For an LSM tree with levelled compaction, each key is rewritten roughly L × T times, where L is the number of levels and T is the size ratio between levels — typically 10–30× total. Tiered compaction lowers write amplification but raises space amplification by keeping multiple overlapping SSTables per level.

New to this? · ELI5 · 1 min Read B-tree vs LSM-tree explained simply, in plain English

LSM Tree Simulator: when writes outpace reads.

A log-structured merge-tree buffers writes in memory, flushes them as sorted immutable segments, and merges those segments later through compaction, so writes stay sequential and fast while reads pay the cost. It's the structure RocksDB, Cassandra, and every write-heavy datastore reaches for.

Memtable

0/8

SSTables

Write amp

1.00×

Space amp

1.00×

Write

Read

Engine

Presets

Memtable (in-memory · skip list)

— empty —

SSTables on disk

0 tables · overlapping

— empty —

0 tables · disjoint

— empty —

0 tables · disjoint

— empty —

last read path last write sstable bloom bit set

Write amp

1.00×

0 B written / 0 B inserted

Read amp

0.00

0 tables / 0 reads

Space amp

1.00×

0 entries / 0 unique keys

Trace

— quiet —

Why append wins

A random update in a B-tree seeks and rewrites one page. A random update here goes into the memtable (a sequential RAM write) and eventually flushes sequentially to disk. Modern SSDs love sequential writes, hate random ones. That single design choice is why RocksDB exists.

Compaction is the cost

What you save on the write path you pay back during compaction. Each level rewrites every byte it holds, multiple times over the key's lifetime. The total write amplification is roughly L × T (levels times size ratio), usually 10–30× for production tuning.

Bloom filters everywhere

Each SSTable carries a small in-memory bloom filter. Before opening the file, the engine checks the filter: if it says "no", skip; if it says "maybe", load. About 10 bits per key buys a 1% false-positive rate, turning a miss from "check every SSTable" into "check almost none".

Tombstones, not deletes

You can't mutate immutable files. Deletes are inserts of a special "tombstone" marker. The marker propagates through compactions and finally vanishes when it reaches the bottom level, at which point no older copy of the key could still resurrect it.

Adjacent

What you're looking at

The top band is the memtable, your in-memory write buffer. Below it sit three on-disk levels of SSTables: L0 holds overlapping tables fresh off a flush, while L1 and L2 are sorted into disjoint key ranges. Each table shows its key range, a sample of its keys, and a row of dots that is its bloom filter. Put writes a key, Get reads one, and the three meters track write amplification (bytes written over bytes inserted), read amplification (tables touched per read), and space amplification (total entries over unique keys).

Hit Random 40 and watch a put: it lands in the memtable instantly, and every eighth one flushes a new L0 table with no extra cost to you. That is the fast write. Now Get a key and watch the read path light up table after table, top to bottom, until the bloom filters let it skip the rest. The surprise is the bill arriving late: compaction rewrites the same bytes again and again, so write amp climbs well past 1x for data you only inserted once.

What is an LSM tree?

The data structure modern databases use when write throughput matters.

An LSM tree (log-structured merge-tree) is a write-optimised storage data structure. The shape is simple: writes go into an in-memory sorted table called the memtable; when it fills, the engine flushes it to an immutable on-disk file called an SSTable; over time, a background process called compaction merges SSTables to keep read cost bounded. There are no in-place updates anywhere. Every byte the engine ever writes is sequential.

That sounds like an absurd indulgence. Surely the cost of"never overwrite, always append" must show up somewhere, and it does, in two places. Read cost rises because a single key may live in several SSTables (the engine has to check them in newest-first order until it finds the latest version), and total disk usage rises because old copies linger until compaction reclaims them. The LSM tree's design is a deliberate trade: pay more on reads and storage so writes are nearly free.

For workloads where writes dominate (telemetry, time-series, append-mostly logs, append-mostly KV stores, write-heavy NoSQL) that trade is the right one. RocksDB, LevelDB, Apache Cassandra, ScyllaDB, HBase, Google Bigtable, Amazon DynamoDB's storage layer, MyRocks, CockroachDB's Pebble fork, TiKV, InfluxDB's TSI, Prometheus' TSDB, Elasticsearch's translog: every one of these is built around an LSM tree (or a close variant). The shape is forty years old and dominates the write-heavy half of the database world.

The simulator above shows the three layers at work. Use Put to add a key; it lands in the memtable, drawn at the top. Watch the memtable fill; when it crosses the threshold, it flushes into L0 as a new SSTable. Click Compact L0 → L1 to merge the L0 tables into L1 (where ranges become disjoint), and again to push L1 into L2. Use Get to trace a read: the engine consults the memtable first, then each SSTable in newest-first order, checking the bloom filter before touching the file. The trace log records every step.

LSM vs B-tree: the trade-off

Two answers to the same question; the workload picks the winner.

The B-tree is the alternative for read-heavy workloads. It is the index every relational database (Postgres, MySQL InnoDB, Oracle, SQLite) uses by default, and it has held that position for fifty years. So why is the LSM tree even interesting? Because the B-tree's read-friendliness comes at the cost of random writes, and random writes on modern storage are not free.

A B-tree updates in place: every write seeks the correct leaf page and modifies it. A million scattered updates touch a million pages — a million random I/Os. On spinning disks that was catastrophic; on SSDs it's expensive (the flash translation layer relocates pages internally to wear-level, multiplying writes 5–20× at the firmware layer). An LSM tree never seeks. Writes accumulate in memory and flush sequentially. Random writes from the application become sequential writes to the device.

The cost shows up on reads. A B-tree read is one path from root to leaf: log_fanout(n) page reads, usually 3–4. An LSM read may have to consult the memtable, several L0 SSTables (which can overlap), and one SSTable per deeper level: a dozen file probes in the worst case before bloom filters kick in. With well-tuned bloom filters, most of those probes become cheap memory checks rather than disk reads, but the worst-case read still touches more than the B-tree.

Property	B-tree	LSM tree
Write pattern	in-place page mutation	append to memtable; flush sequentially
Write amplification	~1×	10–30× (depends on compaction)
Read amplification	~1× (one path)	1–10× (multiple SSTables, mitigated by bloom)
Space amplification	~1.5× (fragmentation)	1.1–2× (overlapping copies, tombstones)
Range scan	fast (leaf sibling links)	slower (merge across SSTables)
Best at	OLTP, mixed read-write	writes, time-series, append-mostly
Used by	Postgres, InnoDB, Oracle, SQLite, LMDB	RocksDB, Cassandra, ScyllaDB, LevelDB, HBase

If you're choosing for a real system: if your workload is OLTP with mixed reads and writes (typically more reads than writes) a B-tree is almost certainly the right answer; that is what every relational engine ships. If your workload is logs, time-series, telemetry, queues, or write-heavy KV (where the read pattern is predominantly key-based and writes dominate) an LSM tree is the right answer; that is what every modern write-heavy datastore ships. See the storage engine simulator for a head-to-head.

Origins: Patrick O'Neil et al., 1996

The paper that named the shape.

The LSM tree was introduced by Patrick O'Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O'Neil in The Log-Structured Merge-Tree (LSM-Tree), published in Acta Informatica vol. 33 in 1996. The motivating problem was the cost of indexing high-volume insert streams in OLTP and historical-data systems where the working set didn't fit in memory: billing records, call-detail records, audit logs. A B-tree index on a multi-billion-row append-mostly table was prohibitively slow on the hardware of the era: every insert was a random disk write.

The original paper described an LSM tree as a cascade of components C₀, C₁, ..., Cₖ, where C₀ is in memory and each subsequent Cᵢ lives on disk and is roughly r times larger than the one above. Inserts always land in C₀. When C₀ exceeds its size threshold, a rolling merge migrates entries down through the levels in batches. The mathematical contribution of the paper was to show that, with optimal level ratios, this design achieves lower total I/O for write-heavy workloads than a B-tree, because the amortised cost of sequential writes far outweighs the extra reads needed during merges.

The shape sat largely dormant in academia for a decade until Google's Bigtable (Chang et al., OSDI 2006) shipped a production LSM-style storage layer underneath its distributed table service, introducing the term SSTable as the on-disk file format. LevelDB (Sanjay Ghemawat and Jeff Dean, open-sourced 2011) extracted Bigtable's single-machine storage engine as a standalone embedded KV store. RocksDB (Facebook, forked from LevelDB in 2012) hardened LevelDB for production multi-threaded workloads on SSDs. Apache Cassandra (2008), HBase (2010), and ScyllaDB (2015) built distributed databases on the LSM model. Today the LSM tree is the dominant shape for any write-heavy datastore.

The three amplifications: write, read, space

Every storage engine has a triangle. LSM tuning moves the corners.

Storage-engine performance is dominated by three ratios. Write amplification is bytes physically written to disk divided by bytes the application asked to insert. Read amplification is the number of I/Os (or SSTables consulted) per logical read. Space amplification is total bytes on disk divided by bytes of live data. You cannot minimise all three simultaneously: the design space is a hard triangle, and every engine picks its corner.

For an LSM tree with levelled compaction, the math is concrete. Let L = number of levels, T = size ratio between adjacent levels (typically 10), and N = total data size. Write amplification is roughly L × T in the worst case (each key is rewritten that many times as it migrates down), or L on average. With L = 5 and T = 10, expect 30–50× write amplification end-to-end. That sounds catastrophic — but each rewrite is sequential, which is cheap, and the alternative (random B-tree page writes) hits a worse hidden multiplier at the SSD firmware layer.

Read amplification for a point lookup is at most L + 1 SSTables (memtable plus one per level), but bloom filters drop this dramatically. With well-sized bloom filters (~10 bits/key, 1% false-positive rate), the expected number of disk-touching SSTable reads per miss is below 1.05. The miss case is the common case for most workloads, so bloom filters are doing most of the work.

Space amplification depends on the compaction strategy. Levelled compaction keeps each level's keys disjoint and immediately overwrites older copies: space amp is typically 1.1–1.3× of live data. Tiered compaction keeps multiple overlapping SSTables per level and merges them only occasionally: space amp can climb to 2–3×, but write amp drops correspondingly. The choice is a knob the database operator gets to turn.

Amplification	Definition	LSM, levelled	LSM, tiered	B-tree
Write	bytes_written / bytes_in	10–30×	3–10×	~1× (+ FTL)
Read (no bloom)	tables / lookup	L+1	L × T	1
Read (with bloom)	disk_io / lookup	~1	~1	1
Space	bytes_on_disk / live	1.1–1.3×	2–3×	1.3–1.5×

Memtable: the in-memory write buffer

Where writes land before they reach disk.

The memtable is the LSM tree's only mutable component. Every write lands here first; reads check it before consulting any on-disk file. Because all writes hit a single in-memory structure, the memtable must be fast to insert, fast to point-read, and fast to iterate in key order (because the flush walks it sorted).

In practice almost every production LSM tree uses a skip list for the memtable. RocksDB defaults to its SkipListMemTableFactory; LevelDB uses a skip list; Cassandra's memtable is a Java ConcurrentNavigableMap backed by a skip list. The skip list offers O(log n) insert, point-lookup, and ordered iteration with constants comparable to a balanced BST, while being much friendlier to concurrent inserts (lock-free CAS-based insertion is simpler to write than for a red-black tree or B-tree).

Why not a B-tree for the memtable? Because the memtable lives in RAM, not on disk, the B-tree's cache-line and page-aligned story doesn't pay. The skip list's expected logarithmic structure is good enough, and its concurrency story (William Pugh's 1990 paper plus subsequent lock-free variants) is markedly cleaner. RocksDB exposes alternative memtable factories (hash-skip-list, hash-linked-list, vector) for specialised workloads, but the default skip list dominates production deployments.

The memtable also writes to a write-ahead log on commit, so a crash before flush doesn't lose data. The WAL is sequential and append-only; on startup the engine replays it into a fresh memtable. The LSM model and the WAL are independent layers (the LSM is a storage data structure, the WAL is a durability mechanism) but every production LSM engine pairs them.

# <a class="il" href="/simulators/skip-list">Skip list</a> — the canonical memtable structure (William Pugh, 1990)
# Each node has a random "height"; pointers at level i skip
# 2^i entries on average. Expected O(log n) search; supports
# concurrent inserts with simple CAS-based linking.
#
# RocksDB default:        SkipListMemTableFactory
# LevelDB:                arena-allocated skiplist
# Cassandra:              ConcurrentSkipListMap (Java stdlib)
#
# Memtable lives in RAM; backed by a WAL on disk for durability.
# On flush, the entire memtable is iterated in key order and
# written out as a single SSTable.

SSTable: the immutable on-disk file

Sorted, immutable, indexed, often compressed.

An SSTable (Sorted String Table) is the on-disk file format that backs every LSM tree. The term originated in the Bigtable paper (Chang et al., 2006). The format has four parts: a sequence of compressed data blocks, a data index (one entry per block, pointing at the block's offset and first key), a filter block (the bloom filter), and a footer with metadata and checksum. SSTables are immutable from the moment they're written. They are rewritten only by compaction, never modified in place.

Data blocks are typically 4–64 KB and compressed with Snappy, LZ4, or Zstandard. Compression is per-block, not per-file, so a point read decompresses one block (often a few KB) rather than the whole file. Within a block, keys share prefixes via restart-point delta encoding: a 32-byte URL key may take only 2–4 bytes after the first occurrence in the block. Production engines see 3–5× compression ratios on typical relational and document payloads.

The data index is a sparse summary of the file: one entry per data block, not per key. A read against an SSTable walks the index in memory (binary search) to find the candidate block, then reads exactly one block from disk. A 256 MB SSTable with 4 KB blocks has 64,000 index entries, comfortably small enough to keep in memory. The bloom filter sits alongside; it's checked before the index, so a miss never even reads the index.

Immutability is the key property. Because no one rewrites an SSTable, no one needs to lock it. Readers can stream from it concurrently with no coordination. Crash safety is trivial: if a flush is interrupted, the partial SSTable is discarded; the memtable still has the data (it's also in the WAL); the next flush will produce a complete file. This single design choice (immutable on-disk files) eliminates whole categories of concurrency bugs that plague B-tree engines.

# SSTable file layout (LevelDB/RocksDB format)
# +--------------------------------------+
# | [data block 1] [data block 2] ...    |  ← compressed, 4-64 KB each
# | ...                                  |
# | [filter block: <a class="il" href="/simulators/bloom-filter">bloom filter</a>]         |  ← 10 bits/key typical
# | [meta-index block]                   |
# | [data index block]                   |  ← one entry per data block
# | [footer: magic + offsets + checksum] |
# +--------------------------------------+
#
# Read path: footer → bloom → data index → one block → decompress → linear search

Compaction strategies: levelled, tiered, universal

The single biggest knob on the read-vs-write trade-off.

Levelled compaction is the default in RocksDB and LevelDB. Each level (L1 and below) holds disjoint key ranges across its SSTables: at most one SSTable per level contains any given key. When level L grows past its target size, the engine picks one SSTable from L, finds the overlapping range in L+1, and rewrites the union as a fresh set of SSTables at L+1. The disjoint invariant gives low space amplification (~1.1–1.3×) and bounded read amplification (~L+1 SSTables maximum per read), at the cost of high write amplification (~L × T).

The math for levelled write amplification is approximately write_amp ≈ L × T, where L is the number of levels and T is the size ratio between adjacent levels (typically 10). For a 1 TB dataset with T=10 and L0 at 256 MB, you have L ≈ log₁₀(10⁶) ≈ 6 levels; total write amplification is roughly 60×. Sounds terrible, but each write is sequential and the absolute throughput is still vastly better than random B-tree updates.

Tiered compaction (Cassandra's SizeTieredCompactionStrategy, the default) keeps multiple overlapping SSTables per level. When a level accumulates enough similarly-sized SSTables, the engine merges them into one larger SSTable at the next level. Tiered slashes write amplification (each key is rewritten only when its tier compacts, ~3–10× total) but raises space amplification (overlapping tables hold duplicates; 2–3× space amp is common) and read amplification (a read may consult several SSTables per level). Cassandra also offers LeveledCompactionStrategy for read-heavy tables.

Universal compaction (RocksDB option) is a hybrid: it keeps SSTables of geometrically increasing size in a single level. New flushes append small SSTables; older SSTables are larger. Compaction merges adjacent SSTables when their sizes drift out of the geometric ratio. Universal trades a slightly higher read amplification for lower write amplification than levelled while keeping space amplification close to levelled's.

Strategy	Used by (default)	Write amp	Read amp	Space amp
Levelled	RocksDB, LevelDB	10–30×	L+1	1.1–1.3×
Tiered (size-tiered)	Cassandra, ScyllaDB	3–10×	L × T	2–3×
Universal	RocksDB (option), MyRocks	5–15×	medium	1.3–1.8×
FIFO	RocksDB (TTL workloads)	1×	high	1×
Time-window	Cassandra TWCS	3–8×	low for time-series	1.2×

Bloom filters: the read-path saver

~10 bits per key buys 99% fewer disk reads on misses.

The single most important optimisation on the LSM read path is the per-SSTable bloom filter. Without it, a read for a key that doesn't exist has to consult the memtable plus every SSTable in every level: potentially a dozen file probes for a definitive"no". With it, most of those probes are eliminated by a memory check before the file is even opened.

A bloom filter is a fixed-size bit array with k hash functions. Insert: hash the key k times and set those k bits. Test: hash the key k times; if all k bits are set, the key might be present; if any bit is unset, the key is definitely absent. The asymmetry (guaranteed"no" but probabilistic "yes") is exactly what an LSM tree needs. A false positive costs one extra SSTable read; a false negative would lose data, and bloom filters never produce false negatives.

The math is tight. For n keys, m bits, and k optimal hash functions, the false-positive rate is approximately (1 − e^−kn/m)^k. Plugging in: 10 bits per key (m/n = 10) with k=7 gives ~1% false positives; 16 bits per key gives ~0.05%; 20 bits per key gives ~0.005%. RocksDB defaults to 10 bits/key, configurable via bloom_filter_policy. The memory cost for a billion-key dataset at 10 bits/key is ~1.25 GB, easily affordable.

The classical bloom filter has limitations: it can't be resized, can't delete keys, and its false-positive rate is fixed at construction time. Ribbon filters (Dillinger and Walzer, 2021), now the RocksDB default for new SSTables, offer the same false-positive rate at lower memory cost (about 30% less). Cuckoo filters support deletion. XOR filters are static but faster to query. The bloom-filter family is the read-path's quiet hero.

Why bloom filters can't lie one way

A bloom filter can return "maybe yes" when the answer is no (false positive: costs an extra disk read; not catastrophic). It cannot return "no" when the answer is yes. That would lose data. The construction guarantees this: if a key was inserted, its k bits are set; if a test finds all k bits set, the result is "maybe"; if any of the k bits is unset, the key was never inserted, full stop. The asymmetry is structural, not statistical.

Reads: what the worst case looks like

Memtable, then newest SSTable downward, until the first hit.

A read in an LSM tree always returns the latest version of a key, the version most recently written, whether that lives in the memtable or in some SSTable on disk. The read protocol enforces this by consulting sources in newest-first order and stopping at the first match.

The order is: memtable → immutable memtable (during flush) → L0 SSTables (newest to oldest) → L1 → L2 → .... The memtable is always checked first because it holds the newest writes. L0 SSTables can overlap in key range (they were each a separate memtable flush), so they must be checked individually, newest first. L1 and below have disjoint key ranges within a level, so at most one SSTable per level can contain the key; the engine binary-searches the level's table-list to find it.

Without bloom filters, the worst case for a point lookup that misses every SSTable is 1 + L₀ + (NUM_LEVELS − 1) probes: one for the memtable, one per L0 SSTable, and one per deeper level. For a default RocksDB configuration with 4 L0 SSTables and 6 levels, that's 11 file probes for a definitive "key absent". Bloom filters reduce this to roughly 1.05 file probes on average; almost every probe is a memory-only bloom check that immediately returns "no".

The hit case is asymmetric. A key in the memtable returns instantly (one in-memory lookup). A key in L0 needs a memtable check plus one or more L0 lookups; a key deep in L2 or L3 needs to traverse every level above it. Worst-case read latency for a hit is therefore proportional to how deep the key has migrated; newer keys are faster to read than older ones. This is a structural property of LSM trees with implications for cache design: a small block cache disproportionately helps hot recent keys.

Range scans are more painful. A scan must merge sorted streams from every level (memtable plus each SSTable that overlaps the range) using a heap-based merge iterator, deduplicating by key with the newest value winning. The complexity is O((k + L) log L) where k is the number of result keys and L is the number of sources. For wide scans this is slower than a B-tree's leaf-link walk, which is one reason analytical workloads on Cassandra and friends benefit from secondary indexes, materialised views, or external OLAP.

Deletes: why LSM trees can't delete in place

The cost of immutability.

SSTables are immutable. You cannot reach into a file and remove a key: there is nothing to overwrite, no in-place mutation. Instead, the LSM tree treats deletes as a special kind of write: a tombstone, a marker that says "this key is dead". The tombstone propagates through the system exactly like any other write: it lands in the memtable, flushes to an L0 SSTable, migrates through compactions toward the bottom level.

A read encountering a tombstone before any other version of the key returns "not found"; the tombstone shadows older copies. During compaction, when a tombstone meets an older value with the same key in a deeper SSTable, the compactor drops both. Eventually, when the tombstone reaches the bottom level (the level with no deeper SSTable that could still hold an older copy), the compactor drops the tombstone itself. This is the only point at which the disk reclaims the space.

The implication for operators: a delete-heavy workload generates a lot of tombstones that linger until they reach the bottom level. Cassandra's tombstone_warn_threshold (default 1000) and tombstone_failure_threshold (default 100000) exist because a single partition with too many tombstones makes reads expensive: every read still has to walk every tombstone to confirm the key is dead. Range scans become particularly painful: a range with a thousand tombstones forces the read to materialise and discard each one.

The classic Cassandra anti-pattern is using LSM tables as a queue. Insert messages, mark them processed by deleting them, observe latency degrade exponentially as the tombstone backlog grows. The fix is either to use a different storage shape (Kafka, Redis Streams) or to lean on TTL: Cassandra's USING TTL and RocksDB's CompactOnDeletion hint let the engine drop entire SSTables once they age out, bypassing the tombstone propagation entirely.

A useful diagnostic

If your LSM-backed table has read latencies climbing while write throughput is steady and the data size is roughly flat, suspect tombstones. In Cassandra, nodetool cfstats reports tombstones per read; in RocksDB, db_bench and the LOG file expose tombstone counts. A 10× tombstone-to-live ratio on a hot partition is an incident, not a quirk.

When LSM wins, and when it doesn't

Workload picks the shape; not the other way round.

LSM wins for write-heavy workloads. Telemetry ingest, application logs, audit trails, IoT sensor streams, message brokers, write-once-read-rarely archives. Every write is sequential; compaction can be tuned, throttled, scheduled. RocksDB sustains 200k–1M writes per second per node on commodity hardware; Cassandra's tiered compaction scales linearly with cluster size. If your write rate is high relative to your read rate, an LSM tree is the answer.

LSM wins for time-series. Time-series data arrives roughly in time order, ages out after a TTL, and is queried mostly by recent range. Cassandra's Time-Window Compaction Strategy (TWCS) exploits this: it buckets SSTables by their write timestamp, only compacts within the same window, and drops entire old windows when they expire. InfluxDB's TSI and Prometheus' TSDB use specialised LSM variants for the same reason.

LSM wins for append-mostly KV. Session stores, feature stores, write-mostly caches, blob metadata. If updates are rare and reads are mostly by primary key, the LSM tree's lower write amplification dominates.

LSM loses on wide range scans. The merge across multiple SSTables makes wide scans slower than a B+ tree's leaf-link walk. Cassandra users routinely add materialised views (which precompute the scan offline) or punt to an OLAP system. Snowflake, BigQuery, ClickHouse: all use columnstore formats instead.

LSM loses on low-latency single-key reads with cold cache. The B-tree's bounded read path (always one root-to-leaf traversal) gives more predictable tail latencies than an LSM tree where a single read might consult several SSTables. For mission-critical OLTP with strict p99 latency SLOs, B-trees still hold the field. Postgres, MySQL InnoDB, and Oracle aren't going anywhere.

LSM loses on workloads that update the same key repeatedly. The LSM model is wasteful here: every version lives somewhere until compaction reclaims it. A B-tree updates in place; one row of state, one page write. For counters, leaderboards, and other hot-update workloads, consider an in-place engine (Redis, Postgres) or a specialised counter (Riak's CRDT counters).

Real systems and tuning: RocksDB, Cassandra, ScyllaDB

The knobs that move the needle in production.

RocksDB is the LSM-tree reference. It powers MyRocks, CockroachDB's Pebble fork, TiKV, ArangoDB, YugabyteDB, and dozens of internal Facebook/Meta systems. Five knobs dominate tuning. max_bytes_for_level_base sets the L1 target size (default 256 MB); levels below scale by max_bytes_for_level_multiplier (default 10). level_compaction_dynamic_level_bytes (enable this: the default is off for legacy reasons but on is better) adjusts level sizes to keep the bottom-level large and shallow, reducing write amp by ~30%. compaction_pri selects which SSTable to compact next; kMinOverlappingRatio is the modern default and best for most workloads.

write_buffer_size (memtable size, default 64 MB) and max_write_buffer_number (how many immutable memtables can queue before writes stall, default 2) together control write latency under load. Larger memtables = larger SSTables = less compaction work, but more memory pressure and longer flush stalls.

ScyllaDB is Cassandra's spiritual successor: a C++ rewrite using a shared-nothing thread-per-core model on top of Seastar. It defaults to size-tiered compaction (low write amp at the cost of higher space amp), which matches Cassandra's defaults and the workloads that drove the Cassandra design. ScyllaDB users tune memtable_total_space_in_mb, compaction_throughput_mb_per_sec (Scylla autotunes by default), and the per-table compaction strategy.

Cassandra exposes the most operator-visible LSM controls of any production system. Per-table you can choose between SizeTieredCompactionStrategy (default; low write amp), LeveledCompactionStrategy (lower space amp, higher write amp; good for read-heavy tables), and TimeWindowCompactionStrategy (time-series). compaction_throughput_mb_per_sec throttles background compaction to leave I/O for foreground reads; the default of 16 MB/s is conservative on modern SSDs and most operators raise it.

LevelDB is the simpler precursor; widely embedded but rarely operationally tuned. Its design choices (single-threaded compaction, snappy-only compression, no column families) make it a clean reference implementation but a poor production engine compared to RocksDB. Read the LevelDB code first if you're learning the LSM model; run RocksDB if you're shipping production.

Pebble (Cockroach Labs) is a Go reimplementation of RocksDB designed for Go's GC and concurrency model; it ships as CockroachDB's default storage engine. WiredTiger (MongoDB) is hybrid B+tree-LSM. FoundationDB's Redwood storage engine is a B+tree with LSM-inspired update buffering. The pure-LSM-vs-pure-B-tree divide is increasingly an oversimplification of what production storage engines actually do.

LSM Tree Simulator: when writes outpace reads.

What is an LSM tree?

LSM vs B-tree: the trade-off

Origins: Patrick O'Neil et al., 1996

The three amplifications: write, read, space

Memtable: the in-memory write buffer

SSTable: the immutable on-disk file

Compaction strategies: levelled, tiered, universal

Bloom filters: the read-path saver

Reads: what the worst case looks like

Deletes: why LSM trees can't delete in place

When LSM wins, and when it doesn't

Real systems and tuning: RocksDB, Cassandra, ScyllaDB

Further reading on LSM trees