B-tree vs LSM-tree Storage Engine Visualizer: two read and write paths.

B-trees mutate in place: predictable reads, costly random writes. LSM trees append: cheap writes, more work to read. Pick by your write/read ratio.

Engine
LSM

WAL · append-only durability log
000 k1=A
001 k3=C
002 k2=B
003 k5=E
004 k4=D
005 k7=G
006 k6=F
Memtable · sorted in-memory map
k4=Dk6=Fk7=G
SSTables · immutable on disk (newest first)
L0
k1=Ak2=Bk3=Ck5=E
Log
PUT k6=F → WAL + memtable
PUT k7=G → WAL + memtable
PUT k4=D → WAL + memtable
memtable flushed → new SSTable
PUT k5=E → WAL + memtable
PUT k2=B → WAL + memtable
PUT k3=C → WAL + memtable
PUT k1=A → WAL + memtable

What you're looking at

One key-value store, two engines you switch between. In LSM mode the screen shows the three places a write lands: the append-only WAL for durability, the sorted in-memory memtable, and the stack of immutable SSTables on disk, newest first. In B-tree mode there is just one sorted structure, edited in place. Type a key and value and hit Put, or Get to look one up; Seed loads a handful of keys at once. The log at the bottom narrates each operation and tells you where a Get found its hit.

Seed the store in LSM mode and watch the memtable fill; on the fourth key it flushes to a new SSTable and empties. Now Get a key written early — the log shows it found in an SSTable, after the memtable was checked and missed. Switch to B-tree and Put the same keys: every write slots straight into one sorted list, no flushing, no layers. What should surprise you is the asymmetry: the LSM turned cheap appends into extra work at read time, while the B-tree did the opposite. That is the central trade between the two designs.


What is a storage engine?

Where to put the bytes.

Imagine you're building a key-value store. The user calls put("alice", 42) and expects, a millisecond later, a successful return. The user also expects get("alice") to return 42. Behind the scenes you have a disk — a few hundred GB of NVMe SSD, say — and a small amount of RAM, maybe 16 GB. The question is what to do with those bytes.

The simplest answer — “keep a hash map in memory and write the disk file from scratch on every change” — fails immediately. Even a billion-entry map is a few hundred GB, and rewriting a 200 GB file on every put would burn through SSD wear in a week. The next instinct — “keep the data sorted in pages on disk and edit the right page on every put” — gets us closer. That's the B-tree: an ordered tree where each node holds a few hundred keys plus pointers to children, edited in place when keys change. B-trees were invented in 1972 (Bayer and McCreight at Boeing) for exactly this problem; almost every relational database uses one — Postgres, MySQL/InnoDB, SQLite, SQL Server — for its primary storage.

B-trees are great at reads. A point lookup on a billion rows touches only three to five 16 KB pages because the fan-out is so high. But updates are random: writing key “alice” means seeking to wherever “alice” lives in the tree and modifying that page. On a write-heavy workload — say, ten thousand puts per second — you're doing ten thousand random 16 KB writes per second on the disk. Spinning rust hated this; even modern SSDs slow down once enough random updates accumulate.

The other answer was published in 1996 by Pat O'Neil and three coauthors: never edit anything in place. Append every put to an in-memory buffer; once the buffer fills, flush it to disk as one big sorted file; merge the files in the background. Reads have to consult several files, but they're sequential. Writes are pure appends — the fastest thing a disk can do. This is the Log-Structured Merge-Tree (LSM), and it's what powers Cassandra, RocksDB, LevelDB, ScyllaDB, CockroachDB, TiKV, and the message-queue state stores in Kafka.

Neither design is universally better. B-trees pay one disk write per logical write but read in O(log n); LSM trees pay 10–30× write amplification (each key gets rewritten by background compaction) but absorb writes at sequential-IO speed and serve reads with bloom filters that almost always hit the right file. The choice between them is a deep design decision and is the topic of every modern database textbook. The simulator above lets you watch both write paths side by side: identical input, completely different on-disk shape.

B-TREE vs LSM · WRITE PATH FOR THE SAME PUT(KEY, VALUE)B-TREEput(alice, 42)find leaf for “alice”RANDOM 16KB writeLSMput(alice, 42)memtable + WAL appendSEQUENTIAL appendSAME INPUT · ONE EDITS A PAGE · ONE APPENDS TO A LOG

B-tree vs LSM-tree — two ways to put bytes on disk

Two ways to put bytes on disk.

Every transactional storage engine in production today descends from one of two ideas. The B-tree (Bayer and McCreight, Boeing Scientific Research Labs, 1972) edits the page where the key lives — one disk write per update. If the page is full it splits; concurrent writers fight for the same page. The Log-Structured Merge-Tree (O'Neil, Cheng, Gawlick, O'Neil, ACM TODS, 1996) appends every write to an in-memory buffer (the memtable), flushes that buffer to immutable sorted files (SSTables), and merges files in the background. Writes never block on a page lock; reads must consult several files.

The trade-off matters because storage hardware has changed underneath both designs. In 1972 a random disk seek cost 25 ms; sequential reads were also expensive. The B-tree's high fan-out (200–500 keys per node) was a direct response: any data structure forced into 8–30 random reads per lookup was unusable. The 1996 LSM paper noted that disks had become better at sequential than random by perhaps 100×, and proposed exploiting that asymmetry by writing only sequentially. Modern NVMe SSDs have closed the random-vs-sequential gap to roughly 10×, and DRAM-backed working sets often dominate either choice anyway, but the design split persists.

PropertyB-treeLSM-tree
Read ampLow — one page chain.High — checks every level.
Write amp~2× (page + WAL).10–30× (compactions).
Space amp~1.3× (fragmentation).~1.1× (after compaction).
Best atRead-heavy, random IO.Write-heavy, sequential IO.

Production engines blur these categories. WiredTiger (MongoDB's default since 4.0, December 2018) is a B+tree but borrows the LSM playbook for its journal. MyRocks (Facebook, 2016) bolts RocksDB's LSM behind MySQL's SQL surface, getting LSM write economy with B-tree-shaped queries. SQL Server columnstore uses a delta-store front end (LSM-shaped) feeding into compressed columnar segments. The pure-B-tree-vs-pure-LSM dichotomy is increasingly an oversimplification of what production storage engines actually do.

A simulator like the one above can't capture the full complexity, but it does illustrate the core asymmetry: every B-tree write touches the leaf page directly, with attendant locking and possible split; every LSM write goes to a small in-memory map plus a sequential log. After many writes, the LSM has accumulated multiple SSTables that reads must merge; the B-tree has the same on-disk shape it would have had with one write. The two designs trade work between write time and read time, with compaction in the LSM and page splits in the B-tree as the deferred costs.


Read, write, and space amplification — pick two

Pick two, always pay for the third.

Every storage engine is shaped by the trilemma between read amplification, write amplification, and space amplification. Read amp is bytes read per logical read — ratio of physical bytes pulled from storage to bytes the user actually wanted. Write amp is bytes written per logical write, including WAL, page rewrites, and compaction merges. Space amp is bytes on disk per logical byte, including fragmentation and obsolete data not yet reclaimed. You can optimise any two; the third pays for it.

RUM conjecture (Athanassoulis et al., 2016)

A formal version of the trilemma, presented at EDBT 2016 by Athanassoulis, Kester, Maas, Stoica, Idreos, Ailamaki, and Callaghan. The paper proves no general-purpose data structure can simultaneously minimise all three; every real engine picks two and accepts the third. B-trees minimise read + space; LSM-trees minimise write + space; bitmap indexes minimise read + write. The conjecture has held up in subsequent literature.

LSM write amplification is large and worth understanding. A key written once to the memtable will, in a typical leveled-compaction LSM, be rewritten roughly k × L times, where L is the number of levels and k is the size ratio between levels. RocksDB's default L=7, k=10 yields about 70× write amp in the worst case — though typical workloads see 10–30× because not every key reaches the deepest level. This is real disk wear: an SSD rated for 1 PB of writes degrades 30× faster under an LSM than under a B-tree of the same logical write rate.

B-tree read amp is bounded and predictable. A point lookup on a billion-row table touches height pages, typically 3–5 with InnoDB's 16 KB pages and 200–300 keys per internal node. Range scans hit one extra page per B rows, where B is the leaf fanout. LSM read amp is more variable: a point lookup checks the memtable, then every level, with bloom filters reducing average cost; the worst case is 1 + L probes. Range scans on LSM are particularly expensive because every level must contribute its overlapping keys to the merged stream.

Space amp deserves a little more attention. B-tree fragmentation accumulates as pages split unevenly under random insertion; the steady-state load factor for a B-tree built by random inserts is about 70%, meaning the on-disk size is roughly 1.4× the “packed” size. After bulk reload (or CLUSTER in Postgres), pages are nearly full again. LSM space amp depends on compaction policy: leveled compaction keeps space amp around 1.1× (only one copy per level); size-tiered allows 2× or worse because obsolete versions linger across multiple tiers. The Cassandra blog has good visualisations of this if you've never seen the phenomenon plotted over time.

LSM TREE · MEMTABLE → WAL ON DISK → SSTABLES IN LEVELS L0…L2memtable (RAM)WAL (append)L0 (overlap)sstsstsstsstL1 (sorted, non-overlapping)a..hi..pq..zL2 (~10× larger than L1)READ PATH · CHECK IN ORDER, PROBE BLOOM PER LEVEL1. memtable2. L0 (probe each, may overlap)3. L1 (binary-search 1 sstable)4. L2 (binary-search 1 sstable)avg ≈ 1 disk read with bloom hit

Storage engines in the wild — names and lineages

Names and lineages.

  • InnoDB · MySQL/MariaDB. Clustered B+tree primary index, 16 KB pages, redo log + undo log + change buffer. Probably the most-deployed B-tree in the world: hundreds of millions of MySQL instances run it.
  • Postgres heap + B-tree · Postgres separates a heap (unordered table file) from B-tree indexes that point into the heap. MVCC tombstones live in the heap (the famous “vacuum” problem). 8 KB pages by default; nbtree implementation cites Lehman & Yao 1981 directly in the source.
  • RocksDB · Embedded LSM by Facebook (2012, fork of LevelDB), still the dominant LSM library. Powers MyRocks, CockroachDB (until 2020), TiKV, YugabyteDB, Kafka Streams' state store, and many internal Facebook systems. Pluggable compaction strategies; written in C++.
  • LevelDB · Google, 2011. The simpler ancestor of RocksDB, written by Sanjay Ghemawat and Jeff Dean. Single-writer, no replication, no transactions across keys; the kernel of the LSM design space and a 6,000-line C++ exemplar.
  • Pebble · CockroachDB's pure-Go RocksDB rewrite (2020). Same LSM model, drop-in compatible SSTable format, better operational story for Go services. Cockroach migrated production fleets without downtime.
  • WiredTiger · MongoDB's default storage engine since version 4.0 (2018). A hybrid: B+tree row store as default, columnar store optionally, with LSM techniques in the journal. Acquired by MongoDB in 2014 from the WiredTiger team led by Keith Bostic.
  • Cassandra SSTable · Pure LSM, originally inspired by both Bigtable and Dynamo. Files are immutable per memtable flush; compaction strategies (size-tiered, leveled, time-window) are configurable per table.
  • SQLite · Single-file B-tree. The most-deployed database engine on Earth — every browser, every iOS app, every Android app, every aircraft system that needs persistence. The internal data structure is a B+tree variant, plus rollback journal or write-ahead log for atomicity.

One worth-mentioning outlier: LMDB (Symas, 2011) and BoltDB (Ben Johnson, 2014, archived) are copy-on-write B-trees rather than mutate-in-place. They never modify a page; every write clones the path from root to leaf, atomically swapping the new root in. The result: trivial crash safety, snapshots for free, no separate WAL needed. The trade-off: every write touches O(log n) pages, doubling space usage during long readers. Used in etcd's bbolt fork and FoundationDB's storage engine.

And another: Bw-tree, designed at Microsoft Research for SQL Server's Hekaton in-memory engine (Levandoski, Lomet, Sengupta, ICDE 2013). Updates append delta records to a per-page log; periodically the system consolidates the deltas back into a fresh base page. Latch-free reads and CAS-based writes make it well-suited to high-concurrency workloads on large core counts. Used in production by SQL Server's in-memory OLTP engine and the CMU NoisePage research database. The cost is implementation complexity: Bw-tree codebases are notoriously hard to modify, even by their authors.

Tokutek's Fractal Tree (now part of Percona's TokuDB and TokuMX) sits between B-tree and LSM: internal nodes carry message buffers; inserts write to the root buffer; messages flush downward when buffers fill. Drastically lower write amplification than a vanilla B-tree, lower than LSM in many configurations, at the cost of a more complex flush scheduler. TokuDB shipped commercially before LSM trees won the write-heavy market; the algorithmic ideas live on in academic literature even where the products have faded.


LSM-tree compaction — how LSMs stay healthy

How LSMs stay healthy.

Compaction is the LSM's housekeeping. Without it, an LSM grows unbounded: every memtable flush adds a new SSTable; reads scan more files; space amplification increases. Three compaction strategies dominate, each picking a different balance among the three amplifications.

Size-tiered compaction (Cassandra's default) groups SSTables by similar size. When N files of similar size accumulate, they are merged into one larger file. Read amp is moderate (typically 4–8 files probed per read); write amp is low (each key is rewritten O(log n) times); space amp is high — obsolete versions of a key linger across multiple tiers until merged. Best for write-heavy workloads with limited read concurrency.

Leveled compaction (LevelDB and RocksDB's default) maintains a strict size invariant: level L is roughly 10× the size of level L−1, and within each level keys are non-overlapping. Read amp is low (one file per level, plus the memtable); space amp is low (only one copy per level); write amp is high — a key descending from level 1 to level 6 is rewritten roughly 50×. Best for read-heavy workloads with bounded space.

Universal compaction (RocksDB option) is a tunable middle ground: like size-tiered, but with explicit limits on the number of sorted runs. RocksDB also offers FIFO compaction for time-series workloads where the oldest data is dropped wholesale. Time-window compaction in Cassandra (3.0+, designed for time series like metrics) groups SSTables by time bucket so old data is compacted together and dropped together.

The choice of compaction strategy is the single largest knob on an LSM's behaviour. Switching from size-tiered to leveled in Cassandra cuts read latency by 5–10× on read-heavy workloads but doubles disk write bandwidth. The advice from RocksDB's wiki: leveled for OLTP, universal for write spikes you can't tolerate stalling, FIFO or TWCS for time-series. Production teams running RocksDB-backed databases (TiKV, Cockroach pre-Pebble, Yugabyte) spend significant time tuning compaction; ScyllaDB's blog has many deep dives on the topic.

A failure mode worth flagging: compaction backpressure. If compaction can't keep up with ingestion, L0 grows unbounded, reads slow down, and the system enters a write stall. RocksDB exposes level0_slowdown_writes_trigger and level0_stop_writes_trigger precisely for this; Cassandra has analogous knobs. Operators who don't understand their compaction throughput can be surprised by sudden write latencies of seconds during normal traffic spikes. The fix is more cores or faster disks for compaction, not write-rate limiting in the application.

B-TREE vs LSM · WRITE PATHS COMPAREDB-TREE · UPDATE-IN-PLACEPUT key, valueWAL append + page rewriteRANDOM 16KB I/O · ~1× write ampLSM · APPEND, FLUSH, COMPACTPUT → memtable + WALflush memtable → L0 sstableSEQUENTIAL I/O · 10–30× write amp

Bloom filters and tiered storage — avoiding the disk

Avoiding the disk whenever possible.

LSM read amp would be unbearable without bloom filters. Each SSTable carries a bloom filter (Burton Bloom, CACM 1970) summarising its key set. Before opening the SSTable, the read path checks the bloom filter; a negative result means the key is definitely absent and the file can be skipped. Default RocksDB configuration is 10 bits per key, giving roughly a 1% false-positive rate; tuning to 16 bits drops FPR to 0.05% at the cost of more memory. For a 1 TB dataset with 100-byte keys, the difference is 100 MB vs 160 MB of bloom-filter memory — a tiny fraction of the dataset.

Modern LSMs additionally use partitioned bloom filters (one filter per data block within an SSTable) and cuckoo filters (Fan, Andersen, Kaminsky, Mitzenmacher, 2014) for slightly better cache behaviour. RocksDB's full filter format consolidates per-block filters into a single block-aligned region, reducing memory fragmentation. The differences are significant only at the largest scale, where the bloom-filter memory itself becomes a meaningful fraction of total RAM.

Tiered storage — hot data on NVMe, warm on SATA SSD, cold on object storage like S3 — is the production answer to “dataset is bigger than my hot disks.” RocksDB's tiered storage (introduced 2021) places older SSTables on a slower medium and keeps the index in memory. ScyllaDB and Cassandra offer per-table tier configuration. The pattern for time-series data is even simpler: older time buckets sit on cheaper storage, queried only when historical analyses run.

Beyond LSM and B-tree, analytical workloads use a different family entirely: columnar formats. Apache Parquet (Twitter and Cloudera, 2013) and Apache Arrow (in-memory columnar, 2016) store values for the same column contiguously, enabling vectorised compression and SIMD scans. Snowflake, BigQuery, Redshift, ClickHouse, and DuckDB are all columnar at heart. The pattern: OLTP systems use B-trees or LSMs for point queries; OLAP systems use columnar formats for scans; transactional systems that need both run a hybrid (Snowflake's Hybrid Tables, Postgres + Citus column extensions).

EngineWrite ampRead ampSpace ampBest for
B+tree (InnoDB, Postgres)~1×~1× (height = 3–5)~1.4× (70% load)OLTP, range scans, mixed
LSM leveled (RocksDB)10–30×~L+1 with bloom~1.1×read-heavy at scale
LSM size-tiered (Cassandra)5–10×4–8 sstables/read~2×write-heavy
CoW B-tree (LMDB, bbolt)~log(n)~1×~2× with old rootsembedded, snapshot reads
Columnar (Parquet, ClickHouse)~1× (immutable)column blocks only~0.3× (compressed)analytics, scans
Disk wear matters

A consumer NVMe SSD rated for 600 TBW (TeraBytes Written) lasts 6 years under a B-tree workload of 100 GB/day — or 90 days under a leveled LSM with 30× write amp at the same logical rate. Production teams running RocksDB on cheap flash either bump up to enterprise-grade media or actively monitor the SSD's smartctl wear indicator.


Storage engine tuning — five knobs that matter

Five knobs that matter.

A storage engine is the layer of a database that persists data to disk and reads it back — the part below SQL, transactions, and replication. The two dominant designs are the B-tree (PostgreSQL, MySQL InnoDB, SQLite) and the LSM-tree (RocksDB, LevelDB, Cassandra, ScyllaDB, HBase, Pebble). They make opposite choices about where to spend time: B-trees write in place and pay on writes; LSM-trees append everywhere and pay on reads through compaction. The simulator below contrasts the two read/write paths so the trade-off is visible in real time.

Most “my database is slow” tickets resolve to one of five tuning knobs being wrong for the workload. The defaults assume a balanced read/write ratio — most production workloads aren't.

KnobEffectTune up when
Buffer pool sizeIn-memory page cacheCache hit ratio < 95% on read-heavy workload.
WAL flush modeDurability vs throughputsynchronous_commit=off trades durability for 10× write throughput on Postgres.
Compaction parallelism (LSM)CPU vs read latencyRead amplification (level count) > 4. Burns CPU but cuts reads.
Page / block sizeRead amp vs write ampSmaller (4 KB) for OLTP point lookups; larger (16–64 KB) for analytics scans.
Bloom filter bits/keyFPR vs memoryRocksDB default 10 bits ≈ 1% FPR. 16 bits ≈ 0.05%. Tune up when point-read latency dominates.
The single most useful Postgres knob

shared_buffers = 25% of RAM (default is often 128 MB — laughably small). Plus effective_cache_size = 75% of RAM (a hint to the planner about OS page cache). These two are usually the difference between a 10 ms and a 200 ms query plan on a hot dataset.

RocksDB exposes a similar set of knobs under different names. block_cache_size is the equivalent of a buffer pool; max_background_compactions controls compaction parallelism; target_file_size_base tunes SSTable size; level0_file_num_compaction_trigger controls when L0 starts compacting. The RocksDB tuning guide explicitly admits that “the answer depends on your workload, full stop,” which is correct but unhelpful for new operators.

# RocksDB Options — production-shaped read-heavy OLTP profile.
[CFOptions "default"]
  compaction_style = kCompactionStyleLevel
  level_compaction_dynamic_level_bytes = true
  write_buffer_size = 64MB           # per-cf memtable
  max_write_buffer_number = 4
  target_file_size_base = 64MB        # L1 sstable size
  max_bytes_for_level_base = 256MB    # L1 total
  max_bytes_for_level_multiplier = 10 # L2 = 10×L1, L3 = 10×L2…
  level0_file_num_compaction_trigger = 4
  level0_slowdown_writes_trigger     = 20
  level0_stop_writes_trigger         = 36
[TableOptions/BlockBasedTable "default"]
  block_size                  = 16384
  cache_index_and_filter_blocks = true
  bloom_filter_bits_per_key   = 10    # ~1% FPR
  filter_policy = "rocksdb.BuiltinBloomFilter"

Choosing a storage engine for your workload

Choosing the engine for the workload.

OLTP transaction stores with mixed read-write — user records, financial ledgers, e-commerce carts — remain B-tree territory. Postgres, MySQL, Oracle, SQL Server, SQLite. The reasons are the same as in 1972: bounded read amplification, predictable update cost, mature concurrency-control machinery (MVCC, locking, hot standby).

Write-heavy time series, telemetry, log ingestion, and event streams favour LSM — or a system built on top of one. Cassandra, ScyllaDB, InfluxDB (which has its own LSM-shaped engine called TSI), Prometheus's local TSDB, and Kafka's append-only log fall here. The workload writes 100× more than it reads; sequential write throughput matters more than read latency; LSMs absorb the sequential pattern almost optimally.

Embedded systems — mobile apps, browsers, edge devices — choose B-tree because the dataset is small, writes are infrequent, and operational simplicity matters. SQLite, BoltDB, LMDB. Embedded LSMs (LevelDB, RocksDB) exist and are used by Chrome's IndexedDB, but the operational friction of compaction is a real cost on a phone or laptop.

Analytical workloads — data warehouses, BI, ML training data — abandon both shapes for columnar formats. Snowflake, BigQuery, Redshift, ClickHouse, DuckDB. The argument: scan-heavy queries that read millions of rows but only a handful of columns benefit enormously from columnar layout, vectorised compression, and predicate pushdown. Row-store engines (B-tree or LSM) waste cache and bandwidth reading bytes the query doesn't want.

One last category worth flagging: append-only logs. Kafka stores messages in segment files, never modifying them, with a separate index for offset lookup. Apache BookKeeper does similar work for distributed write-ahead logs. Neither is “a storage engine” in the OLTP sense, but they share LSM's append-friendly philosophy and influenced its design. Kleppmann's Designing Data-Intensive Applications draws the connection explicitly in chapter 3 and chapter 11.

And one anti-pattern worth flagging: using a row-oriented OLTP database for analytical scans over many millions of rows. The query plan ends up reading 100% of every row's bytes when only a few columns are needed, blowing the buffer pool, slowing every concurrent OLTP request, and finishing in minutes rather than seconds. The right answer is a separate analytical store, replicated from the OLTP system. Maintaining that replica costs operational complexity but avoids the worst tail latencies on the transactional path.


Further reading on storage engines

Primary sources, in order.

Found this useful?