The log-structured merge-tree, the paper under every modern key-value store.

In 1996 this read as a niche result: a clever index for History tables, written for a world of 1,000-dollar disk arms. The database community mostly moved on. Ten years later Bigtable built its storage layer on exactly this shape, LevelDB open-sourced it, and today the LSM-tree sits under RocksDB, Cassandra, ScyllaDB, HBase, and most of the embedded KV ecosystem. The original PDF is freely hosted by Patrick O'Neil's UMass Boston page — read it here — and below is a guide to what it actually says, versus what production systems later made of it.

Authors Patrick O'Neil, Edward Cheng, Dieter Gawlick, Elizabeth O'Neil

Year 1996

Venue Acta Informatica 33

Original PDF →

TL;DR

A B-tree updates records in place, which makes every insert a random disk write. For a fast-growing, index-everything workload, the disk arm becomes the cost of the whole system. The LSM-tree's answer: never update in place. Buffer recent writes in a memory component (C₀), and migrate them to one or more disk components (C₁, …, C_K) of geometrically increasing size through a continuous background merge, "reminiscent of merge sort". All disk writes become large sequential transfers; the cost of indexing an insert drops by the batching factor of the merge. Reads pay instead: a lookup may have to check every component, newest first.

If you want the mechanics interactively before the maths, the site has a step-through LSM-tree simulator and a study page on LSM internals as they exist in production engines. This page stays close to the paper itself.

The problem: the History table

The paper opens with a deliberately mundane example, a modified TPC-A bank workload running 1,000 transactions per second. Each transaction appends a 50-byte row to a History table, and the business wants to query recent history by account. That means maintaining an index on Account-ID over a table that grows by 1,000 rows a second, with 100 million possible accounts. Each index insert lands on an effectively random B-tree leaf.

The arithmetic is brutal and the paper does it openly. A leaf page of that index gets touched far too rarely to stay cached, so by the five-minute rule (Gray and Putzolu's result on when a page is worth buffering in memory rather than re-reading from disk) the working set cannot live in memory, and every insert costs real I/O: read the leaf, write the leaf. Maintaining the index in real time roughly doubles the I/O cost of the transaction and, on the paper's price model, inflates total system cost by up to fifty percent. You end up buying disk arms for their seek capacity while their storage capacity sits empty.

The framing question: disks are cheap per byte and expensive per seek, and sequential transfer is enormously cheaper per page than random access. Can an index be organised so that its maintenance consumes sequential bandwidth instead of seeks?

The two-component LSM-tree and the rolling merge

The basic design has two components. C₀ lives entirely in memory and is not page-structured at all; the paper suggests an AVL or (2-3) tree, since there is no reason to pay disk-page geometry for a structure that never touches disk. C₁ lives on disk and looks like a B-tree with the slack removed: nodes packed 100% full, leaf sequences laid out in contiguous 256 KB multi-page blocks so a single arm movement reads or writes many pages.

Inserts go to C₀, a memory operation. Durability comes from the transaction log the system is already writing; if the machine crashes, recent index entries are reconstructed from that log. This is the memtable-plus-write-ahead-log arrangement of every modern engine, described in 1996 under different names.

The interesting machinery is the rolling merge. When C₀ approaches its size budget, a merge cursor circulates continuously through the key space: it reads a multi-page block of C₁ leaves into memory (the "emptying block"), merges in the C₀ entries for that key range, and writes out freshly packed leaves to a new position on disk (the "filling block"). Old blocks are invalidated only after the new ones are safely written, which doubles as the crash-recovery story. When the cursor reaches the end of the key space it wraps around and starts again. Deletes are handled by inserting a tombstone (the paper calls it a delete node entry) into C₀, which annihilates the real entry when the cursor reaches it.

Reads in the original design. A find checks C₀, then C₁ (then C₂, … in the multi-component version), newest first, until the key or a tombstone turns up. There are no Bloom filters anywhere in the paper; a miss really does search every component. The authors are upfront that this only pays off "in applications where index inserts are more common than finds", which described History tables in 1996 and describes write-heavy ingest paths now. What changed in between is that production systems found cheap ways to make the reads tolerable too.

Multiple components and the size-ratio theorem

Two components have a tuning problem: merge efficiency is proportional to the size ratio between C₀ and C₁, so making merges cheap means making C₀ huge, and C₀ is bought in RAM. The fix is to interpose disk components: C₀, C₁, C₂, …, C_K, with an independent rolling merge running between every adjacent pair. Each entry now gets rewritten once per level it passes through, but every rewrite is a batched sequential transfer, and C₀ can stay small.

Then comes the result that quietly designed every storage engine you use. Theorem 3.1: with the memory budget S₀ and the largest component size S_K fixed, the total merge I/O rate H is minimised when all adjacent size ratios r_i = S_i/S_i−1 are equal to a single constant r. Component sizes should form a geometric progression, and the total I/O works out to H = (2R/S_p)·(K·(1 + r) − 1/2) for insert rate R and page size S_p.

When LevelDB makes each level 10× the size of the one above it, that 10 is a chosen point on this curve. When RocksDB's tuning guide tells you that more levels mean less space amplification but more write amplification, it is restating the trade between K and r. The theorem is the reason "level multiplier" is a single number in every config file rather than a per-level schedule.

The cost argument against B-trees

Section 3 builds a deliberately economic model: COST_P is the disk-arm cost of providing one random page I/O per second, COST_π the much smaller cost of a page I/O delivered as part of a multi-page block, and COST_m the cost of a megabyte of memory. The B-tree pays COST_P twice per insert. The LSM-tree pays COST_π a handful of times per entry, divided by the batching factor of the merge, because each merge page carries many new entries at once.

This is the same lineage of reasoning as the five-minute rule, and the paper leans on it explicitly, in the introduction to show the History index cannot be buffered away, and again in the conclusions to answer the obvious objection: why not just give the B-tree the same memory as a page cache? The answer is that buffering only removes the read before the write; the write itself is still a random page I/O for a single entry. The LSM-tree wins on two multiplied factors that no cache provides: full pages of entries per write, and multi-page blocks per arm movement.

The honest caveat, also from the paper: the advantage is an advantage in disk arm cost. Where capacity rather than seek rate is the binding constraint, the B-tree is fine. The paper is a claim about a price ratio, not a universal ordering of data structures, which is exactly why its conclusions had to be re-derived when the hardware under it changed.

What survived, and what didn't

Reading this in 2026, the striking thing is how much of the skeleton is intact and how completely the flesh was replaced.

Survived: the architecture. Memory buffer fed by a write-ahead log, immutable sorted runs on disk, background merging, geometric level sizing, deletes as tombstones, the explicit trade of read cost for write cost. That is the whole modern LSM engine, and it is all in this paper.

Replaced: the rolling merge. No production system implements the cursor that rewrites a single B-tree-like C₁ in place. Bigtable swapped it for whole-file granularity: flush C₀ as an immutable sorted file (the SSTable), and run compaction jobs that read some files and write new ones. Immutable files make crash recovery trivial, snapshots free, and concurrency a matter of reference counting, all things the paper spends its hardest sections (4 on concurrency and recovery) labouring over for the in-place design. LevelDB carried the SSTable pattern into open source and it became the default everywhere.

Added: Bloom filters. The paper's reads search every component on a miss. Modern engines attach a Bloom filter to each run, so a point lookup skips nearly every file that cannot contain the key. This single addition is what turned the LSM-tree from "write-optimised index for History tables" into a general-purpose engine; Monkey (Dayan et al., SIGMOD 2017) later showed how to size the filters per level optimally.

Refined: the size-ratio analysis became the compaction-strategy menu. Levelled compaction (LevelDB, RocksDB's default) keeps one sorted run per level: better reads and space, more write amplification. Size-tiered compaction (Cassandra's default, RocksDB universal) lets several similar-sized runs accumulate before merging: cheaper writes, worse reads and space. Both are points in the design space the paper's Section 3 maths opened up, and papers like Dostoevsky (SIGMOD 2018) explore the continuum between them.

Reweighted: the hardware argument. The paper's enemy is the disk arm. On SSDs random reads are cheap, which weakens the original motivation, but write amplification now costs flash endurance and bandwidth, so the same merge accounting matters for a different reason. The trade survived its own premise.

Where it shows up today

LevelDB — Google's embedded engine, the first widely used open-source LSM, and the codebase most readers meet the design in.

RocksDB — Facebook's LevelDB fork, now the substrate for MyRocks, TiKV, Kafka Streams state stores, Flink state backends, and a long tail of infrastructure. Its tuning guide is essentially applied Section 3.

Cassandra and ScyllaDB — wide-column stores that pair Dynamo's ring for distribution with an LSM engine for storage; Cassandra's size-tiered default is the tiered branch of the family tree.

HBase, Pebble (CockroachDB's RocksDB rewrite in Go), Badger, and the state databases of most blockchain clients. If a system advertises high sustained write throughput on a single node, it is overwhelmingly likely to be one of these.

For how the pieces fit together mechanically, the site's LSM-tree internals page covers memtables, SSTables, compaction and tombstones as production engines do them, and the LSM-tree simulator lets you watch flushes and compactions happen.

Follow-up reading

The Log-Structured Merge-Tree (LSM-Tree) — O'Neil, Cheng, Gawlick & O'Neil · 1996 · Acta Informatica. The paper itself. Sections 1 to 3 carry the ideas; Sections 4 and 5 are mostly machinery the SSTable made unnecessary.
Bigtable: A Distributed Storage System for Structured Data — Chang et al · 2006 · OSDI. The system that turned the rolling merge into SSTables plus compaction and took the design to production scale. Annotated.
LSM-based Storage Techniques: A Survey — Luo & Carey · 2020 · VLDB Journal. The 25-year retrospective: compaction variants, filter alternatives, new-hardware adaptations.
Monkey: Optimal Navigable Key-Value Store — Dayan, Athanassoulis & Idreos · 2017 · SIGMOD. Per-level Bloom-filter budgeting; the cleanest modern treatment of LSM read cost.
WiscKey: Separating Keys from Values in SSD-Conscious Storage — Lu et al · 2016 · FAST. Keeps the LSM for keys, moves large values to a log; the write-amplification fix for big-value workloads.
ARIES: A Transaction Recovery Method — Mohan et al · 1992 · TODS. The in-place-update world the LSM-tree was reacting against, and the recovery discipline both worlds share through the log. Annotated.

More annotated papers

Back to the papers index

Foundational distributed-systems and database papers, read and annotated.

← All papers ↑ LSM-tree internals, the study page → LSM-tree simulator

Found this useful?