Bigtable, a sparse, sorted, persistent map.

Bigtable describes itself in one sentence: "a sparse, distributed, persistent multidimensional sorted map". Three sentences in, the paper has told you enough about the data model to imagine the API. What follows is one of the cleanest pieces of systems writing of the 2000s — and a blueprint that HBase, Cassandra and a generation of wide-column stores would copy almost verbatim.

Authors Fay Chang, Jeffrey Dean, Sanjay Ghemawat, et al

Year 2006

Venue OSDI

PDF →

TL;DR

Bigtable is a key-value store where each value is itself a structured map. Rows are sorted lexicographically and split into tablets that get distributed across servers. Writes go to a write-ahead log plus an in-memory memtable; reads merge the memtable with on-disk SSTables. Background compaction keeps the SSTable count bounded. Single-row transactions, multi-version values, and locality groups (column families on separate files) round out the design. Google ran it under Crawler, Maps, Earth, Personalised Search, Analytics, Writely — six years of production before they wrote it up.

The problem

By 2004 Google had a problem they couldn't solve with relational databases: their workloads were tens of petabytes of semi-structured data with extreme write rates and a need for predictable single-row lookups across thousands of machines. MySQL maxed out at one machine's capacity. Distributed transactions across thousands of nodes were both unnecessary (most operations touched a single row) and ruinously expensive.

They needed a storage layer that was scale-out by default, gave up multi-row ACID for raw throughput, and let different services tune the trade-offs (compression, in-memory caching, latency vs throughput) for their workload. Bigtable was that layer.

The key idea

The data model is the central insight. Every value is addressed by (row, column-family, column, timestamp). Rows are sorted strings, so ranges are cheap. Columns are grouped into families, which are the unit of access control and on-disk storage. Each cell holds multiple timestamped versions — older versions can be garbage-collected by age or by version count.

The storage engine underneath is an LSM-tree, predating the LSM-tree paper's wide adoption by a decade. Writes hit a write-ahead log on GFS plus an in-memory memtable. When the memtable fills, it gets flushed to an immutable SSTable on GFS. Reads merge the memtable with all the SSTables for the tablet, using bloom filters to skip irrelevant SSTables. Major compactions rewrite multiple SSTables into one, dropping deleted and obsolete cells.

Coordination is delegated to Chubby (the Paxos-based lock service, also a Google paper). Bigtable itself has no consensus; the master uses Chubby to elect a single master and to serialise schema changes, and tablet servers use Chubby for liveness.

The locality-group trick. A locality group is a set of column families stored together in the same SSTable. Bigtable's genius is letting different locality groups have different storage settings — one can be Snappy-compressed and pinned to RAM, another can be on-disk and compressed with gzip. The crawler used this to keep page metadata (frequently accessed) hot, while page contents (multi-MB, rarely re-read) sat compressed on disk in the same logical row.

Contributions

The data model. Sparse map with row-major sorting plus column families became the template for HBase (2008), Cassandra (2008), and Accumulo (2008). The model is rich enough to express most operational workloads, simple enough to scale linearly.

The LSM-tree at scale. The 1996 LSM-tree paper had been read by few; Bigtable made it the de facto storage engine for everything that doesn't need range scans on cold data. RocksDB, LevelDB, ScyllaDB, RocksDB-on-RocksDB stacks all descend from this design.

Tablet sharding. Range-sharded tablets that split and rebalance automatically, with metadata stored in two layers of a B+tree-style root and metadata table. The same approach drives HBase region servers and CockroachDB ranges.

Production stories. The paper is unusually candid about which services use which knobs. Personalised Search wants long memtable flush intervals; Crawler needs Snappy compression; Earth has cells that are megabytes. This kind of operational evidence is rare in research papers and made the design choices much more credible.

Criticisms and limitations

Bigtable gives up multi-row transactions. The paper acknowledges this and notes that several teams (especially Megastore, a later Google paper) added transaction layers on top. The same trade-off shaped the early NoSQL movement: scale linearly, give up joins and cross-row consistency.

The schema is rigid in subtle ways: column families are defined up front, locality groups are defined up front, timestamps are 64-bit integers chosen by the application. Bigtable on its own doesn't do secondary indexes — Megastore and later F1 and Spanner layer those on top.

GFS's single-master design propagates up: Bigtable's master is a single process with Chubby-based failover. The paper notes that the master is "lightly loaded" but in 2006 a tablet server crash storm could still overwhelm it. Modern descendants like HBase have replaced the single master with peer-to-peer designs.

Where it shows up today

Apache HBase is an almost line-for-line reimplementation of Bigtable in Java on top of HDFS. Powers Facebook Messages, Yahoo!, Airbnb event logs, and dozens of telemetry pipelines.

Apache Cassandra borrows the column-family data model and the LSM engine but uses Dynamo-style consistent hashing instead of range tablets. Used by Instagram, Apple, Discord, Reddit.

Google Cloud Bigtable is the same engine, productised. Sits underneath Spanner's storage layer and offers a managed wide-column API.

The data model lives on in Bloomberg's ComDB2, Tigris, ScyllaDB, and the storage engines of most large analytics and time-series stores.

Follow-up reading

The Google File System — Ghemawat et al · 2003 · SOSP. The storage layer underneath Bigtable. Annotated on Semicolony.
The Log-Structured Merge-Tree — O'Neil et al · 1996 · Acta Informatica. The storage engine pattern Bigtable productised. Annotated.
Cassandra — A Decentralized Structured Storage System — Lakshman & Malik · 2010 · LADIS. The Facebook (now Apache) descendant — Dynamo gossip + Bigtable data model.
Megastore: Providing Scalable, Highly Available Storage for Interactive Services — Baker et al · 2011 · CIDR. Bigtable + Paxos transaction layer. The bridge to Spanner.
Spanner: Google's Globally Distributed Database — Corbett et al · 2012 · OSDI. The eventual replacement. Annotated.

More annotated papers

Back to the papers index

Foundational distributed-systems and database papers, read and annotated.

← All papers

Found this useful?