Spanner, planetary SQL.
Google's 2012 paper described the database that holds AdWords. Globally replicated, strongly consistent, ACID across continents — with a physical-time abstraction (TrueTime) that uses GPS and atomic clocks. The paper that made strongly-consistent distributed SQL credible.
TL;DR
Spanner is Google's globally distributed, strongly consistent SQL database. It shards
data into tablets, replicates each tablet with Paxos across datacentres, and
runs transactions with two-phase commit on top of those Paxos groups. The thing that
makes it different is TrueTime: an API that returns the current time
as an interval [earliest, latest] rather than a single value, backed by
GPS receivers and atomic clocks. At commit, a transaction picks a timestamp inside
that interval and then waits out the uncertainty before releasing locks. That wait —
usually 5 to 15 ms — buys external consistency: if T1 commits before T2
begins in real wall-clock time, T1's timestamp is strictly less than T2's. The paper
made strongly-consistent distributed SQL credible at a scale people had previously
assumed required eventual consistency.
The problem
AdWords — Google's revenue engine — ran on a sharded MySQL deployment. Sharding was manual: when a shard got too big or too hot, an engineering project moved data, rewrote application logic, and lived with months of dual-write reconciliation. Every reshard was a release event. The application code knew about shards. Cross-shard transactions were the application's problem.
The alternatives in 2010 were unappealing. Bigtable had the scale but no transactions across rows, no SQL, no schema. Megastore had transactions but with throughput described in the paper as "write rate of a few writes per second per entity group" — fine for user profiles, terrible for ad serving. NoSQL stores were eventually consistent. The teams that built on them spent enormous effort reasoning about read anomalies, write skew, and reconciliation.
Spanner's brief was concrete: take over AdWords. That meant SQL, schemas, indexes, joins, ACID transactions across shards, and synchronous geographic replication for disaster tolerance. With latency budgets that AdWords could actually meet.
The key idea
Most distributed systems treat physical clocks as untrustworthy and use logical clocks (Lamport timestamps, vector clocks, Hybrid Logical Clocks) to order events. Logical time is cheap but doesn't connect to wall-clock reality — you can totally order events but you can't say "this happened before that in real time".
Spanner inverts the assumption. It invests in hardware that lets the system trust
real time, within a known bound. Every datacentre runs a set of time master
machines, each with either a GPS receiver or an atomic clock (with the two
deliberately disagreeing failure modes, so a fault in one is caught by the other).
Every spanserver runs a timeslave daemon that polls the masters, applies
Marzullo's algorithm, and exposes the result as TT.now() returning
{earliest, latest} — an interval that, with overwhelmingly high
probability, brackets actual physical time.
// TrueTime API
TT.now() → {earliest, latest} // current time, as an interval
TT.after(t) → bool // definitely after t? earliest > t
TT.before(t) → bool // definitely before t? latest < t
// ε = (latest - earliest) / 2 typically 1–7 ms in productionThe width of that interval (the paper calls it 2ε, ε being half-width) is the uncertainty budget. In production it ran around 1 ms most of the time, climbing to about 7 ms when a time master had been unreachable for a while. The paper plots the distribution and shows it is heavily concentrated at the low end.
The commit wait is where the trick pays off. When a transaction commits,
the coordinator picks a timestamp s ≥ TT.now().latest and then waits
until TT.now().earliest > s before releasing the transaction's
locks. By the time another transaction can possibly see this one's effects, real
time has provably advanced past s. So any later transaction in real
time gets a strictly greater timestamp. This is external consistency: a real-time
total order on commits, not just a serialisable schedule.
Contributions
- A globally distributed, strongly-consistent database, in production. Before Spanner, the prevailing wisdom (Brewer's CAP, Vogels-era eventual consistency, the NoSQL backlash against ACID) was that strict serializability across continents was either impossible or unaffordable. Spanner ran AdWords on it.
- TrueTime as a first-class primitive. Physical time exposed as a
bounded interval, not a scalar. The API is small (
TT.now,TT.after,TT.before) but the implication — that uncertainty is explicit and finite — restructures how the rest of the system is built. - External consistency as the consistency target. Strict serializability plus a real-time order on non-overlapping transactions. Stronger than serializability, which only guarantees some equivalent serial order.
- The commit-wait technique. A protocol-level recipe for turning
clock uncertainty into a correctness property: pick a timestamp past
latest, wait untilearliestovertakes it, release locks. Cheap, local, and provably correct given TrueTime's bound. - Sharded SQL with full ACID, including joins and secondary indexes.
The schema language adds
INTERLEAVE IN PARENTfor hierarchical co-location, which lets parent and child rows sit in the same Paxos group and makes the common cross-table transaction effectively single-shard. Snapshot reads at a chosen timestamp serve analytical queries without locks. - Lock-free snapshot reads at arbitrary past timestamps. Pick any
tin the past; the system reads a consistent snapshot as oftwith no coordination, because tablets retain enough version history.
Criticisms and caveats
- TrueTime needs specialised hardware. GPS antennas on the roof of every datacentre, with atomic-clock backups, and a tight operational discipline to keep ε small. Most companies can't deploy this — which is why the systems that followed (CockroachDB, YugabyteDB) had to find a software-only substitute.
- Commit wait adds latency to every write. Typically 5 to 15 ms. That's tolerable for AdWords and most OLTP workloads, but for very latency-sensitive paths it is a real cost — a write that would otherwise complete in 1 ms now takes an order of magnitude longer.
- Two-phase commit across Paxos groups is expensive. Each participant runs Paxos to durably log its vote, so a multi-shard transaction is a cross-product of Paxos rounds. The paper is candid that most production traffic is single-shard; the architecture optimises for the common case and accepts the tail cost for the rare one.
- The Paxos described is not textbook Paxos. It's Multi-Paxos with time-based leader leases, pipelined log proposals, and a few other engineering details. The paper sketches them rather than specifying them; the reader has to take some details on faith. The Raft paper (Ongaro, 2014), partly motivated by this kind of ambiguity, is much more precise.
- External consistency is overkill for some workloads. Plenty of applications don't need a real-time total order; they just need session causality or read-your-writes. Paying commit wait for ordering you don't use is wasted budget.
Where it shows up today
| System | Relationship to Spanner |
|---|---|
| CockroachDB | Explicitly modelled on Spanner, but uses Hybrid Logical Clocks rather than TrueTime so it runs on commodity NTP. Trades a tighter clock guarantee for portability — readers have to handle uncertainty windows in software with ReadWithinUncertaintyIntervalError retries. |
| YugabyteDB | Also Spanner-inspired, also HLC-based. PostgreSQL-compatible front end on top of a DocDB layer using Raft per tablet. |
| TiDB | Percolator-style transactions (an earlier Google paper) plus Raft replication. Same overall shape — sharded SQL with global ACID — different consensus and ordering recipe. |
| Cloud Spanner / AlloyDB | Spanner is now a Google Cloud product. AlloyDB is Google's managed PostgreSQL, backed by Spanner-style replicated storage underneath. |
| F1 | The SQL layer that runs AdWords on Spanner. The companion paper (Shute et al, 2013) describes the query engine, online schema changes, and how the migration from MySQL actually happened. |
| FoundationDB | Different lineage (deterministic simulation testing, sequencer-based ordering) but reaches the same target: strict serializability across shards. A useful contrast point if you've read Spanner — they solve the same problem with very different machinery. |
The broader influence is harder to pin to one system. After 2012, "globally consistent SQL" stopped being a hypothetical, and the design space for new transactional databases shifted. The pattern of Paxos/Raft per shard, 2PC across shards, snapshot reads at an MVCC timestamp is now close to a default for new distributed SQL engines.
Follow-up reading
- Shute et al, F1: A Distributed SQL Database That Scales (2013) — the companion paper. F1 is the SQL layer that drives AdWords; this is the half of the story Spanner doesn't tell.
- Bacon et al, Spanner: Becoming a SQL System (2017) — five years later. How the query optimiser, schema management, and operational story evolved as Spanner became a general-purpose database.
- Kulkarni et al, Logical Physical Clocks and Consistent Snapshots (2014) — Hybrid Logical Clocks. The software-only answer to TrueTime that CockroachDB and YugabyteDB ended up adopting.
- Helland, Building on Quicksand (2009) — the philosophical backdrop. Why distributed systems used to give up on strong consistency, and what assumptions Spanner had to break.
- Paxos Made Simple — Lamport, annotated — the underlying consensus protocol. Spanner uses Multi-Paxos with leases; this is the canonical reference.
- Time, Clocks, and the Ordering of Events — Lamport, annotated — the original argument for logical clocks. Read it to understand what Spanner is rejecting and why TrueTime is interesting.
- F1 — Shute et al 2013, annotated — companion annotation, the SQL side of the story.