Distributed SQL
Spanner, CockroachDB, YugabyteDB, TiDB: a single SQL endpoint that scales horizontally, survives node loss without losing data, and runs ACID transactions across any rows in the cluster. What sharded MySQL gave up to scale, distributed SQL aims to give back. Here is the shape of the architecture, the clock problem, and the price you pay on every commit.
The promise — SQL that doesn't fit on one box.
A traditional MySQL or Postgres deployment fits as much as one machine's CPU, RAM, and disk will hold. When you outgrow it, the standard answer for the last twenty years has been to shard at the application layer: split users by ID, route the right query to the right shard, write code to re-shard when one fills up. ACID stays inside a shard, and anything cross-shard becomes the application's problem.
Distributed SQL aims to give back what sharded MySQL took away. A single SQL endpoint that scales horizontally as you add nodes. Replication that survives losing a node — or a whole datacenter — without losing data. ACID transactions across any rows in the cluster, not just inside an arbitrary partition. Joins that work. Foreign keys that work. The user-facing contract stays the same; the topology underneath changes.
The price is real and shows up on every commit, typically 5–15 ms versus single-node MySQL's sub-millisecond commit. But for OLTP workloads that don't need extreme tail latency, the operational simplicity (no manual sharding, automatic failover, cross-region replicas as a config knob) is usually worth it.
Stateless SQL on top of a distributed KV.
Every modern distributed SQL system is built in two layers. A stateless SQL layer handles parsing, planning, and distributed execution. Underneath, a distributed key-value store handles sharding, replication, and consensus. Spanner has this shape — F1 is the SQL layer, Spanner itself is the KV store. CockroachDB folds both into one binary but keeps the internal boundary. TiDB explicitly separates them: TiDB the SQL server, TiKV the storage. YugabyteDB has YSQL on top of DocDB.
The split matters because the two layers have different scaling stories. The SQL layer is stateless: add nodes to handle more concurrent queries. The KV layer is stateful: add nodes to hold more data, and the system rebalances ranges to use them. You can scale each independently.
Range sharding, not hash.
Distributed databases split data either by hashing the key (each shard owns a random
fraction of the keyspace) or by ranges (each shard owns a contiguous interval, like
[users/A, users/F)). Hash sharding spreads load evenly and was the default for
NoSQL stores like DynamoDB and Cassandra. Range sharding preserves locality — rows with
adjacent keys live on the same shard, so a WHERE id BETWEEN 100 AND 200 scan
hits one shard instead of fanning out to all of them.
Distributed SQL systems almost universally pick range sharding because SQL workloads do a lot of range scans (ORDER BY + LIMIT, foreign-key joins, secondary index lookups). The downside, hot ranges from monotonic keys, is solved by auto-splitting. CockroachDB and Spanner watch range size and access patterns. A range that exceeds ~512 MB or absorbs too much traffic gets split in half. Cold neighbouring ranges merge back.
Raft per range — many tiny consensus groups.
Each range is replicated across three or five nodes via a consensus protocol — Raft in CockroachDB, YugabyteDB, and TiKV; Paxos in Spanner. The replicas form a per-range consensus group. Writes go through the leader, which appends to the Raft log and waits for a quorum of followers to ack before considering the write committed. Failure of one replica is tolerated transparently; failure of the leader triggers a Raft election that usually completes in under a second.
Reads have a choice. The leader can serve a linearisable read, the strongest guarantee, but it has to confirm leadership with a heartbeat first. Any replica can serve a snapshot read at a slightly older timestamp, a weaker guarantee, but no extra round-trip. Production workloads usually mix both: balance-check from the leader, analytics dashboard from a follower.
A mid-size cluster has tens of thousands of ranges, which means tens of thousands of independent Raft groups. The leader election and heartbeat traffic for that many groups is non-trivial — CockroachDB and TiKV both implement Raft batching to amortise heartbeats across groups that share the same nodes.
The clock problem — TrueTime vs HLC.
SQL transactions need a total order. If transaction T1 commits before T2 starts, any read in T2 must see T1's writes. On a single node, the database keeps a logical counter and everyone agrees. Across nodes, you need a shared notion of time, and machine clocks drift.
Spanner's answer is hardware. Google deployed GPS receivers and atomic clocks in every
datacenter and exposed TrueTime: an API that returns not a single timestamp but an
interval [earliest, latest] guaranteed to contain real time. The uncertainty
window ε is small, typically 1–7 ms in production, but non-zero, and the
transaction protocol explicitly waits it out. See our Spanner
summary for the full mechanism.
CockroachDB and YugabyteDB don't get to ship GPS receivers with their software, so they use Hybrid Logical Clocks instead, a pure-software scheme that combines NTP wall time with a Lamport-style logical counter. HLC gives you sub-millisecond uncertainty without special hardware, at the cost of occasionally needing to abort and retry transactions that fall inside the uncertainty window. F1 showed how the SQL layer on top consumes these timestamps.
Distributed transactions and the commit tax.
A transaction touching N ranges is a distributed transaction. The classical protocol is two-phase commit: a prepare round (each participant locks the rows and acks), then a commit round (each participant durably commits). Two round-trips. On a single-region cluster with 1 ms inter-node RTT, that's a 4 ms floor before you've done any real work.
Modern systems aggressively optimise this. CockroachDB's parallel commits protocol eliminates the prepare round for the common case. The transaction is considered committed as soon as the writes are durably staged, and the commit record is asynchronously cleaned up. Result: 1-RTT distributed transactions in the happy path, comparable to a single-range write. Spanner uses commit wait: pause for the ε interval after acquiring locks before releasing them, which serialises any reader that observes those writes.
Calvin takes a different path entirely: order transactions before executing them, via a deterministic sequencer. Once the order is fixed, every replica can execute independently with no further coordination. See our Calvin summary — FaunaDB productionised this approach.
Distributed query execution.
The SQL layer parses the query, builds a logical plan, then turns it into a physical plan
made of operators: scan, filter, hash-join, aggregate, sort. In a distributed system the
planner has one extra job: decide which node runs which operator. The default is to push
operators down to where the data lives. A filter on users.country = 'IN'
runs on the nodes that hold those rows, so only the matching rows cross the network.
Joins are where distributed planners earn their keep. A hash join between two large tables that live on different nodes can either ship one table to the other (expensive if both are big) or shuffle both by the join key (cheaper if you have enough nodes). The cost model has to predict network round-trips, not just disk reads. A planner that ignores network cost picks pathologically bad plans on a distributed cluster.
CockroachDB's EXPLAIN ANALYZE (DISTSQL) and TiDB's EXPLAIN ANALYZE
both show per-operator placement and network bytes shipped. They are the right tools to
reach for when a query that worked on a single node suddenly takes ten seconds on a
cluster.
Online schema changes.
Adding a column or building an index on a sharded MySQL deployment historically meant
locking the table for minutes, or using something like pt-online-schema-change
that copies the table behind your back. Distributed SQL systems do schema changes online,
without blocking writes, using a state machine first described in the F1 paper:
DELETE_ONLY → WRITE_ONLY → REORG → READ_WRITE. Each replica advances through the
states asynchronously; the index is invisible until every replica is in READ_WRITE.
The details are subtle. See our F1 summary for the full state machine and why each intermediate state exists. CockroachDB, TiDB, and YugabyteDB all implement variants of the same protocol.
Schema and locality — where rows live.
A distributed SQL system gives you knobs at the schema level to control which rows live in
which region. CockroachDB's REGIONAL BY ROW and Spanner's INTERLEAVE
are typical examples. The point is to keep a user's data close to that user, so a read
from London hits the EU replica instead of crossing the Atlantic.
-- CockroachDB: a row's region is a column
CREATE TABLE users (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
region crdb_internal_region NOT NULL,
email STRING NOT NULL,
created TIMESTAMPTZ DEFAULT now()
) LOCALITY REGIONAL BY ROW AS region;
-- a read from EU-WEST stays in EU-WEST:
SELECT * FROM users WHERE id = $1 AND region = 'eu-west-1';The placement constraint is enforced at the range level — rows with region =
'eu-west-1' live in a range whose leaseholder is in EU-WEST. Cross-region reads still
work, but they pay the round-trip. Spanner does the same with named placement policies, and TiDB
has placement rules at the table partition level.
-- CockroachDB's parallel commits — the happy path is 1 RTT
BEGIN;
UPSERT INTO accounts (id, balance) VALUES (1, 100); -- range A
UPSERT INTO accounts (id, balance) VALUES (2, 200); -- range B
COMMIT;
-- internally, the transaction coordinator:
-- 1. stages both writes in parallel (1 Raft RTT)
-- 2. writes the transaction record as STAGING
-- 3. returns COMMIT to the client immediately
-- 4. asynchronously marks the txn record as COMMITTED
-- no separate PREPARE round → 1 RTT vs classic 2PC's 2 RTTsThe vendor landscape.
| System | SQL dialect | Clock | Notes |
|---|---|---|---|
| Spanner | Google SQL / Postgres | TrueTime (GPS + atomic) | Managed only. Externally consistent global transactions. Expensive but the gold standard. |
| CockroachDB | Postgres-compatible | HLC | Self-hostable. Parallel commits. Strong locality controls (REGIONAL BY ROW). |
| YugabyteDB | YSQL (Postgres) + YCQL (Cassandra) | HLC | Reuses Postgres's query layer wholesale. Dual API for migrations from Cassandra. |
| TiDB | MySQL-compatible | Centralised PD timestamps | Separate row engine (TiKV) and column engine (TiFlash). Strong story for HTAP. |
| AlloyDB | Postgres | Hybrid | Postgres front-end on Spanner-style disaggregated storage. Single-region focus. |
| FaunaDB | FQL / GraphQL | Calvin-style deterministic order | Order-then-execute, no 2PC. Strong consistency without commit-wait. |
When not to reach for distributed SQL.
- Single-region OLTP under ~50k QPS. A well-tuned Postgres or MySQL instance with a read replica handles this comfortably at lower latency and cost. The operational complexity of a distributed cluster is overhead you don't need.
- Analytical workloads. Distributed SQL is optimised for short OLTP transactions. For multi-second aggregations over billions of rows, a column store (ClickHouse, BigQuery, Snowflake, TiFlash) is one or two orders of magnitude faster.
- Microservice-per-database patterns. If each service owns a small, isolated dataset with no cross-service joins, the simplicity of one Postgres per service usually beats a shared distributed cluster.
- Workloads dominated by tight write loops. The 5–15 ms commit tax adds up when a request commits ten times. Either batch the writes into one transaction, or use a single-node database that fsyncs in under a millisecond.
Further reading.
- Spanner — Google's Globally-Distributed Database (OSDI 2012) — TrueTime, commit-wait, and the externally-consistent transaction protocol.
- F1 — A Distributed SQL Database That Scales (VLDB 2013) — the SQL layer on top of Spanner, online schema changes, and what it took to migrate Google AdWords off MySQL.
- Calvin — Fast Distributed Transactions for Partitioned Database Systems (SIGMOD 2012) — deterministic ordering as an alternative to 2PC. The basis of FaunaDB.
- CockroachDB — the Resilient Geo-Distributed SQL Database (SIGMOD 2020) — parallel commits, range leases, HLC clock model, and the production architecture.
- TiDB — a Raft-based HTAP Database (VLDB 2020) — how TiDB splits OLTP (TiKV) and OLAP (TiFlash) on the same Raft log.
- YugabyteDB — distributed SQL architecture — how YSQL reuses Postgres's query layer and what they had to rewrite.
- CockroachLabs — Parallel Commits, explained — the engineering blog post that walks through the optimisation step by step.