09 / 14

Internals / 09

Distributed SQL

Spanner, CockroachDB, YugabyteDB, TiDB: a single SQL endpoint that scales horizontally, survives node loss without losing data, and runs ACID transactions across any rows in the cluster. What sharded MySQL gave up to scale, distributed SQL aims to give back. Here is the shape of the architecture, the clock problem, and the price you pay on every commit.

The promise — SQL that doesn't fit on one box.

A traditional MySQL or Postgres deployment fits as much as one machine's CPU, RAM, and disk will hold. When you outgrow it, the standard answer for the last twenty years has been to shard at the application layer: split users by ID, route the right query to the right shard, write code to re-shard when one fills up. ACID stays inside a shard, and anything cross-shard becomes the application's problem.

Distributed SQL aims to give back what sharded MySQL took away. A single SQL endpoint that scales horizontally as you add nodes. Replication that survives losing a node — or a whole datacenter — without losing data. ACID transactions across any rows in the cluster, not just inside an arbitrary partition. Joins that work. Foreign keys that work. The user-facing contract stays the same; the topology underneath changes.

The price is real and shows up on every commit, typically 5–15 ms versus single-node MySQL's sub-millisecond commit. But for OLTP workloads that don't need extreme tail latency, the operational simplicity (no manual sharding, automatic failover, cross-region replicas as a config knob) is usually worth it.

Stateless SQL on top of a distributed KV.

Every modern distributed SQL system is built in two layers. A stateless SQL layer handles parsing, planning, and distributed execution. Underneath, a distributed key-value store handles sharding, replication, and consensus. Spanner has this shape — F1 is the SQL layer, Spanner itself is the KV store. CockroachDB folds both into one binary but keeps the internal boundary. TiDB explicitly separates them: TiDB the SQL server, TiKV the storage. YugabyteDB has YSQL on top of DocDB.

The split matters because the two layers have different scaling stories. The SQL layer is stateless: add nodes to handle more concurrent queries. The KV layer is stateful: add nodes to hold more data, and the system rebalances ranges to use them. You can scale each independently.

The KV interface is narrower than you'd think. Get, put, scan, conditional put, and a transactional batch. Everything SQL needs — index lookups, range scans, joins, updates, deletes — compiles down to that small set. The SQL layer's job is mostly translating a query plan into the right sequence of KV operations.

Range sharding, not hash.

Distributed databases split data either by hashing the key (each shard owns a random fraction of the keyspace) or by ranges (each shard owns a contiguous interval, like [users/A, users/F)). Hash sharding spreads load evenly and was the default for NoSQL stores like DynamoDB and Cassandra. Range sharding preserves locality — rows with adjacent keys live on the same shard, so a WHERE id BETWEEN 100 AND 200 scan hits one shard instead of fanning out to all of them.

Distributed SQL systems almost universally pick range sharding because SQL workloads do a lot of range scans (ORDER BY + LIMIT, foreign-key joins, secondary index lookups). The downside, hot ranges from monotonic keys, is solved by auto-splitting. CockroachDB and Spanner watch range size and access patterns. A range that exceeds ~512 MB or absorbs too much traffic gets split in half. Cold neighbouring ranges merge back.

Raft per range — many tiny consensus groups.

Each range is replicated across three or five nodes via a consensus protocol — Raft in CockroachDB, YugabyteDB, and TiKV; Paxos in Spanner. The replicas form a per-range consensus group. Writes go through the leader, which appends to the Raft log and waits for a quorum of followers to ack before considering the write committed. Failure of one replica is tolerated transparently; failure of the leader triggers a Raft election that usually completes in under a second.

Reads have a choice. The leader can serve a linearisable read, the strongest guarantee, but it has to confirm leadership with a heartbeat first. Any replica can serve a snapshot read at a slightly older timestamp, a weaker guarantee, but no extra round-trip. Production workloads usually mix both: balance-check from the leader, analytics dashboard from a follower.

A mid-size cluster has tens of thousands of ranges, which means tens of thousands of independent Raft groups. The leader election and heartbeat traffic for that many groups is non-trivial — CockroachDB and TiKV both implement Raft batching to amortise heartbeats across groups that share the same nodes.

The clock problem — TrueTime vs HLC.

SQL transactions need a total order. If transaction T1 commits before T2 starts, any read in T2 must see T1's writes. On a single node, the database keeps a logical counter and everyone agrees. Across nodes, you need a shared notion of time, and machine clocks drift.

Spanner's answer is hardware. Google deployed GPS receivers and atomic clocks in every datacenter and exposed TrueTime: an API that returns not a single timestamp but an interval [earliest, latest] guaranteed to contain real time. The uncertainty window ε is small, typically 1–7 ms in production, but non-zero, and the transaction protocol explicitly waits it out. See our Spanner summary for the full mechanism.

CockroachDB and YugabyteDB don't get to ship GPS receivers with their software, so they use Hybrid Logical Clocks instead, a pure-software scheme that combines NTP wall time with a Lamport-style logical counter. HLC gives you sub-millisecond uncertainty without special hardware, at the cost of occasionally needing to abort and retry transactions that fall inside the uncertainty window. F1 showed how the SQL layer on top consumes these timestamps.

HLC vs TrueTime in one line. TrueTime makes the uncertainty window bounded by hardware (ε ≈ 1–7 ms) and pays it as commit-wait latency. HLC makes the uncertainty window unbounded in principle but typically sub-millisecond, and pays it as transaction restarts when conflicts land inside the window. Both end up at similar latencies for single-region workloads; TrueTime pulls ahead for cross-region externally-consistent reads.

Distributed transactions and the commit tax.

A transaction touching N ranges is a distributed transaction. The classical protocol is two-phase commit: a prepare round (each participant locks the rows and acks), then a commit round (each participant durably commits). Two round-trips. On a single-region cluster with 1 ms inter-node RTT, that's a 4 ms floor before you've done any real work.

Modern systems aggressively optimise this. CockroachDB's parallel commits protocol eliminates the prepare round for the common case. The transaction is considered committed as soon as the writes are durably staged, and the commit record is asynchronously cleaned up. Result: 1-RTT distributed transactions in the happy path, comparable to a single-range write. Spanner uses commit wait: pause for the ε interval after acquiring locks before releasing them, which serialises any reader that observes those writes.

Calvin takes a different path entirely: order transactions before executing them, via a deterministic sequencer. Once the order is fixed, every replica can execute independently with no further coordination. See our Calvin summary — FaunaDB productionised this approach.

Why distributed SQL costs 5–15 ms per commit. A single-node Postgres commit is one fsync (~1 ms on SSD). A distributed SQL commit is one Raft round-trip (~2–5 ms in-region) plus, if the transaction crosses ranges, a 2PC coordination (another 2–5 ms), plus any clock-uncertainty wait. That's the tax. For OLTP workloads at moderate scale you barely notice; for tight write loops it matters.

Distributed query execution.

The SQL layer parses the query, builds a logical plan, then turns it into a physical plan made of operators: scan, filter, hash-join, aggregate, sort. In a distributed system the planner has one extra job: decide which node runs which operator. The default is to push operators down to where the data lives. A filter on users.country = 'IN' runs on the nodes that hold those rows, so only the matching rows cross the network.

Joins are where distributed planners earn their keep. A hash join between two large tables that live on different nodes can either ship one table to the other (expensive if both are big) or shuffle both by the join key (cheaper if you have enough nodes). The cost model has to predict network round-trips, not just disk reads. A planner that ignores network cost picks pathologically bad plans on a distributed cluster.

CockroachDB's EXPLAIN ANALYZE (DISTSQL) and TiDB's EXPLAIN ANALYZE both show per-operator placement and network bytes shipped. They are the right tools to reach for when a query that worked on a single node suddenly takes ten seconds on a cluster.

Online schema changes.

Adding a column or building an index on a sharded MySQL deployment historically meant locking the table for minutes, or using something like pt-online-schema-change that copies the table behind your back. Distributed SQL systems do schema changes online, without blocking writes, using a state machine first described in the F1 paper: DELETE_ONLY → WRITE_ONLY → REORG → READ_WRITE. Each replica advances through the states asynchronously; the index is invisible until every replica is in READ_WRITE.

The details are subtle. See our F1 summary for the full state machine and why each intermediate state exists. CockroachDB, TiDB, and YugabyteDB all implement variants of the same protocol.

Schema and locality — where rows live.

A distributed SQL system gives you knobs at the schema level to control which rows live in which region. CockroachDB's REGIONAL BY ROW and Spanner's INTERLEAVE are typical examples. The point is to keep a user's data close to that user, so a read from London hits the EU replica instead of crossing the Atlantic.

-- CockroachDB: a row's region is a column
CREATE TABLE users (
  id        UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  region    crdb_internal_region NOT NULL,
  email     STRING NOT NULL,
  created   TIMESTAMPTZ DEFAULT now()
) LOCALITY REGIONAL BY ROW AS region;

-- a read from EU-WEST stays in EU-WEST:
SELECT * FROM users WHERE id = $1 AND region = 'eu-west-1';

The placement constraint is enforced at the range level — rows with region = 'eu-west-1' live in a range whose leaseholder is in EU-WEST. Cross-region reads still work, but they pay the round-trip. Spanner does the same with named placement policies, and TiDB has placement rules at the table partition level.

-- CockroachDB's parallel commits — the happy path is 1 RTT
BEGIN;
  UPSERT INTO accounts (id, balance) VALUES (1, 100);  -- range A
  UPSERT INTO accounts (id, balance) VALUES (2, 200);  -- range B
COMMIT;

-- internally, the transaction coordinator:
--   1. stages both writes in parallel (1 Raft RTT)
--   2. writes the transaction record as STAGING
--   3. returns COMMIT to the client immediately
--   4. asynchronously marks the txn record as COMMITTED
-- no separate PREPARE round → 1 RTT vs classic 2PC's 2 RTTs

The vendor landscape.

System	SQL dialect	Clock	Notes
Spanner	Google SQL / Postgres	TrueTime (GPS + atomic)	Managed only. Externally consistent global transactions. Expensive but the gold standard.
CockroachDB	Postgres-compatible	HLC	Self-hostable. Parallel commits. Strong locality controls (REGIONAL BY ROW).
YugabyteDB	YSQL (Postgres) + YCQL (Cassandra)	HLC	Reuses Postgres's query layer wholesale. Dual API for migrations from Cassandra.
TiDB	MySQL-compatible	Centralised PD timestamps	Separate row engine (TiKV) and column engine (TiFlash). Strong story for HTAP.
AlloyDB	Postgres	Hybrid	Postgres front-end on Spanner-style disaggregated storage. Single-region focus.
FaunaDB	FQL / GraphQL	Calvin-style deterministic order	Order-then-execute, no 2PC. Strong consistency without commit-wait.

When not to reach for distributed SQL.

Single-region OLTP under ~50k QPS. A well-tuned Postgres or MySQL instance with a read replica handles this comfortably at lower latency and cost. The operational complexity of a distributed cluster is overhead you don't need.
Analytical workloads. Distributed SQL is optimised for short OLTP transactions. For multi-second aggregations over billions of rows, a column store (ClickHouse, BigQuery, Snowflake, TiFlash) is one or two orders of magnitude faster.
Microservice-per-database patterns. If each service owns a small, isolated dataset with no cross-service joins, the simplicity of one Postgres per service usually beats a shared distributed cluster.
Workloads dominated by tight write loops. The 5–15 ms commit tax adds up when a request commits ten times. Either batch the writes into one transaction, or use a single-node database that fsyncs in under a millisecond.

Distributed SQL

The promise — SQL that doesn't fit on one box.

Stateless SQL on top of a distributed KV.

Range sharding, not hash.

Raft per range — many tiny consensus groups.

The clock problem — TrueTime vs HLC.

Distributed transactions and the commit tax.

Distributed query execution.

Online schema changes.

Schema and locality — where rows live.

The vendor landscape.

When not to reach for distributed SQL.

Further reading.

10 — The page cache