Replication
Replication keeps copies of the same data on more than one node: for fault tolerance, for read throughput, for low-latency reads near users, and for offline analytics. Every replicated system makes three roughly independent choices: where writes go, how updates reach the copies, and what readers are allowed to see. Get any of them wrong and you end up with split-brain, lost writes, or stale reads that never converge.
Three shapes — where writes go
There are exactly three places writes can land in a replicated system, and every database vendor in production picks one (or blends a couple).
| Shape | Write path | Used by |
|---|---|---|
| Single-leader | All writes go to one leader; followers replicate from it | Postgres, MySQL, MongoDB, Kafka (per partition), Raft-backed systems |
| Multi-leader | Writes accepted by any of N leaders; leaders replicate to each other | Multi-region MySQL, Cassandra (per-key), CRDT-based stores, calendar/notes apps |
| Leaderless | Clients write to (and read from) a quorum of replicas directly | Dynamo, Riak, Cassandra (the original Dynamo-style mode), Voldemort |
Single-leader is the default for a reason: it's the only one with a single source of truth and a clean order for writes. Multi-leader and leaderless trade that simplicity for the ability to keep accepting writes during a network partition.
Synchronous vs asynchronous
Once a leader accepts a write, when does it acknowledge to the client?
- Asynchronous. Leader writes locally, returns success, ships the change to followers in the background. Latency is leader-only. If the leader dies before the change reaches a follower, the write is lost.
- Synchronous. Leader writes locally, waits until at least one follower has the change, then returns success. Latency is leader + slowest sync follower. No acknowledged write is ever lost, as long as the follower didn't lie or also fail.
- Semi-synchronous. Wait for at least one follower; if no follower is
available, downgrade to async (Postgres
ANY 1) or block (MySQLrpl_semi_sync_master_wait_no_slave=ON). The pragmatic middle.
How updates propagate
There are four common ways to ship a write to a replica:
- Statement-based. Replicate the SQL statement. Simple to log, but breaks on non-determinism (NOW(), RAND(), AUTO_INCREMENT under concurrent inserts). MySQL's original replication and Postgres logical with statements.
- Write-ahead log shipping. Replicate the physical WAL records — the exact byte changes the storage engine made to pages. Tightly couples replicas to the engine version (a Postgres 14 follower can't replay WAL from a Postgres 15 leader). Postgres streaming replication.
- Logical / row-based. Replicate the logical changes — "update row X in table T from {a:1} to {a:2}". Decouples versions, lets you replicate to a different schema, lets you pick which tables to replicate. MySQL row-based binlog, Postgres logical decoding, MongoDB oplog.
- Trigger-based. Application or schema-level triggers write change records to an outbox the replicator reads. The fallback for when the engine doesn't support logical decoding.
Replication lag
Async replication produces lag: the gap between "leader has accepted" and "all followers have applied". In steady state lag is milliseconds. Under heavy write load, schema changes, or a slow follower, it grows without bound.
Lag is more than a "stale read" problem. It causes three specific anomalies:
- Reading your own writes. You write to the leader, then read from a lagging follower, and don't see your own change. Fixes: route reads-after-write to the leader; track a per-user "last write LSN" and route to a follower that's caught up past it.
- Monotonic reads. User refreshes; the first request hits a fresh follower and sees the change; the second hits a lagging follower and the change disappears. Fix: pin a session to a single follower.
- Causal consistency violations. A reads X=1, writes Y=A. B reads Y=A, then reads X=0. Fix: track causal dependencies (vector clocks, version vectors, or a single linearisable layer like Spanner / FaunaDB).
Failover and split-brain
When a leader dies, something has to elect a new one. Three approaches in production:
| Approach | Mechanism | Risk |
|---|---|---|
| Manual | Operator promotes a follower | Slow recovery, low risk of split-brain |
| External orchestrator | Patroni, Orchestrator, RDS — watch the leader, promote on miss | Network partition can promote a second leader |
| Built-in consensus | Raft / Paxos within the cluster | Built to be split-brain-safe; slower writes by design |
Split-brain is what happens when two nodes both think they're the leader. Both accept writes. The writes never reconcile. The classic cause: a network partition where followers can reach a new candidate but not the old leader, the candidate gets promoted, the partition heals, and now there are two leaders writing different histories. Fencing (STONITH, lease tokens) is the engineering answer; consensus is the algorithmic one.
Multi-leader and conflict resolution
Multi-leader replication accepts writes at several sites at once. The win: low-latency writes everywhere, plus the system stays writable when a region is offline. The cost: two writes that touch the same key at the same time conflict, and the replication protocol has to resolve them. Three families of resolution:
- Last-writer-wins (LWW). Use wall-clock timestamps; the latest wins. Cheap, but clock skew silently loses writes. Don't use LWW for anything you can't re-derive.
- Merge function. Application-level: "for shopping carts, take the
union of items." Riak's siblings, Cassandra's
cassandra.yamlread-repair. - CRDTs. Conflict-free replicated data types: counters, sets, maps with mathematically-defined merge rules that always converge no matter the order. What modern collaborative editors (Figma, Linear, Notion) use under the hood.
Leaderless reads, with quorums
Dynamo-style systems don't have a leader at all. Clients (or a coordinator) write to N replicas and consider the write successful when W of them ack. Reads go to N replicas and the client takes the value with the highest version once R have responded.
The invariant: W + R > N guarantees that any read sees at least one replica that took part in the latest write. Common knobs:
- N=3, W=2, R=2. Tolerates one replica failure on either side.
- N=3, W=3, R=1. Fast reads, writes block if any replica is down.
- N=3, W=1, R=3. Fast writes, but every read pays for all three.
Concurrent writes still produce conflicts (two clients writing different values at roughly the same moment). Dynamo uses vector clocks to detect them and returns the conflicting siblings to the application. Your code resolves them.
Further reading
- DeCandia et al. — Dynamo: Amazon's Highly Available Key-Value Store — the leaderless, quorum-based design that started a generation of databases.
- Ongaro & Ousterhout — Raft — consensus as a replication protocol, made teachable.
- PostgreSQL — High availability docs — a thorough catalogue of every replication knob a single-leader system actually offers.
- Kingsbury — Strong consistency models — the consistency hierarchy that determines what readers can and can't see.
- Kleppmann — DDIA, ch. 5 — the most thorough textbook treatment of replication.