01 / 20
Topics / 01

Replication

Replication keeps copies of the same data on more than one node: for fault tolerance, for read throughput, for low-latency reads near users, and for offline analytics. Every replicated system makes three roughly independent choices: where writes go, how updates reach the copies, and what readers are allowed to see. Get any of them wrong and you end up with split-brain, lost writes, or stale reads that never converge.


Three shapes — where writes go

There are exactly three places writes can land in a replicated system, and every database vendor in production picks one (or blends a couple).

ShapeWrite pathUsed by
Single-leader All writes go to one leader; followers replicate from it Postgres, MySQL, MongoDB, Kafka (per partition), Raft-backed systems
Multi-leader Writes accepted by any of N leaders; leaders replicate to each other Multi-region MySQL, Cassandra (per-key), CRDT-based stores, calendar/notes apps
Leaderless Clients write to (and read from) a quorum of replicas directly Dynamo, Riak, Cassandra (the original Dynamo-style mode), Voldemort

Single-leader is the default for a reason: it's the only one with a single source of truth and a clean order for writes. Multi-leader and leaderless trade that simplicity for the ability to keep accepting writes during a network partition.

Synchronous vs asynchronous

Once a leader accepts a write, when does it acknowledge to the client?

  • Asynchronous. Leader writes locally, returns success, ships the change to followers in the background. Latency is leader-only. If the leader dies before the change reaches a follower, the write is lost.
  • Synchronous. Leader writes locally, waits until at least one follower has the change, then returns success. Latency is leader + slowest sync follower. No acknowledged write is ever lost, as long as the follower didn't lie or also fail.
  • Semi-synchronous. Wait for at least one follower; if no follower is available, downgrade to async (Postgres ANY 1) or block (MySQL rpl_semi_sync_master_wait_no_slave=ON). The pragmatic middle.
The "f+1 of 2f+1" formulation. Consensus systems (Raft, Paxos) generalise this: a write is durable when it's on a majority of replicas. With three replicas, two must ack. With five, three must ack. Tolerates f failures with 2f+1 nodes, a different way to spell synchronous replication.

How updates propagate

There are four common ways to ship a write to a replica:

  1. Statement-based. Replicate the SQL statement. Simple to log, but breaks on non-determinism (NOW(), RAND(), AUTO_INCREMENT under concurrent inserts). MySQL's original replication and Postgres logical with statements.
  2. Write-ahead log shipping. Replicate the physical WAL records — the exact byte changes the storage engine made to pages. Tightly couples replicas to the engine version (a Postgres 14 follower can't replay WAL from a Postgres 15 leader). Postgres streaming replication.
  3. Logical / row-based. Replicate the logical changes — "update row X in table T from {a:1} to {a:2}". Decouples versions, lets you replicate to a different schema, lets you pick which tables to replicate. MySQL row-based binlog, Postgres logical decoding, MongoDB oplog.
  4. Trigger-based. Application or schema-level triggers write change records to an outbox the replicator reads. The fallback for when the engine doesn't support logical decoding.

Replication lag

Async replication produces lag: the gap between "leader has accepted" and "all followers have applied". In steady state lag is milliseconds. Under heavy write load, schema changes, or a slow follower, it grows without bound.

Lag is more than a "stale read" problem. It causes three specific anomalies:

  • Reading your own writes. You write to the leader, then read from a lagging follower, and don't see your own change. Fixes: route reads-after-write to the leader; track a per-user "last write LSN" and route to a follower that's caught up past it.
  • Monotonic reads. User refreshes; the first request hits a fresh follower and sees the change; the second hits a lagging follower and the change disappears. Fix: pin a session to a single follower.
  • Causal consistency violations. A reads X=1, writes Y=A. B reads Y=A, then reads X=0. Fix: track causal dependencies (vector clocks, version vectors, or a single linearisable layer like Spanner / FaunaDB).

Failover and split-brain

When a leader dies, something has to elect a new one. Three approaches in production:

ApproachMechanismRisk
ManualOperator promotes a followerSlow recovery, low risk of split-brain
External orchestratorPatroni, Orchestrator, RDS — watch the leader, promote on missNetwork partition can promote a second leader
Built-in consensusRaft / Paxos within the clusterBuilt to be split-brain-safe; slower writes by design

Split-brain is what happens when two nodes both think they're the leader. Both accept writes. The writes never reconcile. The classic cause: a network partition where followers can reach a new candidate but not the old leader, the candidate gets promoted, the partition heals, and now there are two leaders writing different histories. Fencing (STONITH, lease tokens) is the engineering answer; consensus is the algorithmic one.

Why Raft makes failover easy. Raft needs a majority quorum to elect a leader. In a partition, only the side with a majority can elect; the minority side refuses writes. No split-brain by construction. Cost: writes need a majority of nodes to ack, so you can't accept writes during a majority partition.

Multi-leader and conflict resolution

Multi-leader replication accepts writes at several sites at once. The win: low-latency writes everywhere, plus the system stays writable when a region is offline. The cost: two writes that touch the same key at the same time conflict, and the replication protocol has to resolve them. Three families of resolution:

  • Last-writer-wins (LWW). Use wall-clock timestamps; the latest wins. Cheap, but clock skew silently loses writes. Don't use LWW for anything you can't re-derive.
  • Merge function. Application-level: "for shopping carts, take the union of items." Riak's siblings, Cassandra's cassandra.yaml read-repair.
  • CRDTs. Conflict-free replicated data types: counters, sets, maps with mathematically-defined merge rules that always converge no matter the order. What modern collaborative editors (Figma, Linear, Notion) use under the hood.

Leaderless reads, with quorums

Dynamo-style systems don't have a leader at all. Clients (or a coordinator) write to N replicas and consider the write successful when W of them ack. Reads go to N replicas and the client takes the value with the highest version once R have responded.

The invariant: W + R > N guarantees that any read sees at least one replica that took part in the latest write. Common knobs:

  • N=3, W=2, R=2. Tolerates one replica failure on either side.
  • N=3, W=3, R=1. Fast reads, writes block if any replica is down.
  • N=3, W=1, R=3. Fast writes, but every read pays for all three.

Concurrent writes still produce conflicts (two clients writing different values at roughly the same moment). Dynamo uses vector clocks to detect them and returns the conflicting siblings to the application. Your code resolves them.

Further reading

Found this useful?