Split-Brain Simulator: when the network breaks in two.

Split-brain is what happens when a network partition lets two halves of a cluster both believe they are in charge and accept conflicting writes. Five nodes here are drawn as a ring. Click a link to partition the network. Click a node to send a write through it. Toggle between three system designs and watch each one behave: quorum refuses writes on the minority side; unsafe mode corrupts both sides; last-write-wins lets clocks decide.

Reset. 5 nodes, all links up. Node 3 is leader at term 1.

Per-node state

id	role	term	reachable	last log
N1	follower	1	5/5	—
N2	follower	1	5/5	—
N3	leader	1	5/5	—
N4	follower	1	5/5	—
N5	follower	1	5/5	—

side 1: N1, N2, N3, N4, N5 (5/5) majority · quorum ok

Try this

Click Split 3-2. In quorum mode, click N1 to write — refused. Click N3 — committed. Click Heal all. The minority side catches up.
Switch to Unsafe, click Split 3-2, then write through N1 and N3. Both sides accept. The conflicts counter climbs.
Click Split 2-2-1. Now nothing has quorum. Every write in quorum mode is refused; LWW still takes them.
With the cluster split, click Kill leader. Watch a new leader emerge on the majority side at a higher term.
In LWW, write through both sides, then heal. Highest timestamp wins — but the timestamps include simulated clock skew, so the "wrong" write can win.

What CAP actually trades

CAP isn't "pick 2 of 3." Brewer's 2012 clarification: during a network partition, you must choose between consistency and availability. Without a partition, you have both.

Quorum systems (etcd, ZooKeeper, Consul, and anything else built on Raft, Zab, or Paxos) pick C: the minority side becomes unavailable for writes. AP systems (Dynamo, Cassandra, Riak) pick A: both sides take writes and reconcile later. Many production systems are configurable per operation: Cassandra, for example, lets you choose a consistency level such as QUORUM or ONE for each read and each write.
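The per-operation choice comes down to simple arithmetic: a read is guaranteed to overlap the latest acknowledged write only when the replicas read plus the replicas written exceed the replication factor. The sketch below is an illustration of that rule, not Cassandra's implementation; the function names are hypothetical, and it ignores repair, hinted handoff, and multi-datacenter levels.

```python
def replicas_required(level, replication_factor):
    """Replicas that must respond for a given consistency level (simplified)."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported level: {level}")


def read_overlaps_write(read_level, write_level, replication_factor):
    """True when every read set must share a replica with every write set."""
    r = replicas_required(read_level, replication_factor)
    w = replicas_required(write_level, replication_factor)
    return r + w > replication_factor


if __name__ == "__main__":
    for read_level, write_level in [("QUORUM", "QUORUM"), ("ONE", "ONE"), ("ONE", "ALL")]:
        print(read_level, write_level, read_overlaps_write(read_level, write_level, 3))
```

With a replication factor of 3, QUORUM reads plus QUORUM writes overlap (2 + 2 > 3), while ONE plus ONE does not, which is the availability-for-consistency trade made one request at a time.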

What you're looking at

The ring is five nodes. Green links are up; dashed red ones are cut. Each node's badge shows its role — L for leader, F for follower, × for dead — and its term. The green tint marks the side that holds a majority, and the highlighted node is the leader. Click a link to partition or heal the network; click a node to push a client write through it. The mode buttons switch the system's design: quorum (Raft-style), unsafe split-brain, and last-write-wins. The counters track which side has quorum, how many nodes can take writes, and the running conflict count.

Hit Split 3-2 in quorum mode, then write through N1 on the minority side — refused — and through N3 on the majority side — committed; heal, and the minority catches up. The surprise is what the other modes do with the same cut. Unsafe mode commits on both sides and the conflict counter climbs; last-write-wins takes every write and resolves on heal by timestamp, including the simulated clock skew, so the later write can quietly lose. Try Split 2-2-1: now no side has a majority and every quorum write fails. That refusal is the availability cost the safe design pays on purpose.
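The quorum decision described above can be sketched in a few lines: group the nodes by the links that are still up, then ask whether any group holds a strict majority. This is a minimal sketch, not the simulator's actual code; which ring links each Split button cuts is an assumption chosen so that N1 lands on the minority side and N3 on the majority side.

```python
from collections import deque

NODES = [1, 2, 3, 4, 5]
RING = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 1)]


def cut(links, *down):
    """Return the links that remain after the given ones are cut."""
    gone = {frozenset(link) for link in down}
    return [link for link in links if frozenset(link) not in gone]


def components(nodes, links):
    """Group nodes into sides using only the links that are still up."""
    neighbours = {n: set() for n in nodes}
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, sides = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        side, queue = [], deque([start])
        while queue:
            n = queue.popleft()
            side.append(n)
            for m in neighbours[n] - seen:
                seen.add(m)
                queue.append(m)
        sides.append(sorted(side))
    return sides


def writable_side(nodes, links):
    """Return the one side holding a strict majority, or None."""
    for side in components(nodes, links):
        if len(side) > len(nodes) // 2:
            return side
    return None


# Assumed cuts: 3-2 leaves {1,2} and {3,4,5}; 2-2-1 leaves {1,2}, {3,4}, {5}.
split_3_2 = cut(RING, (2, 3), (5, 1))
split_2_2_1 = cut(RING, (2, 3), (4, 5), (5, 1))
```

In quorum mode a write is accepted only if the node it arrives at belongs to `writable_side(...)`; under the 2-2-1 split that function returns `None`, so every write is refused.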

Networks partition. They always partition.

This is not a hypothetical edge case. It's Tuesday.

GitHub's October 21, 2018 incident is the textbook example. Routine maintenance to replace failing optical equipment cut connectivity between GitHub's US East Coast network hub and its primary US East Coast data center for 43 seconds. In that window Orchestrator, the tool managing MySQL failover, promoted replicas in the US West Coast data center while the original East Coast primaries still held writes that had never replicated west. Each side now had writes the other lacked. The partition itself lasted under a minute. The cleanup took just over 24 hours, because reconciliation required manually examining binary logs and rebuilding replicas. GitHub's post-mortem is a masterclass in why a 43-second network blip can cost a day of degraded service.

Cloudflare's June 2019 outage was a BGP route leak that passed through Verizon: a small ISP in Pennsylvania originated more-specific routes for a large chunk of the internet, the routes leaked to Verizon, and Verizon propagated them onward without filtering. A large share of internet traffic tried to route through that small network until the leak was withdrawn. Every distributed system on the affected paths saw it as a partition. Anyone running consensus across regions experienced exactly the scenario this simulator models, only larger.

You don't need a BGP leak to see this. Every SSH session that hangs for 30 seconds before the keepalive recovers it is a network partition between you and the server. Every TCP retransmit you watch in ss -i is a partition the kernel hid for you. Systems that don't have a written story for partitions fail loudly the first time a switch reboots in the middle of the night, not gradually over months. The Jepsen report archive (Kyle Kingsbury, jepsen.io) exists because vendors keep claiming their system handles partitions and keep being wrong about that.

Quorum: why a majority is the only safe number

Three protocols, four decades, one answer.

Lamport's Paxos (1989 tech report, 1998 paper "The Part-Time Parliament"), Liskov and Oki's Viewstamped Replication (1988), and Ongaro and Ousterhout's Raft (2014) all solve the same problem: replicate a log of operations across N nodes such that a majority must acknowledge each entry before it counts as committed. Raft is the one most people implement today because Ongaro's PhD thesis was written deliberately as a teaching artifact — the original Paxos paper was so confusing that Lamport rewrote it twice and people still got it wrong.
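The "majority must acknowledge" rule can be made concrete with the way a Raft-style leader might derive its commit point. This is a simplified sketch: `match_index` is assumed to hold the highest log index known to be stored on each node (the leader included), and the additional Raft rule that the entry must belong to the leader's current term is left out.

```python
def commit_index(match_index):
    """Highest log index that is stored on a strict majority of nodes."""
    ranked = sorted(match_index, reverse=True)
    majority = len(ranked) // 2 + 1
    # The majority-th largest value is present on at least `majority` nodes.
    return ranked[majority - 1]


if __name__ == "__main__":
    # Five nodes: two are fully caught up, one lags slightly, two lag badly.
    print(commit_index([7, 7, 5, 3, 2]))
```

Entries 6 and 7 exist on only two of five nodes, so they are not yet committed; entry 5 is on three, so it is.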

The reason "majority" is the answer and not "any three" or "all five" is set-theoretic. Two disjoint majorities of the same set are impossible: in a 5-node cluster, any majority has at least 3 nodes, and any two majorities must share at least one node. After a partition, at most one side has a majority. The minority side cannot reach quorum and therefore cannot commit anything that would conflict with what the majority side commits. That single property — and not anything about leaders or terms or heartbeats — is what prevents split-brain.
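The intersection argument is small enough to check by brute force. The sketch below enumerates every majority of a 5-node cluster and confirms that each pair shares a node, then shows that groups of two do not have that property. The helper names are illustrative.

```python
from itertools import combinations


def majorities(nodes):
    """Yield every subset of nodes that forms a strict majority."""
    need = len(nodes) // 2 + 1
    for size in range(need, len(nodes) + 1):
        for group in combinations(nodes, size):
            yield frozenset(group)


def all_pairs_intersect(groups):
    """True when no two groups are disjoint."""
    return all(a & b for a, b in combinations(groups, 2))


nodes = [1, 2, 3, 4, 5]
majority_groups = list(majorities(nodes))
pairs_of_two = [frozenset(g) for g in combinations(nodes, 2)]

if __name__ == "__main__":
    print(len(majority_groups), all_pairs_intersect(majority_groups))
    print(all_pairs_intersect(pairs_of_two))
```

The counting version of the same fact: two groups of at least 3 nodes each need at least 6 slots, and there are only 5 nodes, so at least one node sits in both.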

The cost is availability. With N=3 you tolerate 1 failure; with N=5 you tolerate 2. Beyond f = ⌊(N-1)/2⌋ simultaneous failures, the system stops accepting writes. A deeper limit sits underneath: the FLP impossibility result (Fischer, Lynch, Paterson 1985) shows that no deterministic consensus protocol in a fully asynchronous system can guarantee termination if even a single process may crash. Every practical consensus system works around it with randomization (Raft's randomized election timeouts) or by assuming partial synchrony (Dwork, Lynch, Stockmeyer 1988). etcd, Consul, ZooKeeper, CockroachDB, and Spanner all live in this corner.
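The arithmetic behind f = ⌊(N-1)/2⌋ is worth tabulating, because it shows why clusters are usually sized with an odd N. The sketch also includes a randomized election timeout; the 150 to 300 ms range is the illustrative one from the Raft paper, not a recommendation for any particular deployment.

```python
import random


def quorum_size(n):
    """Smallest number of nodes that forms a strict majority of n."""
    return n // 2 + 1


def tolerated_failures(n):
    """Failures the cluster survives while still able to form a quorum."""
    return (n - 1) // 2


def election_timeout_ms(rng, low=150.0, high=300.0):
    """Pick a per-node timeout at random so candidates rarely collide."""
    return rng.uniform(low, high)


if __name__ == "__main__":
    for n in range(3, 8):
        print(f"N={n} quorum={quorum_size(n)} tolerates={tolerated_failures(n)}")
    rng = random.Random()
    print([round(election_timeout_ms(rng)) for _ in range(5)])
```

Going from 3 nodes to 4 raises the quorum from 2 to 3 without tolerating any extra failure, so the fourth node adds cost but no resilience.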

Eventual consistency: the other side of the trade

Dynamo, Cassandra, Riak — and why the clock keeps lying.

Amazon's 2007 Dynamo paper (DeCandia et al., SOSP) made eventual consistency respectable. Dynamo takes writes on any reachable node, even during a partition. Conflicts are resolved later: last-write-wins keyed on a wall clock, or vector clocks that record causal history and surface concurrent updates to the application, or CRDTs that mathematically guarantee any merge order produces the same result. Cassandra (2008, originally Facebook) and Riak (2009) are the open-source heirs. Both are still in heavy production at companies that need writes to never block.
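To make "surface concurrent updates to the application" concrete, here is a minimal vector clock comparison. It is a sketch of the general technique, not Dynamo's or Riak's code; clocks are plain dicts from node id to counter, and a missing entry counts as zero.

```python
def compare(a, b):
    """Return 'equal', 'before', 'after', or 'concurrent' for clocks a and b."""
    keys = set(a) | set(b)
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"   # b has seen everything a has: b supersedes a
    if b_le_a:
        return "after"    # a has seen everything b has: a supersedes b
    return "concurrent"   # neither saw the other: a real conflict


if __name__ == "__main__":
    base = {"N1": 1}
    left = {"N1": 2}               # written through N1 during the partition
    right = {"N1": 1, "N3": 1}     # written through N3 during the partition
    print(compare(base, left), compare(left, right))
```

Both partition-time writes descend from `base`, but neither descends from the other, so the store keeps both versions and hands the conflict to the application instead of silently picking one.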

The trade is that writes stay available, but reads can return stale or conflicting values, and the conflict resolution is your problem. Last-write-wins is the worst of these options because clocks lie. NTP skew can be tens of milliseconds in a well-run datacenter and seconds in a poorly-configured one; leap seconds have crashed production systems (the 2012 leap second caused outages at Reddit, Mozilla, Qantas, LinkedIn). A write stamped by a clock running 100 ms fast will silently beat any write that really happened up to 100 ms later on a node with an accurate clock, and the later write is discarded without an error. Google built TrueTime (Spanner) so they could stop pretending this wasn't a problem.
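The skew problem fits in a dozen lines. In this sketch each write carries the time it really happened and the skew of the clock that stamped it; both fields are illustrative, since a real system only ever sees the resulting timestamp.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    value: str
    true_time_ms: int  # when the write really happened (unknowable in practice)
    skew_ms: int       # how far the stamping node's clock runs ahead

    @property
    def timestamp_ms(self):
        """The only time the system can see: real time plus clock skew."""
        return self.true_time_ms + self.skew_ms


def lww_merge(a, b):
    """Keep the write with the higher timestamp; ties go to the first argument."""
    return a if a.timestamp_ms >= b.timestamp_ms else b


first = Write("old", true_time_ms=1000, skew_ms=100)  # stamped 1100
second = Write("new", true_time_ms=1050, skew_ms=0)   # stamped 1050
winner = lww_merge(first, second)

if __name__ == "__main__":
    print(winner.value)
```

The write that happened 50 ms later loses, and nothing in the merged state records that a value was thrown away.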

CRDTs are the modern correct answer when you can use them. Counters, sets, last-writer-wins registers, ordered sequences (for collaborative text editing): all have well-defined merges that are commutative, associative, and idempotent. Redis Enterprise's Active-Active databases are built on CRDTs; Riak has built-in data types; Yjs and Automerge power much of the "Google Docs but local-first" wave. The catch is that CRDTs only cover operations you can express commutatively. Transfer $100 from account A to B, with the invariant that A never goes negative, is not a CRDT operation. If your business requires transactional invariants, you're back to consensus, and probably to a system in the etcd / Spanner / CockroachDB camp.
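A grow-only counter is the smallest CRDT that shows why merge order does not matter. The sketch below is a textbook state-based G-Counter, not the data type shipped by any of the products named above: each node increments only its own slot, and merging takes the per-node maximum.

```python
class GCounter:
    """State-based grow-only counter: one slot per node, merged by max."""

    def __init__(self, counts=None):
        self.counts = dict(counts or {})

    def increment(self, node, amount=1):
        if amount < 0:
            raise ValueError("a G-Counter can only grow")
        self.counts[node] = self.counts.get(node, 0) + amount

    def merge(self, other):
        """Return a new counter holding the per-node maximum of both states."""
        keys = set(self.counts) | set(other.counts)
        return GCounter(
            {k: max(self.counts.get(k, 0), other.counts.get(k, 0)) for k in keys}
        )

    @property
    def value(self):
        return sum(self.counts.values())


side_a, side_b = GCounter(), GCounter()
side_a.increment("N1", 2)  # two increments land on the minority side
side_b.increment("N3", 3)  # three land on the majority side
healed = side_a.merge(side_b)

if __name__ == "__main__":
    print(healed.value)
```

Merging in either order, or merging the same state twice, gives the same total, which is exactly the property a bank transfer with a balance invariant cannot offer.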

Don't try to detect split-brain. Design to make it impossible.

Fencing tokens, leases, and STONITH.

Once an unsafe system has committed conflicting writes on two sides, recovery is manual. There is no algorithm that can tell you whether the customer's bank transfer at 14:02:17 on side A or the contradictory transfer at 14:02:18 on side B is the one to keep. Designs that prefer to crash (fail-fast, fence everything off) are an order of magnitude easier to operate than designs that try to merge after the fact. This is the lesson behind STONITH ("shoot the other node in the head") from Pacemaker and the Linux-HA project: when in doubt about who is the primary, physically power off the suspected stranded side. Brutal, but unambiguous.

Modern Kubernetes leader election uses leases instead. A controller holds a Lease object (API group coordination.k8s.io); only the lease holder may act as primary; if the controller is partitioned away from the API server, it cannot renew the lease and must stop. The lease store (etcd, behind the API) is the single arbiter: the store never records two holders at once, because etcd's Raft guarantees a single linearizable history of lease assignments. What the store cannot prevent is a paused or partitioned controller that still believes it holds an expired lease, which is why the resource being protected should also check a fencing token. This is the point of Martin Kleppmann's well-known critique of Redlock: distributed locks without fencing tokens are not safe.
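The "cannot renew, must stop" rule has to be enforced by the holder itself, using its own clock. The sketch below shows one conservative way to do that; `LeaseHolder` and `renew_in_store` are hypothetical names, not the Kubernetes client API, and the expiry is measured from when the renewal was sent rather than when the reply arrived.

```python
import time


class LeaseHolder:
    """Tracks, on the holder's own monotonic clock, whether it may still act."""

    def __init__(self, duration_s, clock=time.monotonic):
        self.duration_s = duration_s
        self.clock = clock
        self.expires_at = None

    def try_renew(self, renew_in_store):
        """Call the (hypothetical) store; on success extend the local deadline."""
        sent_at = self.clock()  # measure from send time to stay conservative
        if renew_in_store():
            self.expires_at = sent_at + self.duration_s
            return True
        return False

    def may_act(self):
        """True only while the last confirmed renewal is still in force."""
        return self.expires_at is not None and self.clock() < self.expires_at


if __name__ == "__main__":
    holder = LeaseHolder(duration_s=15)
    holder.try_renew(lambda: True)   # stand-in for a successful API call
    print(holder.may_act())
```

A process can still pause between `may_act()` returning True and the action itself, so a local check like this narrows the window without closing it.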

Kyle Kingsbury's Jepsen reports keep finding the same pattern: when a system claims to handle partitions, the bug is almost always in the recovery path, not the steady-state path. Kingsbury's "Call Me Maybe" series (published under the name aphyr) documents Redis, MongoDB, Elasticsearch, Cassandra, and others losing acknowledged writes under partition. The fix is rarely "detect the partition faster." The fix is "use a fencing token so the late writer's request gets rejected, even if the late writer thinks it's still primary." Build for prevention. Detection is a consolation prize.
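A fencing token check lives in the resource being protected, not in the client. The sketch below assumes the lock or lease service hands out a strictly increasing integer with each grant; `FencedStore` and `StaleTokenError` are illustrative names.

```python
class StaleTokenError(Exception):
    """Raised when a writer presents a token older than one already seen."""


class FencedStore:
    """A store that rejects writes from holders of superseded tokens."""

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token, key, value):
        if token < self.highest_token:
            raise StaleTokenError(
                f"token {token} is older than {self.highest_token}"
            )
        self.highest_token = token
        self.data[key] = value


if __name__ == "__main__":
    store = FencedStore()
    store.write(33, "primary", "node-a")   # old primary, before the partition
    store.write(34, "primary", "node-b")   # new primary, granted a newer token
    try:
        store.write(33, "primary", "node-a")  # old primary wakes up late
    except StaleTokenError as err:
        print("rejected:", err)
```

The old primary never has to learn that it was replaced; its request simply fails at the store, which is what makes this prevention and not detection.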
