Incident · 2018-10-21 · MySQL split-brain

43 seconds, two primaries.

On the night of October 21st 2018, a planned maintenance event partitioned GitHub's US-East and US-West datacentres for 43 seconds. Orchestrator promoted a US-West replica to primary. When the link healed, the cluster had two MySQL primaries that had each accepted writes. Reconciling the divergence took roughly 24 hours of degraded service.

GitHub Engineering · 2018-10-31 · github.blog


TL;DR

A routine maintenance event on the GitHub network caused a 43-second partition between the US-East and US-West datacentres. The primary MySQL cluster sat in US-East, with replicas in US-West. Orchestrator, the MySQL HA tool GitHub ran, treated the partition as a primary failure and promoted a US-West replica to primary. When the network healed, GitHub had two MySQL primaries that had each accepted writes for those 43 seconds — a textbook split brain.

There is no automated merge for that situation. One timeline had to be chosen as canonical and the other's writes forward-replayed on top, with manual conflict resolution where the two histories had touched the same rows. The work — combined with rebuilding stale caches, replaying queued webhooks, and re-running stalled background jobs — took roughly 24 hours, from 22:52 UTC on October 21st to 23:03 UTC on October 22nd.

Timeline

Time (UTC) | Event
22:52, Oct 21 | A planned maintenance task replaces failing 100G optical equipment. The change disconnects the US-East ↔ US-West backbone for 43 seconds.
22:52:00 – 22:52:43 | The MySQL primary in US-East cannot reach its US-West replicas. Orchestrator's raft-coordinated consensus across regions concludes the primary is unreachable and elects a new primary in US-West.
22:52:43 | Network restored. Both regions are now writing to local primaries. Internal services begin reporting inconsistent reads.
22:54 | Engineers paged. Site reliability declares an incident.
~23:07 | Decision to fail the application stack into US-West fully so writes funnel to a single primary. Cross-region writes from US-East stop.
Oct 22, early hours | Backups from before the partition are identified. Engineers begin a forward-replay strategy to bring the East-side writes into the West-side timeline.
Throughout Oct 22 | Replication catch-up runs over hundreds of GB of binlogs. Application throttled to keep the queue tractable.
23:03, Oct 22 | All services restored to normal operation. Total incident: 24 hours, 11 minutes of degraded service.
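
The stated total can be checked from the first and last timestamps in the timeline. A minimal Python sketch, using only those two times and nothing else:

```python
from datetime import datetime, timezone

# Start and end of the incident as given in the timeline (UTC).
incident_start = datetime(2018, 10, 21, 22, 52, tzinfo=timezone.utc)
incident_end = datetime(2018, 10, 22, 23, 3, tzinfo=timezone.utc)

elapsed = incident_end - incident_start
hours, remainder = divmod(int(elapsed.total_seconds()), 3600)
minutes = remainder // 60

print(f"{hours} hours, {minutes} minutes")  # 24 hours, 11 minutes
```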

What went wrong technically

Orchestrator runs as a raft cluster of its own, with nodes in multiple regions observing the MySQL topology. When a primary becomes unreachable, Orchestrator can be configured to auto-promote a replica. GitHub's policy was tuned for the common case — a single primary dying — and tripped on a much rarer one: a brief partition where the primary was healthy but unreachable from a quorum of observers.

43 seconds is short. It is shorter than several routine network events that occur weekly across any large cross-region network — BGP convergence on a route change, a fibre cut and reroute, a switch reboot. The auto-failover threshold should have held longer before assuming the primary was dead, particularly because the cost of a wrong promotion across regions is catastrophic and the cost of a few extra seconds of unavailability is small.

Once both primaries had accepted writes, MySQL had no concept of merging the two histories. Statement-based and row-based binlogs both assume a single linear sequence per primary. A write to a row in the issues table in US-East at timestamp T and a write to the same row in US-West at T+10ms had no mechanism to be ordered or combined; one of them had to be authoritative, and the other had to be replayed as if it had come after.

43 seconds was enough. A failover threshold that fires after less than a minute treats every cross-region network blip as a primary failure. Across a year of operating a multi-region backbone, partitions of that length happen several times — they almost always heal on their own. The right policy is to wait long enough that genuine primary deaths still get caught but transient partitions don't trigger promotion. Several minutes, not seconds.
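
One way to express such a policy is sketched below in Python. It is illustrative only and does not describe how Orchestrator reaches its decision: the `Observation` type, the `should_auto_promote` function, the region labels, and the 300-second default are all assumptions made for the example. The sketch allows automatic promotion only when the candidate is in the primary's own region and every observer in that region has seen the primary down for at least the threshold.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Observation:
    """What one observer node currently believes about the primary (hypothetical type)."""
    observer_region: str
    primary_reachable: bool
    seconds_unreachable: float


def should_auto_promote(
    observations: List[Observation],
    primary_region: str,
    candidate_region: str,
    min_unreachable_seconds: float = 300.0,  # assumed value: minutes, not seconds
) -> bool:
    """Return True only if automatic promotion is allowed under this sketch policy."""
    # Cross-region promotion is never automatic; it is left to an operator.
    if candidate_region != primary_region:
        return False
    # Only observers in the primary's own region can tell a dead primary
    # apart from a cross-region partition.
    local = [o for o in observations if o.observer_region == primary_region]
    if not local:
        return False
    return all(
        (not o.primary_reachable) and o.seconds_unreachable >= min_unreachable_seconds
        for o in local
    )


# The incident's shape: West cannot see the primary for 43 s, East still can.
partition = [
    Observation("us-east", True, 0.0),
    Observation("us-west", False, 43.0),
]
print(should_auto_promote(partition, "us-east", "us-west"))  # False
```

Under this policy the 43-second partition promotes nothing: the only candidate is in another region, and the observers closest to the primary can still reach it.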

The recovery

Recovery was manual, operator-driven work rather than the output of an automated tool. The shape of the procedure was: pick one of the two timelines as canonical, restore it from backups and binlogs onto a clean cluster, then forward-replay the other timeline's writes on top. For rows that both sides had modified, an operator had to decide which version won and reconcile downstream state.
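
The bookkeeping half of that procedure can be sketched in a few lines of Python. This is a toy model, not GitHub's tooling: the `Write` type and `plan_replay` function are hypothetical, and real binlog events carry far more than a table name, a key, and a value. The sketch only separates the non-canonical timeline into writes that can be replayed as-is and writes that need a human decision.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Write:
    """One row-level write, reduced to the fields the sketch needs."""
    table: str
    pk: int
    value: str

    @property
    def key(self) -> Tuple[str, int]:
        return (self.table, self.pk)


def plan_replay(canonical: List[Write], other: List[Write]):
    """Split the non-canonical timeline into replayable writes and conflicts."""
    touched = {w.key for w in canonical}
    replayable = [w for w in other if w.key not in touched]
    conflicts = [w for w in other if w.key in touched]  # operator must pick a winner
    return replayable, conflicts


west = [Write("issues", 1, "edited in west"), Write("comments", 7, "new in west")]
east = [Write("issues", 1, "edited in east"), Write("push_events", 3, "new in east")]

replayable, conflicts = plan_replay(canonical=west, other=east)
print([w.key for w in replayable])  # [('push_events', 3)]
print([w.key for w in conflicts])   # [('issues', 1)]
```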

GitHub chose to keep the US-West timeline as canonical because failing the application stack there was the fastest way to stop new divergent writes. That meant the 43 seconds of writes that had landed in US-East after the failover needed to be brought forward. Most of those were idempotent or easy to merge — push events, login records, ephemeral state — but the user-visible cases (issues, pull_requests, comments, reviews) required care. A pull request that had been opened on the East side after the failover, and given an ID by the East primary, might collide with a different pull request that had been opened on the West side and given the same ID by the West primary.
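
The ID collision is easy to reproduce in a toy model. The Python sketch below assumes plain auto-increment keys and two primaries that start from the same replicated state; the `Primary` class, the starting ID of 1001, and the titles are all made up for illustration.

```python
class Primary:
    """Toy primary that hands out auto-increment IDs from its own local counter."""

    def __init__(self, name, next_id):
        self.name = name
        self.next_id = next_id
        self.rows = {}

    def insert(self, title):
        row_id = self.next_id
        self.next_id += 1
        self.rows[row_id] = title
        return row_id


# Both sides begin from the same replicated state, so they share the next ID.
east = Primary("us-east", next_id=1001)
west = Primary("us-west", next_id=1001)

east_id = east.insert("Fix flaky test")
west_id = west.insert("Update README")

# Same ID, different pull request: neither row can simply be copied across.
colliding_ids = sorted(set(east.rows) & set(west.rows))
print(colliding_ids)  # [1001]
```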

Why split-brain recovery is manual. Two primaries that have each accepted writes do not have a shared notion of order between those writes: no Lamport timestamps, no Paxos log, no consistent snapshot. A reconciliation tool can only do bookkeeping: line up the binlogs, ask the operator to pick a winner for each conflicting row, and replay the rest. The "decide which version wins" step is the part that has to be done by a human who knows what those rows mean.

# Orchestrator topology config: the relevant thresholds
# (illustrative, not GitHub's exact production values; Orchestrator's
# real configuration file is JSON, shown here as key = value for brevity)

# Anti-flapping: minimum time between automated recoveries on one cluster.
RecoveryPeriodBlockSeconds                = 3600
# Suppresses repeated detection of the same ongoing failure.
FailureDetectionPeriodBlockMinutes        = 60
# How often each MySQL instance is polled.
InstancePollSeconds                       = 5

# Time a primary must be unreachable before failover triggers.
# Setting this too low promotes during transient partitions.
# (Illustrative key name; check the Orchestrator docs for what your version supports.)
MasterFailureDetectionPeriodBlockSeconds  = 30

# Cross-region promotion guards: refuse to automatically promote a replica
# in a different data centre or region, leaving that decision to an operator.
PreventCrossDataCenterMasterFailover      = true
PreventCrossRegionMasterFailover          = true

PreventCrossDataCenterMasterFailover and PreventCrossRegionMasterFailover exist in Orchestrator precisely because this failure mode was known to be dangerous; the cost of forgetting to set them, or of having them off because cross-region failover was a designed recovery path, was a 24-hour outage.
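
A cheap safeguard is to check for the guards before a configuration is deployed. The Python sketch below assumes the configuration is available as JSON text, which is the format Orchestrator reads; the `missing_cross_region_guards` helper and the example document are hypothetical.

```python
import json

# The two guard keys discussed above.
REQUIRED_GUARDS = (
    "PreventCrossDataCenterMasterFailover",
    "PreventCrossRegionMasterFailover",
)


def missing_cross_region_guards(config_text):
    """Return the guard keys that are absent or not set to true."""
    config = json.loads(config_text)
    return [key for key in REQUIRED_GUARDS if config.get(key) is not True]


example = '{"InstancePollSeconds": 5, "PreventCrossDataCenterMasterFailover": true}'
print(missing_cross_region_guards(example))  # ['PreventCrossRegionMasterFailover']
```

A check like this belongs in whatever pipeline ships the config, so that a missing guard fails a build instead of surfacing during a partition.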

User-visible impact

MySQL is on the critical path for almost everything GitHub does, so the cascade was wide:

  • Pull requests showed wrong base SHAs. The mergeability check reads from the primary; with stale or inconsistent reads, the displayed base commit drifted from what was actually in git.
  • Webhooks delayed by hours. The dispatch queue backed up while the application was throttled to keep the cross-region write catch-up tractable. Some webhooks arrived late, in a different order than the events that triggered them.
  • GitHub Pages served stale content. The build pipeline writes to MySQL and to object storage; with writes paused, deploys queued and the public-facing sites kept serving the previous version.
  • GitHub Actions queue stalled. Actions had been announced in limited beta only days earlier and depended on the same primary cluster for run state. Runs that were in-flight when the partition hit were lost or had to be re-queued.
  • API rate-limit windows reset oddly. The counters live in MySQL; clients saw both quota exhaustion and unexpected resets as the underlying tables came back into sync.

Git itself — push and clone — was largely unaffected because the git data plane doesn't go through MySQL. That is part of why the outage was framed as 24h of degraded service rather than a hard outage; the core repository operations mostly worked, and the metadata layer was the thing that was sick.

Lessons

  • Auto-failover policies are dangerous in cross-region setups. The cost of a wrong promotion across regions is split brain; the cost of a delayed promotion is a few extra seconds of unavailability. The asymmetry says cross-region promotions should require operator confirmation, not happen automatically on a 30-second timer.
  • Cross-region MySQL with synchronous replication is hard, so asynchronous replication across regions is what most teams actually run. Replicas lag behind the primary, but there is no split-brain risk as long as only one node can ever accept writes. The moment you make the secondary region able to become primary automatically, you have inherited the consensus problem and need a real consensus protocol.
  • The right consistency mechanism for cross-region writes is a distributed SQL engine, not MySQL with a failover tool. See Spanner — a Paxos-replicated, externally consistent SQL database makes this entire failure class structurally impossible. There is no scenario in which two Spanner replicas accept conflicting writes, because writes go through Paxos before being applied.
  • Recovery runbooks for split brain need to exist before split brain happens. A team that has not rehearsed picking a canonical timeline and replaying the other on top will spend hours of the outage figuring out the procedure. Tabletop the scenario; write down the steps; practise on a non-production copy.
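
The structural argument behind the consensus lesson is the majority rule: a side of a partition may accept writes only if it holds a strict majority of replicas, and two disjoint sides cannot both hold one. The Python sketch below shows only that counting argument. It is a simplification, not Spanner's or Paxos's actual protocol, and the function names are made up.

```python
def can_commit(partition_size, cluster_size):
    """A side of a partition may accept writes only with a strict majority."""
    return partition_size > cluster_size // 2


def writable_sides(partition_sizes):
    """Return the sizes of the sides that could still commit writes."""
    cluster_size = sum(partition_sizes)
    return [size for size in partition_sizes if can_commit(size, cluster_size)]


print(writable_sides([3, 2]))     # [3]
print(writable_sides([2, 2, 1]))  # []
```

However the cluster splits, at most one side stays writable. The minority side becomes unavailable instead of divergent, which is the trade the 2018 failover policy did not make.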

What GitHub changed

The follow-up work, described in the postmortem and in subsequent engineering posts, was structural rather than cosmetic:

  • Topology change. GitHub moved away from a configuration in which a cross-region replica could be auto-promoted. The new topology kept primary writes inside a single region with bounded-staleness read replicas elsewhere; cross-region failover became a deliberate, human-initiated operation.
  • Orchestrator policy tightened. The detection windows were raised; cross-region promotion guards were turned on; the failover paths were audited for what they assumed about network conditions.
  • On-call retraining. The runbook for split-brain MySQL recovery was written down properly and rehearsed. The incident exposed how much of the response depended on individual engineers who happened to remember binlog mechanics.
  • Cross-region consistency tooling. Investment in tooling that could detect divergence between replicas more quickly, audit application assumptions about read consistency, and make the eventual move to a more consistent backing store easier.
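
The source does not describe how that divergence tooling works. One generic approach is to compare per-row digests between two replicas, sketched below in Python. The `row_digest` and `diverged_keys` helpers and the in-memory dictionaries are assumptions for illustration; a real tool would checksum in chunks on the database side.

```python
import hashlib


def row_digest(row):
    """Stable digest of one row's column values, in sorted column order."""
    joined = "\x1f".join(f"{col}={row[col]!r}" for col in sorted(row))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def diverged_keys(replica_a, replica_b):
    """Primary keys whose rows differ, or that exist on only one side."""
    keys = set(replica_a) | set(replica_b)
    return sorted(
        k for k in keys
        if k not in replica_a
        or k not in replica_b
        or row_digest(replica_a[k]) != row_digest(replica_b[k])
    )


east_rows = {1: {"title": "Fix flaky test"}, 2: {"title": "Bump deps"}}
west_rows = {1: {"title": "Fix flaky test"}, 2: {"title": "Bump dependencies"}, 3: {"title": "New PR"}}
print(diverged_keys(east_rows, west_rows))  # [2, 3]
```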

The broader lesson

Multi-master replication across geographical regions is the failure mode that hurts. The appeal is obvious — every region can serve writes locally, the latency is great, no region is a single point of failure — but the consistency story falls apart the moment the network does. And the network always does, eventually.

The choice most teams should make is the boring one: a primary-region database with asynchronous read replicas in other regions. Writes go to one place; reads can be served locally with bounded staleness; cross-region failover is a planned event with human approval. The rare cross-region writes that need stronger guarantees can be elevated to a different mechanism — a distributed SQL engine like Spanner or CockroachDB, a deterministic transaction log in the Calvin style, or simply a workflow that tolerates a few seconds of unavailability.
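
Bounded staleness has to be enforced somewhere, typically in the read path. The Python sketch below is one hypothetical way to route a read: serve it from the local replica only when the replica's measured lag is within the bound, and otherwise fall back to the primary. The function name, the return labels, and the idea of passing lag in as a number are all assumptions.

```python
def choose_read_target(replica_lag_seconds, max_staleness_seconds):
    """Pick where to send a read under a bounded-staleness policy."""
    if replica_lag_seconds is None:
        return "primary"  # lag unknown, so the bound cannot be guaranteed
    if replica_lag_seconds <= max_staleness_seconds:
        return "local_replica"
    return "primary"


print(choose_read_target(0.8, max_staleness_seconds=5))   # local_replica
print(choose_read_target(42.0, max_staleness_seconds=5))  # primary
```

Writes never appear in this function: under the boring topology they always go to the single primary.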

GitHub's October 2018 incident is the canonical illustration of the alternative. A 43-second blip, a too-eager failover policy, and the rest of the day spent explaining to users why their pull requests looked wrong. The engineering machinery to handle this correctly exists; the harder problem is recognising ahead of time that you need it.

Further reading

  • GitHub Engineering, October 21 post-incident analysis (2018-10-31) — the original postmortem. The timeline, the technical explanation, and the commitments to follow-up work, written by the team that lived it.
  • Spanner — Corbett et al 2012, annotated — the distributed SQL design that makes split brain structurally impossible. Read it as the alternative architecture: Paxos per shard, external consistency via TrueTime, no failover tool to misfire.
  • Calvin — Thomson et al 2012, annotated — a different attack on the same problem. A deterministic global sequencer orders transactions before they run, so replicas never need to coordinate to agree on outcomes.
  • Orchestrator — recovery configuration docs — the failover knobs themselves. FailureDetectionPeriodBlockMinutes, PreventCrossRegionMasterFailover, and the rest of the policy surface. Worth reading even if you don't run Orchestrator, as a study in what a real-world HA tool has to expose.
  • Kingsbury, Jepsen analyses (various) — the body of work that demonstrated, over and over, that database HA tools lose data under partitions. Useful background for why the 2018 incident was not surprising in retrospect.