11 / 20

Topics / 11

Failure detectors

"Has the other node died?" sounds simple. Over a network where messages can be delayed forever, it is the question FLP impossibility proved nobody can answer correctly in every case. Real systems get close with heartbeats, phi-accrual, and SWIM, by trading completeness for accuracy and living with some false suspicions.

Why detection is hard

Take two nodes A and B exchanging messages. A hasn't heard from B in 5 seconds. Three possibilities: (1) B has crashed, (2) the network between them is partitioned, (3) B is alive but a GC pause, overloaded queue, or network blip has delayed its reply. A can't tell these apart from where it sits. All it has is the absence of a message.

Decide too fast that B is dead and the system can lose data or elect two leaders when B was actually alive (split brain). Decide too slowly and operations stall waiting on a corpse. The whole field of failure detectors is about making this trade-off on purpose.

Chandra-Toueg classes

The 1996 paper "Unreliable Failure Detectors for Reliable Distributed Systems" by Chandra and Toueg sorted detectors along two axes: completeness (does every failed node eventually get suspected?) and accuracy (does the detector ever wrongly suspect a live node?).

P (Perfect) — strongly complete + strongly accurate. Never wrong. Impossible in asynchronous systems (this is FLP).
S (Strong) — strongly complete + eventually strongly accurate. After some unknown point, only failed nodes are suspected.
◊P (Eventually Perfect) — what real systems get close to. Both properties hold eventually.
W (Weak) — at least one process eventually suspects every failed process. The bare minimum for solving consensus under partial synchrony.

Consensus needs at minimum ◊W. Raft, Paxos, and PBFT all quietly assume ◊P during normal operation and degrade gracefully when accuracy drops for a while.

The heartbeat detector

The simplest version. A monitors B by sending heartbeats on a schedule; B replies. If A misses N heartbeats in a row, A declares B dead. Timeout T and N are the tuning knobs. Smaller T = faster detection, more false positives during network jitter. Larger T = slow detection.

Production defaults vary. etcd Raft uses a 150–300 ms heartbeat with election timeout at 1 second. Kubernetes node-monitor-grace-period waits 40 seconds before marking a node NotReady. Cassandra's old TCP-based failure detector used 5 seconds; the phi-accrual variant (since 0.6) does much better.

The 1-RTT rule. Make the heartbeat interval ≥ 2× the typical p99 network RTT, and the timeout ≥ 4× the heartbeat. Anything tighter triggers false suspicions on routine network blips. For datacentre traffic (p99 ~ 1ms), a 10ms heartbeat with 50ms timeout works. For cross-region (p99 ~ 100ms), a 500ms heartbeat with 2s timeout is closer to reality.

Phi-accrual

The accrual failure detector (Hayashibara, 2004) replaces binary alive/dead with a continuous suspicion level, phi. Phi grows as the time since the last heartbeat runs past the historical mean. The application picks a threshold: phi=8 means "probably dead", phi=12 means "definitely dead". The detector adapts to the heartbeat-interval distribution it actually sees, not a fixed threshold.

Math: assume heartbeat arrival times are normally distributed with mean μ and stddev σ from the recent history. phi(t) = -log10(probability that t since last heartbeat is normal given the distribution). At t = μ + σ, phi ≈ 1; at t = μ + 4σ, phi ≈ 8.

Cassandra uses phi-accrual with a default threshold of 8. Akka cluster uses it too. The win: a flaky high-latency link doesn't set off constant false alarms, and a truly dead node hits the threshold fast.

SWIM

SWIM ("Scalable Weakly-consistent Infection-style Membership", Das/Gupta/Motivala 2002) splits failure detection from membership broadcast and uses gossip to scale. Each node pings a random other node every so often. If no ack arrives in T, it asks K random other nodes to ping the target on its behalf. If none of them get an ack, the target is marked suspect, and the suspicion is gossiped to the cluster.

SWIM runs in O(N) total messages per round no matter the cluster size (each node only pings 1 + K others), and detection time is O(log N) through gossip. HashiCorp Serf uses SWIM; Consul's gossip layer uses Serf; Lifeguard (Schmidt 2018) is a SWIM extension that behaves better during network instability.

Adaptive timeouts

Beyond phi-accrual, modern systems adapt timeouts from observed network behaviour. AWS's Maelstrom paper, Google's gRPC docs, and Envoy's outlier detection all use percentile-based timeouts (set timeout = p99 of observed RTT × multiplier). When p99 climbs, timeouts grow; when it falls, they tighten. That sidesteps the static-tuning trap.

The phi-accrual math, worked out

Phi-accrual is simple statistics under a fancy name. The detector keeps a sliding window (typically the last 1000 inter-arrival times) of how long heartbeats took to arrive. From that window it computes the mean μ and standard deviation σ. Then, given the time since the last heartbeat, it works out the probability that "no heartbeat in this long" fits the historical distribution, and reports the negative log of that probability.

Assume a normal distribution for simplicity (Cassandra uses a more conservative exponential tail, but the shape is similar):

P(no heartbeat in t seconds) ≈ 1 − Φ((t − μ) / σ)
phi(t) = −log10( P(no heartbeat in t) )

Examples with μ = 100ms, σ = 20ms:
  t =  120ms  →  P ≈ 0.16   →  phi ≈ 0.8     "fine"
  t =  140ms  →  P ≈ 0.023  →  phi ≈ 1.6     "slow"
  t =  180ms  →  P ≈ 0.0002 →  phi ≈ 3.7     "concerning"
  t =  220ms  →  P ≈ 1e-7   →  phi ≈ 7       "alert"
  t =  260ms  →  P ≈ 7e-12  →  phi ≈ 11      "treat as dead"

The Cassandra default threshold is phi = 8, which in the example above sits at roughly 8σ past the mean. The number feels reasonable, since a delay that long is vanishingly unlikely if the node and network are healthy. But the real probability depends on the distribution's true tail, which is fatter than normal in real networks. Production tuning is mostly about picking the right threshold for how many false suspicions the workload can tolerate.

The big win over a static timeout is that phi-accrual adapts to the link's real behaviour. A cross-region link with μ = 50ms, σ = 8ms gets a sharp curve; a local-VM link with μ = 1ms, σ = 0.2ms gets an even sharper one. Both reach phi = 8 at "way past expected", but "way past expected" is measured against the observed distribution rather than a hard-coded number.

SWIM in detail — the protocol that scales

SWIM's design solves three problems at once: keep detection time bounded, keep network load bounded, and keep the false-positive rate bounded. The protocol has three phases.

1. Direct ping. Every T_protocol milliseconds, node N picks a random member M from its membership list and sends a PING. M replies with ACK. If the ACK arrives within T_ping, M is alive and the round ends.

2. Indirect ping (the clever part). If no ACK arrives, N picks K random other members and sends each a PING-REQ for M. Each of those nodes pings M directly. If any of them gets a reply, it relays an ACK back to N. If none do within T_round, M is marked SUSPECTED.

3. Suspicion broadcast via piggyback gossip. The SUSPECTED state is added to a piggyback buffer that rides on regular SWIM pings (no new messages, just extra bytes in existing ones). Within a few protocol rounds (O(log N) rounds), every node has heard the suspicion. If M responds to anyone during this gossip window, it rejoins as ALIVE. If the suspicion window expires, M is declared DEAD.

The indirect-ping phase is what gives SWIM its low false-positive rate. A dead node fails every direct ping. A node behind a flaky network path only fails the path to N, but K random other nodes likely have working paths to M. A false detection needs N and all K helpers to lack a working path at the same time. That's vanishingly unlikely unless M is truly partitioned or dead.

Hashicorp Memberlist (Serf, Consul) defaults:
  T_protocol  = 1s         // direct ping interval
  K           = 3          // helpers for indirect ping
  T_round     = 5s         // before SUSPECT
  suspect window = 5s      // before DEAD declared

Per-node message load:
  outgoing: ~1 ping/s + K piggyback bytes per ping
  incoming: ~1 ping + 1 ack on average
total cluster-wide: O(N) messages per second, regardless of cluster size

Detection time:
  T_round + suspect window + gossip propagation ≈ 10-15 seconds

Hashicorp's Lifeguard (2018) extended SWIM with two changes that improve behaviour under network stress. Local Health Aware Suspicion uses the local node's recent ack failure rate to scale the suspicion timeout. A node that has been failing its own pings extends suspicion for its peers, on the theory that the local node may be the unhealthy one. Dogpile mitigation cuts the chance of several suspicion broadcasts piling onto the same target. Both ship in Consul and reduce flakiness during real datacenter incidents.

Why the perfect detector is impossible

Chandra-Toueg classified detectors; FLP (Fischer-Lynch-Paterson, 1985) proved one of them can't exist. The theorem: in an asynchronous system with even one possible crash failure, no deterministic algorithm can solve consensus. Since consensus needs some form of accurate failure detection, the perfect failure detector is also impossible in that setting.

The intuition is the same one you hit when reasoning about timeouts. A node that hasn't responded in t seconds may be dead, or may be a millisecond from responding. No finite t reliably tells the two cases apart. The detector is forced to either be incomplete (let some failures go undetected, by setting t too large) or inaccurate ( suspect live nodes, by setting t too small). The Chandra-Toueg classification is exactly the set of trade-offs this corner forces.

Real systems live with this by being honest about the trade-off. Raft uses ◊P semantics: a leader that hears no heartbeats for an election timeout assumes the cluster has lost it and steps down; followers that hear no heartbeats start a new election. Inaccurate detection (a working leader briefly suspected dead) causes leader churn, which is disruptive but recoverable. Cassandra phi-accrual is also ◊P in practice: marked-dead nodes can come back, and the cluster handles it through gossip.

The "machine up" vs "service healthy" gap

Most production failure detection is layered. There's a low-level liveness signal (the machine is responding to pings), a middle-level one (the process is up and responding to TCP), and a high-level one (the application is processing requests correctly). These can diverge in interesting ways:

Machine up, process down. The kernel is healthy; the service died. TCP RSTs on the service port; liveness check fails immediately. Easy case.

Process up, application stuck. The process is alive — accepting TCP connections, ackowledging heartbeats — but its worker pool is fully saturated, its background loop has deadlocked, or a critical thread has wedged. Heartbeats pass but real requests time out. The Kubernetes liveness/readiness split exists for exactly this. Readiness fails before liveness, pulling the pod out of rotation without restarting it; liveness fails only when the process is truly hung.

Application healthy, dependencies down. The service is fine but the database behind it is unreachable. Most teams mistake this for a service problem and page on liveness. Mature observability separates "this service" from "this service's critical dependencies", with different alerts for each.

Picking the right level for the failure detector matters. A consensus protocol watches the lowest level (peer process is up); a load balancer watches the highest (peer can actually serve traffic); a service mesh watches both. Mixing these layers, like using the load-balancer's health-check signal to drive consensus leader election, leads to thrashing during dependency outages.

Detection time vs detection accuracy — the empirical curve

A table of operational defaults across systems, with the trade-off each one reflects:

System              Detection time      False positive rate     Use case
etcd Raft           1 sec               low                     critical state, latency-tolerant
Cassandra phi-8     ~3-10 sec           very low                eventual consistency, low overhead
Consul Serf         ~10-15 sec          very low                large clusters, gossip
Kubernetes node     40 sec              very low                production workloads
TCP keepalive       ~2 hr (default)     extremely low           generic network liveness
HAProxy health      2 sec (typical)     medium                  load balancing
Envoy outlier       configurable        configurable            traffic management

The pattern: faster detection means more thrashing during normal network noise. Critical leadership-election systems push detection times down and accept the occasional unnecessary failover. Large-cluster membership protocols push them up because false positives cascade through gossip and waste cluster-wide work.

The most common production mistake is using a system's default detection time without asking whether it matches the workload. Kubernetes's 40-second node-not-ready-toleration suits stateful workloads on noisy networks; for latency-critical stateless workloads behind a load balancer, evictions could happen much faster, and many teams tune this without seeing the trade-off.

What failure detection cannot fix

Two failure modes that detectors handle badly. Both need system design help rather than better detection.

Grey failure. A node is up, responsive, processing some requests correctly, but failing a subset: a slow disk on one volume, a single dependency unreachable, a bug in one code path. The failure detector sees heartbeats; the application sees errors. The fix is to add per-request error observability (Envoy outlier detection, application-level circuit breakers) so the signal that drives evictions is "this node is failing my requests", not "this node is pinging me back".

Asymmetric partition. A can talk to B, but B cannot talk to A. Both sides' detectors see different worlds. Common with NAT, asymmetric routing, or one-way firewall rules. The fix is bidirectional detection (the protocol must work even when only one direction is alive) plus gossip-based confirmation (other nodes can back up A's view of B).

Both of these are why production systems layer several detection signals and almost never trust a single source. The detector tells you a node might be in trouble; the policy on top of that signal decides what to do about it.

Common mistakes

Treating timeout as "node is dead" rather than "I have not heard from node". The first is a falsifiable claim about the world. The second is honest. Two-phase commit and its descendants made this mistake; Paxos avoided it by treating suspicion as a hint, not a fact. Setting timeouts off the average RTT rather than the tail. Setting Kubernetes node-not-ready-toleration-duration too low and thrashing pods during routine GC pauses on the kubelet. Conflating"machine is up" (PING works) with "service is healthy" (the application is making forward progress). Many production outages come from this gap.

Failure detectors

Why detection is hard

Chandra-Toueg classes

The heartbeat detector

Phi-accrual

SWIM

Adaptive timeouts

The phi-accrual math, worked out

SWIM in detail — the protocol that scales

Why the perfect detector is impossible

The "machine up" vs "service healthy" gap

Detection time vs detection accuracy — the empirical curve

What failure detection cannot fix

Common mistakes

Further reading