Failure detectors
"Has the other node died?" sounds simple. Over a network where messages can be delayed forever, it is the question FLP impossibility proved nobody can answer correctly in every case. Real systems get close with heartbeats, phi-accrual, and SWIM, by trading completeness for accuracy and living with some false suspicions.
Why detection is hard
Take two nodes A and B exchanging messages. A hasn't heard from B in 5 seconds. Three possibilities: (1) B has crashed, (2) the network between them is partitioned, (3) B is alive but a GC pause, overloaded queue, or network blip has delayed its reply. A can't tell these apart from where it sits. All it has is the absence of a message.
Decide too fast that B is dead and the system can lose data or elect two leaders when B was actually alive (split brain). Decide too slowly and operations stall waiting on a corpse. The whole field of failure detectors is about making this trade-off on purpose.
Chandra-Toueg classes
The 1996 paper "Unreliable Failure Detectors for Reliable Distributed Systems" by Chandra and Toueg sorted detectors along two axes: completeness (does every failed node eventually get suspected?) and accuracy (does the detector ever wrongly suspect a live node?).
- P (Perfect) — strongly complete + strongly accurate. Never wrong. Impossible in asynchronous systems (this is FLP).
- S (Strong) — strongly complete + eventually strongly accurate. After some unknown point, only failed nodes are suspected.
- ◊P (Eventually Perfect) — what real systems get close to. Both properties hold eventually.
- W (Weak) — at least one process eventually suspects every failed process. The bare minimum for solving consensus under partial synchrony.
Consensus needs at minimum ◊W. Raft, Paxos, and PBFT all quietly assume ◊P during normal operation and degrade gracefully when accuracy drops for a while.
The heartbeat detector
The simplest version. A monitors B by sending heartbeats on a schedule; B replies. If A misses N heartbeats in a row, A declares B dead. Timeout T and N are the tuning knobs. Smaller T = faster detection, more false positives during network jitter. Larger T = slow detection.
Production defaults vary. etcd Raft uses a 150–300 ms heartbeat with election timeout at 1 second. Kubernetes node-monitor-grace-period waits 40 seconds before marking a node NotReady. Cassandra's old TCP-based failure detector used 5 seconds; the phi-accrual variant (since 0.6) does much better.
Phi-accrual
The accrual failure detector (Hayashibara, 2004) replaces binary alive/dead with a continuous suspicion level, phi. Phi grows as the time since the last heartbeat runs past the historical mean. The application picks a threshold: phi=8 means "probably dead", phi=12 means "definitely dead". The detector adapts to the heartbeat-interval distribution it actually sees, not a fixed threshold.
Math: assume heartbeat arrival times are normally distributed with mean μ and stddev σ from the recent history. phi(t) = -log10(probability that t since last heartbeat is normal given the distribution). At t = μ + σ, phi ≈ 1; at t = μ + 4σ, phi ≈ 8.
Cassandra uses phi-accrual with a default threshold of 8. Akka cluster uses it too. The win: a flaky high-latency link doesn't set off constant false alarms, and a truly dead node hits the threshold fast.
SWIM
SWIM ("Scalable Weakly-consistent Infection-style Membership", Das/Gupta/Motivala 2002) splits failure detection from membership broadcast and uses gossip to scale. Each node pings a random other node every so often. If no ack arrives in T, it asks K random other nodes to ping the target on its behalf. If none of them get an ack, the target is marked suspect, and the suspicion is gossiped to the cluster.
SWIM runs in O(N) total messages per round no matter the cluster size (each node only pings 1 + K others), and detection time is O(log N) through gossip. HashiCorp Serf uses SWIM; Consul's gossip layer uses Serf; Lifeguard (Schmidt 2018) is a SWIM extension that behaves better during network instability.
Adaptive timeouts
Beyond phi-accrual, modern systems adapt timeouts from observed network behaviour. AWS's Maelstrom paper, Google's gRPC docs, and Envoy's outlier detection all use percentile-based timeouts (set timeout = p99 of observed RTT × multiplier). When p99 climbs, timeouts grow; when it falls, they tighten. That sidesteps the static-tuning trap.
The phi-accrual math, worked out
Phi-accrual is simple statistics under a fancy name. The detector keeps a sliding window (typically the last 1000 inter-arrival times) of how long heartbeats took to arrive. From that window it computes the mean μ and standard deviation σ. Then, given the time since the last heartbeat, it works out the probability that "no heartbeat in this long" fits the historical distribution, and reports the negative log of that probability.
Assume a normal distribution for simplicity (Cassandra uses a more conservative exponential tail, but the shape is similar):
P(no heartbeat in t seconds) ≈ 1 − Φ((t − μ) / σ)
phi(t) = −log10( P(no heartbeat in t) )
Examples with μ = 100ms, σ = 20ms:
t = 120ms → P ≈ 0.16 → phi ≈ 0.8 "fine"
t = 140ms → P ≈ 0.023 → phi ≈ 1.6 "slow"
t = 180ms → P ≈ 0.0002 → phi ≈ 3.7 "concerning"
t = 220ms → P ≈ 1e-7 → phi ≈ 7 "alert"
t = 260ms → P ≈ 7e-12 → phi ≈ 11 "treat as dead"The Cassandra default threshold is phi = 8, which in the example above sits at roughly 8σ past the mean. The number feels reasonable, since a delay that long is vanishingly unlikely if the node and network are healthy. But the real probability depends on the distribution's true tail, which is fatter than normal in real networks. Production tuning is mostly about picking the right threshold for how many false suspicions the workload can tolerate.
The big win over a static timeout is that phi-accrual adapts to the link's real behaviour. A cross-region link with μ = 50ms, σ = 8ms gets a sharp curve; a local-VM link with μ = 1ms, σ = 0.2ms gets an even sharper one. Both reach phi = 8 at "way past expected", but "way past expected" is measured against the observed distribution rather than a hard-coded number.
SWIM in detail — the protocol that scales
SWIM's design solves three problems at once: keep detection time bounded, keep network load bounded, and keep the false-positive rate bounded. The protocol has three phases.
1. Direct ping. Every T_protocol milliseconds, node N picks
a random member M from its membership list and sends a PING. M replies with ACK. If the
ACK arrives within T_ping, M is alive and the round ends.
2. Indirect ping (the clever part). If no ACK arrives, N picks K random
other members and sends each a PING-REQ for M. Each of those nodes pings M directly. If
any of them gets a reply, it relays an ACK back to N. If none do within
T_round, M is marked SUSPECTED.
3. Suspicion broadcast via piggyback gossip. The SUSPECTED state is added to a piggyback buffer that rides on regular SWIM pings (no new messages, just extra bytes in existing ones). Within a few protocol rounds (O(log N) rounds), every node has heard the suspicion. If M responds to anyone during this gossip window, it rejoins as ALIVE. If the suspicion window expires, M is declared DEAD.
The indirect-ping phase is what gives SWIM its low false-positive rate. A dead node fails every direct ping. A node behind a flaky network path only fails the path to N, but K random other nodes likely have working paths to M. A false detection needs N and all K helpers to lack a working path at the same time. That's vanishingly unlikely unless M is truly partitioned or dead.
Hashicorp Memberlist (Serf, Consul) defaults:
T_protocol = 1s // direct ping interval
K = 3 // helpers for indirect ping
T_round = 5s // before SUSPECT
suspect window = 5s // before DEAD declared
Per-node message load:
outgoing: ~1 ping/s + K piggyback bytes per ping
incoming: ~1 ping + 1 ack on average
total cluster-wide: O(N) messages per second, regardless of cluster size
Detection time:
T_round + suspect window + gossip propagation ≈ 10-15 secondsHashicorp's Lifeguard (2018) extended SWIM with two changes that improve behaviour under network stress. Local Health Aware Suspicion uses the local node's recent ack failure rate to scale the suspicion timeout. A node that has been failing its own pings extends suspicion for its peers, on the theory that the local node may be the unhealthy one. Dogpile mitigation cuts the chance of several suspicion broadcasts piling onto the same target. Both ship in Consul and reduce flakiness during real datacenter incidents.
Why the perfect detector is impossible
Chandra-Toueg classified detectors; FLP (Fischer-Lynch-Paterson, 1985) proved one of them can't exist. The theorem: in an asynchronous system with even one possible crash failure, no deterministic algorithm can solve consensus. Since consensus needs some form of accurate failure detection, the perfect failure detector is also impossible in that setting.
The intuition is the same one you hit when reasoning about timeouts. A node that hasn't responded in t seconds may be dead, or may be a millisecond from responding. No finite t reliably tells the two cases apart. The detector is forced to either be incomplete (let some failures go undetected, by setting t too large) or inaccurate ( suspect live nodes, by setting t too small). The Chandra-Toueg classification is exactly the set of trade-offs this corner forces.
Real systems live with this by being honest about the trade-off. Raft uses ◊P semantics: a leader that hears no heartbeats for an election timeout assumes the cluster has lost it and steps down; followers that hear no heartbeats start a new election. Inaccurate detection (a working leader briefly suspected dead) causes leader churn, which is disruptive but recoverable. Cassandra phi-accrual is also ◊P in practice: marked-dead nodes can come back, and the cluster handles it through gossip.
The "machine up" vs "service healthy" gap
Most production failure detection is layered. There's a low-level liveness signal (the machine is responding to pings), a middle-level one (the process is up and responding to TCP), and a high-level one (the application is processing requests correctly). These can diverge in interesting ways:
Machine up, process down. The kernel is healthy; the service died. TCP RSTs on the service port; liveness check fails immediately. Easy case.
Process up, application stuck. The process is alive — accepting TCP connections, ackowledging heartbeats — but its worker pool is fully saturated, its background loop has deadlocked, or a critical thread has wedged. Heartbeats pass but real requests time out. The Kubernetes liveness/readiness split exists for exactly this. Readiness fails before liveness, pulling the pod out of rotation without restarting it; liveness fails only when the process is truly hung.
Application healthy, dependencies down. The service is fine but the database behind it is unreachable. Most teams mistake this for a service problem and page on liveness. Mature observability separates "this service" from "this service's critical dependencies", with different alerts for each.
Picking the right level for the failure detector matters. A consensus protocol watches the lowest level (peer process is up); a load balancer watches the highest (peer can actually serve traffic); a service mesh watches both. Mixing these layers, like using the load-balancer's health-check signal to drive consensus leader election, leads to thrashing during dependency outages.
Detection time vs detection accuracy — the empirical curve
A table of operational defaults across systems, with the trade-off each one reflects:
System Detection time False positive rate Use case
etcd Raft 1 sec low critical state, latency-tolerant
Cassandra phi-8 ~3-10 sec very low eventual consistency, low overhead
Consul Serf ~10-15 sec very low large clusters, gossip
Kubernetes node 40 sec very low production workloads
TCP keepalive ~2 hr (default) extremely low generic network liveness
HAProxy health 2 sec (typical) medium load balancing
Envoy outlier configurable configurable traffic managementThe pattern: faster detection means more thrashing during normal network noise. Critical leadership-election systems push detection times down and accept the occasional unnecessary failover. Large-cluster membership protocols push them up because false positives cascade through gossip and waste cluster-wide work.
The most common production mistake is using a system's default detection time without asking whether it matches the workload. Kubernetes's 40-second node-not-ready-toleration suits stateful workloads on noisy networks; for latency-critical stateless workloads behind a load balancer, evictions could happen much faster, and many teams tune this without seeing the trade-off.
What failure detection cannot fix
Two failure modes that detectors handle badly. Both need system design help rather than better detection.
Grey failure. A node is up, responsive, processing some requests correctly, but failing a subset: a slow disk on one volume, a single dependency unreachable, a bug in one code path. The failure detector sees heartbeats; the application sees errors. The fix is to add per-request error observability (Envoy outlier detection, application-level circuit breakers) so the signal that drives evictions is "this node is failing my requests", not "this node is pinging me back".
Asymmetric partition. A can talk to B, but B cannot talk to A. Both sides' detectors see different worlds. Common with NAT, asymmetric routing, or one-way firewall rules. The fix is bidirectional detection (the protocol must work even when only one direction is alive) plus gossip-based confirmation (other nodes can back up A's view of B).
Both of these are why production systems layer several detection signals and almost never trust a single source. The detector tells you a node might be in trouble; the policy on top of that signal decides what to do about it.
Common mistakes
Treating timeout as "node is dead" rather than "I have not heard from node". The first is a falsifiable claim about the world. The second is honest. Two-phase commit and its descendants made this mistake; Paxos avoided it by treating suspicion as a hint, not a fact. Setting timeouts off the average RTT rather than the tail. Setting Kubernetes node-not-ready-toleration-duration too low and thrashing pods during routine GC pauses on the kubelet. Conflating"machine is up" (PING works) with "service is healthy" (the application is making forward progress). Many production outages come from this gap.
Further reading
- Unreliable Failure Detectors for Reliable Distributed Systems — Chandra & Toueg, 1996. The taxonomy.
- The Phi Accrual Failure Detector — Hayashibara et al, 2004.
- SWIM — Das, Gupta, Motivala, 2002.
- Lifeguard — Schmidt et al, 2018. SWIM under network instability.
- FLP impossibility — the bound failure detectors must work within.
- Gossip protocols — annotated.