12 / 20
Topics / 12

Gossip protocols

Gossip is how clusters of thousands of nodes learn who is alive and what the world looks like, with no coordinator. Each node talks to a handful of random peers every second, and a change started at one node reaches every node in O(log N) rounds with bounded bandwidth. It's how Cassandra, Consul, Riak, and Akka cluster membership stay consistent at scale, and how SWIM and phi accrual replaced the brittle heartbeat-to-a-monitor pattern that coordinator-based clusters used to ship.


Why gossip in the first place

The naive way to detect failure in a cluster is to have every node send a heartbeat to a central monitor every second. With 1000 nodes that's 1000 heartbeats per second hitting one process, and the monitor is both the bottleneck and the single point of failure. The operator's pager goes off the moment the monitor itself flakes.

Gossip flips it around. Every node talks to a small fixed number of peers, usually K=3 or K=4, and shares what it knows about the rest of the cluster. A change at any one node reaches every other node in about logK(N) rounds with high probability. For N=1000 and K=4 that's roughly 10 rounds; at a 1-second gossip interval, full propagation takes about 10 seconds, and no single node carries more than a few KB per second of gossip traffic.

There's no coordinator. There's no SPOF. The protocol scales sub-linearly because the per-node load doesn't grow with cluster size. It depends only on K and the size of the digest exchanged per round.

The mechanics of a round

Every gossip interval, typically once per second, each node runs the same loop:

  1. Pick K random peers from the membership list (usually K=3 or K=4).
  2. Send each peer a digest of the node's state: per-entry versions or vector clocks, not the whole payload.
  3. The peer compares the digest to its own state, asks for anything it's behind on, and sends back anything the sender is behind on.
  4. Merge incoming updates by version. Newer wins.

Digests keep bandwidth bounded. A digest entry is on the order of 20 bytes (node ID + version), so even 1000 entries fit in ~20 KB and most rounds exchange only the diff. Cassandra's per-node gossip traffic on a 100-node cluster is typically under 5 KB/s.

Why O(log N). If every infected node infects K new nodes per round, the infected population roughly multiplies by (1+K) per round until it saturates. The time to reach "all nodes know" is dominated by the slow tail of the last few uninfected nodes. That's where random peer selection matters. A deterministic neighbour list slows the tail by orders of magnitude.

Three flavours of gossip

The literature splits gossip into three patterns. Most production systems use a mix.

FlavourWhat it doesUse case
Anti-entropyFull state reconciliation between two nodes per round. Slow, but eventually corrects every divergence.Background convergence; Cassandra Merkle-tree repair.
Rumor mongeringGossip a single hot event for a fixed number of rounds, then stop. Fast spread, low overhead.Membership changes, configuration updates.
Aggregate gossipCompute global averages (count, sum, mean) by repeated pair-wise mixing. Converges to the true value in O(log N) rounds.Cluster-wide metrics, load estimation.

Anti-entropy is the safety net: slow but complete. Rumor mongering is the fast lane, useful when you need a specific event to spread quickly but don't want it re-gossiped forever. Aggregate gossip is the underrated one; you can compute a cluster's total load without any node knowing the membership list.

SWIM — the scalable failure detector

Das, Gupta, and Motivala, 2002. SWIM (Scalable Weakly-consistent Infection-style process group Membership) is the failure detector half of gossip done right. Every gossip period, each node:

  1. Picks one random peer and sends it a PING.
  2. If the peer replies within the timeout, mark it alive.
  3. If not, pick K other random peers and ask each to PING-REQ the suspect on this node's behalf.
  4. If any of those indirect pings get a reply, the suspect is alive and the direct path was just flaky.
  5. If none reply, mark the suspect as suspect, then dead after a suspicion timeout.
Why K indirect pings. Direct ping failures are mostly transient network issues between two specific nodes: a single dropped packet, a routing blip, one congested NIC. Asking K=3 other peers to confirm cuts the false-positive rate by roughly the cube of the per-link drop probability. A 1% link failure rate becomes a 0.0001% wrong-marking rate.

Failure information piggybacks on the regular PING/ACK traffic. There's no separate channel for it. A new suspect-or-dead status spreads through the cluster as a rumor in O(log N) rounds.

Phi accrual failure detection

Hayashibara, Defago, Yared, Katayama, 2004. Binary alive/dead is the wrong abstraction. Real networks run on a continuum from "looks healthy" to "looks bad". Phi accrual replaces the boolean with a continuous suspicion value phi derived from heartbeat inter-arrival statistics.

Each node tracks the distribution of inter-arrival times for heartbeats from each peer. When a heartbeat is overdue, phi is the negative log of the probability that a heartbeat should have arrived by now. Higher phi means "more suspicious".

phi(t) = -log10(P_later(t - t_last))

# P_later under a normal model:
P_later(x) = 1 - normal_cdf(x, mean, stddev)

# Action thresholds (Cassandra defaults):
phi >= 8     # mark suspect
phi >= 12    # mark dead

The advantage: each application picks its own threshold for what suspicion level triggers a failover. A latency-sensitive service might act at phi=5; a slow batch job might wait until phi=12. The detector itself doesn't decide for everyone. Cassandra and Akka both ship phi accrual; it's the standard for systems that need to adapt to varying network behaviour.

Epidemic broadcast

"Epidemic" is the formal name for what gossip does to a message, after the disease analogy in the 1987 Demers et al. paper. The guarantee is probabilistic, not absolute: with fan-out K and N nodes, the message reaches every node in O(log N) hops with probability close to 1, and the expected per-node load is constant in N.

The trade-off is redundancy. Each node hears the message several times before the cluster settles, typically 3–5 times for K=4. That's the cost of having no coordinator: you can't dedupe without a global view, and you wouldn't want to spend global coordination to avoid a few extra KB.

A rumor mongering scheme caps the redundancy by stopping retransmission after a fixed number of rounds, or once a fraction of peers report "I already have it". The Demers paper analyses several variants; production systems usually pick one and hard-code the parameters per workload.

What ships in production

SystemFan-out / intervalFailure detectorNotes
CassandraK=3, 1 sPhi accrualAnti-entropy gossip every second, exchanging endpoint state versions.
Consul / SerfK=3, 1 s (LAN) / 30 s (WAN)SWIM + LifeguardHashiCorp's SWIM implementation with Lifeguard suspicion timeouts.
RiakRing gossip, ~1 minHeartbeat + ring claimDynamo-style; gossip for ring (consistent-hash) membership.
Akka clusterK=1–3, 1 sPhi accrualGossip convergence with CRDT-based membership and version vectors.
Hyperledger FabricConfigurable, ~1 sHeartbeat + leader-basedGossip layer disseminates blocks and transactions across orgs.

A simplified version of the inner loop used by most of the above:

# Every gossip_interval (typically 1s):
for each tick:
    peers = pick_random(membership, k=3)
    for peer in peers:
        digest = build_digest(local_state)        # versions only
        send(peer, GOSSIP_SYN, digest)

# On receiving a SYN:
def on_syn(sender, their_digest):
    needed_from_me = diff(local_state, their_digest)
    needed_from_them = missing(their_digest, local_state)
    send(sender, GOSSIP_ACK, needed_from_me, needed_from_them)

# On receiving an ACK:
def on_ack(sender, updates, requests):
    merge(local_state, updates)                   # higher version wins
    send(sender, GOSSIP_ACK2, fetch(requests))

Convergence math

The operational sweet spot for most clusters: fan-out K=4, gossip interval 1 second, digest size ~1 KB. At those settings:

Cluster size NRounds to full propagationTimePer-node bandwidth
100~7~7 s~4 KB/s
1,000~10~10 s~4 KB/s
10,000~14~14 s~4 KB/s
100,000~17~17 s~4 KB/s

The per-node bandwidth is roughly constant because each node still only gossips with K=4 peers, whatever the cluster size. Convergence time grows logarithmically: going from 1,000 to 100,000 nodes adds only ~7 seconds. That's the property that makes gossip work at scale.

Push, pull, and push-pull

Two peers can exchange state in three ways, and the choice matters:

  • Push. The sender pushes its updates to the receiver. Fast for spreading new events, since the originator broadcasts as soon as it learns something, but slow nodes that fell behind don't catch up unless someone happens to push to them.
  • Pull. The receiver asks for updates. Great for slow nodes catching up, but new events take an extra round-trip to spread because the originator has to wait to be asked.
  • Push-pull. Both directions in one exchange. Push spreads new events; pull catches up stragglers. The combined latency tracks the worse of the two, but the convergence guarantee is much tighter.

Most production systems run push-pull. The Demers paper showed analytically that push-pull halves the number of rounds needed for full convergence compared to either push-only or pull-only. Worth the extra payload per round.

Where gossip goes wrong

  • False positives during GC pauses. A 5-second JVM stop-the-world pause looks identical to a dead node to a heartbeat-based detector. Phi accrual helps because it widens the expected inter-arrival distribution when the network gets jittery; a fixed-threshold detector will mark nodes dead in bulk. Cassandra has war stories about cluster-wide "everyone is dead" cascades from G1 pauses.
  • Gossip storms when nodes flap. A node that toggles alive/dead every few seconds creates a wave of gossip messages on every flap. Lifeguard's contribution (Dadgar, 2018) is a self-adjusting suspicion timeout that lengthens when the cluster's recent suspicion history is high, slowing the rate of mark-and-unmark when the cluster is unhealthy.
  • Small-world topologies. If the random peer selection isn't actually random, say it's biased by rack, datacentre, or hash bucket, convergence slows sharply and the worst-case tail of "last node to hear the rumor" blows up. Always use uniform random selection across the full membership.
  • Membership lists that grow without bound. Crashed nodes have to be evicted from the gossip membership eventually, or the cluster will keep trying to gossip to ghosts. Most systems use a "dead for 72 hours, then remove" policy.

Further reading

Found this useful?