Service discovery, the address book.

Instances come and go. The address you hardcoded yesterday belongs to someone else today. Discovery is what your services use to find each other in a world where nothing stays still.

What is service discovery?

Endpoints, in motion.

Service discovery is how services find each other in a dynamic environment where instances come and go. There are two patterns: client-side discovery (the client queries a registry, then load-balances across the results itself) and server-side discovery (a load balancer fronts the service and hides instance changes). DNS is the oldest registry; Consul, etcd, ZooKeeper, Eureka, and Kubernetes Services are the modern ones.

In a static world, you write api.example.com in your config and call it forever. In a dynamic world — autoscaling groups, container orchestrators, blue-green deploys, spot fleets, rolling restarts — instances appear and disappear constantly. Every restart gets a new IP. The hardcoded list goes stale in minutes.

Service discovery is the missing layer: an authoritative registry of "who is currently up", and a way to ask it. The interesting questions are who maintains the registry, who reads from it, and how stale the answer is allowed to be.


Client-side vs server-side: who does the lookup

Whoever does the lookup shapes the whole design.

The pattern you choose determines deployment complexity, language portability, and how much logic lives in the client vs the platform load balancer.

Pattern 01 · Client-side discovery

The caller picks the instance.

The client queries the registry, receives the list of healthy instances, and load-balances among them itself. This lives at the library level: Eureka with Ribbon, Consul with smart clients. It is fast (no extra hop), but every language needs its own copy of the client logic.
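As a rough sketch of the pattern, the snippet below keeps a registry in memory and lets the caller rotate through the healthy instances itself. `InMemoryRegistry`, `RoundRobinClient`, and the addresses are hypothetical names for illustration, not the API of Eureka, Ribbon, or Consul.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryRegistry:
    """Hypothetical stand-in for a real registry such as Consul or Eureka."""
    instances: dict[str, list[str]] = field(default_factory=dict)

    def healthy(self, service: str) -> list[str]:
        # A real registry would filter by health-check status here.
        return list(self.instances.get(service, []))


class RoundRobinClient:
    """Client-side discovery: look up the list, then pick an instance locally."""

    def __init__(self, registry: InMemoryRegistry) -> None:
        self.registry = registry
        self._counter = 0

    def pick(self, service: str) -> str:
        candidates = self.registry.healthy(service)
        if not candidates:
            raise LookupError(f"no healthy instances for {service!r}")
        choice = candidates[self._counter % len(candidates)]
        self._counter += 1
        return choice


registry = InMemoryRegistry({"orders": ["10.0.0.42:8080", "10.0.0.43:8080"]})
client = RoundRobinClient(registry)
picks = [client.pick("orders") for _ in range(3)]
```

The load-balancing policy sits in `pick`, which is exactly the part every language's client library would have to reimplement.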


Self-registration vs third-party registration

Self-register or be registered.

Two flavours. Self-registration: the instance announces itself on startup ("I am here, IP 10.0.0.42, port 8080") and re-announces periodically with heartbeats. Eureka, Consul agents, and mDNS all work this way. It is simple, but the workload has to know about the registry.
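Heartbeat-based self-registration can be sketched as a table of "last seen" timestamps with a time-to-live. `HeartbeatRegistry` and the 30-second TTL are assumptions made for this example; time is passed in explicitly so the behaviour is easy to follow.

```python
class HeartbeatRegistry:
    """Hypothetical registry that forgets instances whose heartbeats stop."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl_seconds = ttl_seconds
        self._last_seen: dict[str, float] = {}

    def heartbeat(self, address: str, now: float) -> None:
        # The first heartbeat doubles as registration.
        self._last_seen[address] = now

    def deregister(self, address: str) -> None:
        self._last_seen.pop(address, None)

    def alive(self, now: float) -> list[str]:
        # An instance counts as up only if it announced itself within the TTL.
        return sorted(
            addr for addr, seen in self._last_seen.items()
            if now - seen <= self.ttl_seconds
        )


registry = HeartbeatRegistry(ttl_seconds=30)
registry.heartbeat("10.0.0.42:8080", now=0)
registry.heartbeat("10.0.0.43:8080", now=0)
at_20s = registry.alive(now=20)                # both instances still listed

registry.heartbeat("10.0.0.42:8080", now=25)   # only .42 keeps announcing
at_40s = registry.alive(now=40)                # .43 has aged out
```

An instance that dies without calling `deregister` stays listed until its TTL runs out, which is the staleness this guide keeps coming back to.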

Third-party registration: a separate process watches infrastructure events and registers instances on the workload's behalf. Kubernetes does this — the kubelet reports pod status to the API server; the endpoints controller writes them into the Service's endpoint set. The workload doesn't know about discovery at all.
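The registrar's job can be pictured as folding infrastructure events into an endpoint set. This is a toy model with a made-up event shape (`"ready"`, `"not_ready"`, `"deleted"`), not the Kubernetes controller's actual logic.

```python
def reconcile(endpoints: set[str], events: list[tuple[str, str]]) -> set[str]:
    """Apply infrastructure events on the workload's behalf (hypothetical event shape)."""
    updated = set(endpoints)
    for kind, address in events:
        if kind == "ready":
            updated.add(address)
        elif kind in ("not_ready", "deleted"):
            updated.discard(address)
    return updated


current = reconcile(
    {"10.244.0.5"},
    [
        ("ready", "10.244.0.6"),
        ("deleted", "10.244.0.5"),
        ("ready", "10.244.0.7"),
        ("not_ready", "10.244.0.7"),
    ],
)
```

The workload never appears in this code: it neither registers nor deregisters, which is the point of the pattern.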


Health checks come in three levels of "alive"

Three levels of "alive".

Liveness, readiness, and startup probes each answer a different question. Kubernetes and most modern systems split them on purpose; older systems collapse them into one check and end up treating an overloaded instance as a crashed one.

liveness

Should it be killed?

Has the process wedged? Failure → the orchestrator restarts it. Cheap endpoint, no dependencies. Don't probe DBs from here.

readiness

Should it get traffic?

Is it warmed up, dependencies reachable, queues drained? Failure → removed from the load balancer's pool but kept alive. Probe DBs and downstreams here.

startup

Has it finished booting?

Slow-to-start workloads (legacy JVMs, ML model loads). Suppresses liveness/readiness until startup passes — avoids restart loops on slow boots.
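One way to see how the three probes combine is a small decision function. This is a simplified model: real orchestrators also apply failure thresholds, periods, and timeouts, and the `Action` names are invented for this sketch.

```python
from enum import Enum


class Action(Enum):
    WAIT = "still starting; hold the other probes"
    RESTART = "restart the container"
    REMOVE_FROM_POOL = "keep running, stop sending traffic"
    SERVE = "send traffic"


def decide(startup_ok: bool, liveness_ok: bool, readiness_ok: bool) -> Action:
    """Combine the three probe results into one orchestrator action."""
    if not startup_ok:
        # Liveness and readiness are suppressed until startup passes.
        return Action.WAIT
    if not liveness_ok:
        return Action.RESTART
    if not readiness_ok:
        return Action.REMOVE_FROM_POOL
    return Action.SERVE
```

A system with a single health check has only two of these four outcomes to choose from, which is how an overloaded instance ends up restarted instead of rested.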


DNS is the oldest service registry, still in use

The oldest registry, still in use.

The simplest discovery mechanism is DNS. Configure an A record for each instance; clients resolve a name and get a list. It is cheap, universal, and needs no new infrastructure. The downside is TTL: clients and intermediate resolvers cache records for as long as the TTL allows, even after the instance is gone.

SRV records add port, weight, and priority, but most clients do not use them. DNS-SD (RFC 6763) layers service browsing on top of PTR, SRV, and TXT records. Kubernetes headless Services take a simpler route: the cluster DNS server is fed directly by the orchestrator and serves per-pod A records with very low TTLs, which works around most of the TTL problem. The pattern still has a long tail of caching surprises: the JVM keeps its own DNS cache, hosts may run nscd, and browsers cache too.

# Kubernetes headless service — DNS round-robin over endpoints
$ dig +short orders.default.svc.cluster.local
10.244.0.5
10.244.0.6
10.244.0.7

# 5s TTL means clients see new endpoints quickly, at the cost of more queries
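The caching trade-off is easier to see with a toy resolver cache. `TtlCache`, the `orders.internal` name, and the 30-second TTL are assumptions for illustration, and the clock is passed in by hand so no real DNS is involved.

```python
from typing import Callable


class TtlCache:
    """Toy resolver cache: answers are reused until their TTL runs out."""

    def __init__(self, lookup: Callable[[str], list[str]], ttl_seconds: float) -> None:
        self._lookup = lookup
        self._ttl = ttl_seconds
        self._entries: dict[str, tuple[float, list[str]]] = {}

    def resolve(self, name: str, now: float) -> list[str]:
        cached = self._entries.get(name)
        if cached is not None and now < cached[0]:
            return cached[1]              # possibly stale, but within TTL
        answer = self._lookup(name)
        self._entries[name] = (now + self._ttl, answer)
        return answer


# Hypothetical authoritative data; one instance is removed partway through.
records = {"orders.internal": ["10.244.0.5", "10.244.0.6"]}
cache = TtlCache(lambda name: list(records[name]), ttl_seconds=30)

before = cache.resolve("orders.internal", now=0)
records["orders.internal"] = ["10.244.0.5"]        # instance removed upstream
during = cache.resolve("orders.internal", now=20)  # still the old answer
after = cache.resolve("orders.internal", now=31)   # cache expired, fresh answer
```

Shrinking `ttl_seconds` narrows the stale window and multiplies the number of upstream lookups, which is the same trade the 5s TTL makes.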

The modern registries: Consul, etcd, ZooKeeper, Eureka

Consul, etcd, ZooKeeper, Eureka.

Consul (HashiCorp): DNS interface plus an HTTP API; gossip-based agent network; multi-datacenter primitives. Excellent in heterogeneous fleets.

etcd (CoreOS, now CNCF): the registry behind Kubernetes Services. Watch-based; you subscribe to a key prefix and get a stream of changes. Strong consistency, Raft-replicated.

ZooKeeper (Apache): older, the original "coordination service". It offers low-level primitives (sequential nodes, ephemeral nodes) but is heavier to operate. Kafka used it for years before replacing it with KRaft, and many newer systems have moved off it as well.

Eureka (Netflix OSS): designed for AP — happy to return slightly stale data during partitions if it keeps clients running. Pairs with Ribbon on the client.


How CAP shows up in your service registry

A registry is a distributed system, so it picks consistency or availability.

CAP applies here. Strong-consistency registries (etcd, ZooKeeper) refuse to return stale data during partitions, so clients fail closed. AP registries (Eureka) keep returning the last known list, so clients fail open and might call dead instances.

The right pick depends on your topology. For Kubernetes etcd, AP would corrupt the control plane; CP is mandatory. For service discovery in user-facing traffic, sometimes "slightly stale" is dramatically better than "no answer" — and Eureka's bet is exactly that.
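The difference between failing open and failing closed fits in a few lines of client code. In real systems the behaviour comes from the registry itself; this sketch moves it into a hypothetical `DiscoveryClient` purely to make the two outcomes visible. `RegistryUnavailable`, `fetch`, and the addresses are invented for the example.

```python
class RegistryUnavailable(Exception):
    """Raised by the hypothetical fetch function when the registry cannot answer."""


class DiscoveryClient:
    def __init__(self, fetch, fail_open: bool) -> None:
        self._fetch = fetch            # callable: service name -> list of addresses
        self._fail_open = fail_open
        self._last_known: dict[str, list[str]] = {}

    def lookup(self, service: str) -> list[str]:
        try:
            instances = self._fetch(service)
        except RegistryUnavailable:
            if self._fail_open and service in self._last_known:
                # AP-style: serve the last known list, which may include dead instances.
                return self._last_known[service]
            # CP-style: no answer rather than a possibly wrong one.
            raise
        self._last_known[service] = instances
        return instances


state = {"partitioned": False}


def fetch(service: str) -> list[str]:
    if state["partitioned"]:
        raise RegistryUnavailable(service)
    return ["10.0.0.42:8080", "10.0.0.43:8080"]


ap_client = DiscoveryClient(fetch, fail_open=True)
cp_client = DiscoveryClient(fetch, fail_open=False)
ap_client.lookup("orders")
cp_client.lookup("orders")
```

Once `state["partitioned"]` is set to `True`, `ap_client.lookup("orders")` keeps returning the old list while `cp_client.lookup("orders")` raises.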


The deploy that left dead IPs in rotation

Staleness is not a footnote. Here it is, minute by minute.

A composite failure, assembled from several real postmortems. An orders service runs six instances behind nginx. The upstream block is rendered by consul-template: when the Consul catalog changes, the template re-renders and nginx reloads. A routine afternoon deploy rotates the autoscaling group — six new instances up, six old ones down.

Four of the old instances shut down cleanly: the SIGTERM handler deregisters from Consul before exit, and they leave the catalog immediately. Two do not — the orchestrator gives them ten seconds and then kills them hard, before the deregistration call goes out. The registry still lists them as passing. The health check runs every ten seconds and needs three consecutive failures before marking an instance critical, so for roughly thirty seconds the catalog is confidently wrong: two of the "healthy" IPs answer nothing.
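The clean-shutdown path looks roughly like the handler below. `Catalog` is an in-memory stand-in and `orders-1` a made-up instance ID; a real handler would call the registry's deregistration endpoint and would need to finish inside the orchestrator's grace period.

```python
import signal


class Catalog:
    """Hypothetical in-memory stand-in for the registry's catalog."""

    def __init__(self) -> None:
        self.registered: set[str] = set()


def make_sigterm_handler(catalog: Catalog, instance_id: str):
    """Build a handler that deregisters the instance, then exits."""

    def handler(signum, frame):
        catalog.registered.discard(instance_id)   # leave the catalog first
        raise SystemExit(0)                       # then let the process exit

    return handler


catalog = Catalog()
catalog.registered.add("orders-1")
handler = make_sigterm_handler(catalog, "orders-1")
signal.signal(signal.SIGTERM, handler)   # must be called from the main thread
```

A hard kill (SIGKILL) never reaches this handler, which is why the two slow instances stay in the catalog.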

Now the consumers. consul-template notices the change, waits out its quiescence window (a few seconds, to batch flapping), renders the new upstream block, and reloads nginx — five to fifteen seconds from change to traffic, for the entries the registry knows about. The two dead-but-passing instances stay in the rendered config until the health check catches up, so nginx keeps sending a quarter of requests to IPs that refuse connections. With proxy_next_upstream configured, those requests retry another backend and users see a latency blip. Without it, they see 502s.
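Retrying on connection failure is simple enough to sketch. The function below shows the general idea behind trying the next upstream; it is written in Python for illustration and is not nginx's implementation. The addresses and the `send` callable are hypothetical.

```python
from typing import Callable


def call_with_failover(backends: list[str], send: Callable[[str], str]) -> str:
    """Try each backend in turn, moving on only when the connection itself fails."""
    last_error = None
    for backend in backends:
        try:
            return send(backend)
        except ConnectionError as exc:   # includes ConnectionRefusedError
            last_error = exc
    raise ConnectionError("all backends failed") from last_error


dead = {"10.0.1.7:8080"}   # hard-killed, still listed as passing


def send(backend: str) -> str:
    if backend in dead:
        raise ConnectionRefusedError(backend)
    return f"200 OK from {backend}"


reply = call_with_failover(["10.0.1.7:8080", "10.0.2.11:8080"], send)
```

A refused connection means the request never reached an application, so this retry is safe even for non-idempotent requests; retrying after a timeout is a different and riskier decision.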

Clients that resolve the registry's DNS directly do worse. The records carry a low TTL, but the JVM's resolver caches positive lookups for thirty seconds by default (indefinitely on some older defaults), and there may be an nscd or node-local DNS cache in between. Their staleness window is the registry's window plus every cache on the path.

An Envoy-style consumer behaves differently in both directions. Endpoints arrive by xDS push, so propagation is sub-second instead of render-and-reload, and outlier detection ejects the dead endpoints after a few consecutive failures without waiting for the registry at all. The caveat is the panic threshold: if too large a fraction of a cluster looks unhealthy at once — which is exactly what a fast rolling deploy looks like — Envoy assumes the health data itself is broken and routes to all endpoints, dead ones included.
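The interaction between ejection and the panic threshold can be modelled in a few lines. This is a toy model, not Envoy's algorithm: the `eject_after` and `panic_threshold` values are illustrative assumptions, and real outlier detection has further limits, such as a cap on how many hosts may be ejected.

```python
def routable_endpoints(
    endpoints: list[str],
    consecutive_failures: dict[str, int],
    eject_after: int = 5,
    panic_threshold: float = 0.5,
) -> list[str]:
    """Return the endpoints that would receive traffic in this simplified model."""
    if not endpoints:
        return []
    healthy = [e for e in endpoints if consecutive_failures.get(e, 0) < eject_after]
    if len(healthy) / len(endpoints) < panic_threshold:
        # Too few healthy hosts: distrust the health data and use everyone.
        return list(endpoints)
    return healthy


endpoints = [f"10.0.1.{i}:8080" for i in range(1, 9)]

two_dead = {endpoints[0]: 5, endpoints[1]: 5}
normal = routable_endpoints(endpoints, two_dead)      # the two dead hosts are ejected

five_dead = {e: 5 for e in endpoints[:5]}
panicked = routable_endpoints(endpoints, five_dead)   # below threshold: all eight again
```

With two of eight endpoints dead the model routes around them; with five of eight dead it falls back to routing everywhere, dead endpoints included.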

The takeaway: staleness is a budget, and it sums. Health-check interval times failure threshold, plus propagation to each consumer, plus reload time, plus client DNS caches. Compute the sum and you know your worst minute. Then shrink the largest term — deregistering on SIGTERM beats faster health checks, and retry-on-connect-failure beats both.

# Staleness budget for one hard-killed instance, worst case
 0s   instance killed; deregistration never sent
30s   3 x 10s health checks fail -> marked critical in the catalog
35s   consul-template quiescence elapses, template re-renders
36s   nginx reload picks up the new upstream block
60s   JVM-cached DNS answers expire on direct-resolving clients (30s catalog + 30s JVM cache; this path skips nginx)
      -> about a minute of connection-refused, from one instance
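The arithmetic for the nginx path is small enough to script, which makes it easy to rerun with your own numbers. `staleness_budget` and its parameter names are made up for this sketch; the inputs are the ones from the scenario above.

```python
def staleness_budget(
    check_interval_s: float,
    failure_threshold: int,
    propagation_terms_s: dict[str, float],
) -> dict[str, float]:
    """Return the cumulative time at which each stage catches up, in order."""
    elapsed = check_interval_s * failure_threshold
    timeline = {"marked critical": elapsed}
    for stage, seconds in propagation_terms_s.items():
        elapsed += seconds
        timeline[stage] = elapsed
    return timeline


nginx_path = staleness_budget(
    check_interval_s=10,
    failure_threshold=3,
    propagation_terms_s={"template re-rendered": 5, "nginx reloaded": 1},
)
```

The health-check term dominates here, which is why deregistering on shutdown pays off more than tuning the reload.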

Service registries compared: Consul, etcd, ZooKeeper, Eureka, Kubernetes

Five registries, five trade-offs.

Consul · HashiCorp 2014
Service-discovery-first. CP via Raft. Built-in health checks, multi-datacenter replication, ACL system, DNS interface. The most full-featured of the dedicated registries; HashiCorp's commercial tier (Consul Enterprise) adds federation, namespace isolation, network segmentation. Used by SoundCloud, Roblox, Spaceship.
etcd · CoreOS 2013, CNCF 2018
The control plane datastore behind Kubernetes. CP via Raft. Lower-level than Consul: no built-in health checks, no DNS interface; you build service discovery on top. Performance-tuned for the watch protocol (5,000+ writes/sec, millions of reads). Runs a single Raft group per cluster; the etcd documentation suggests keeping clusters to about seven members or fewer.
Apache ZooKeeper · 2008
The grandparent of the modern registry world. CP via ZAB (Paxos-family). Used by Kafka, HBase, HDFS, Solr, Druid for cluster coordination — not so much for direct service discovery anymore. Still in production at LinkedIn, Yahoo, Twitter (where it was famously deployed early). API is harder to use correctly than Consul or etcd; clients are lower-level.
Netflix Eureka · 2012
The famous AP registry. Returns the last-known list during partitions; clients fail open. Not actively developed since Netflix's own move to an Envoy-based service mesh; still ships with Spring Cloud. The historical case study for "AP is the right answer for some user-facing service discovery."
Kubernetes Services + CoreDNS
If you're already on Kubernetes, you have service discovery for free. The Service object is the registry; etcd is the durable store; CoreDNS resolves names; kube-proxy programs the kernel for routing. No additional registry to operate.

Picking by ecosystem. Already on Kubernetes? Use Services. Need cross-platform discovery (VMs + containers + bare metal)? Consul. Building on Kafka or one of its cousins? You already have ZooKeeper or KRaft. Building net-new on the JVM with Spring? Eureka still works. Need raw, low-level coordination primitives for a custom system? etcd directly.


How three real companies do service discovery

What real shops use.

Netflix — Eureka, then service mesh. Eureka was Netflix's own service discovery for over a decade — the published reason for choosing AP was "during a region failure, returning a stale instance is better than returning nothing." Around 2018-2020 Netflix migrated most internal traffic to a service-mesh model (custom Envoy data plane), where service discovery happens via xDS push from a central control plane rather than a per-instance lookup.

Lyft — Envoy + xDS. Lyft (~5,000 microservices) runs Envoy as a per-host proxy with xDS-based service discovery. The control plane holds the registry and pushes endpoint updates to every Envoy as the cluster changes. The architectural rationale: a service-mesh data plane already sits in the request path, so making it the discovery client is "free."

Airbnb — SmartStack, then service mesh. Airbnb's home-grown SmartStack was an early (~2013) implementation of client-side discovery via local HAProxy and Synapse. They migrated to Istio + Envoy in the late 2010s. The published lesson: client-side discovery libraries inevitably reinvent service-mesh features (retries, circuit breakers, observability) — once you've built that, you have a service mesh anyway, but worse.



A closing note

In Kubernetes most of this is invisible — the Service is the abstraction, the controller is the registrar, kube-proxy or the CNI does the rest. It's still worth knowing the names underneath because the day something goes wrong, you'll be debugging Endpoints, EndpointSlices, kube-proxy iptables rules, and CoreDNS — and at that point the labels matter.
