Sub-page 06 · for operators + infra engineers

Kubernetes internals · etcd

One key-value store,
three nodes, no compromises.

Every object you have ever applied to a Kubernetes cluster (every Pod, every Secret, every Lease, every CRD instance) is, at the bottom of the stack, a protobuf-encoded value at a registry path inside a distributed key-value store called etcd. The api-server is its only legitimate client. Lose etcd and you have lost the cluster's mind.

This page is the operator-grade tour: Raft, MVCC, watch streams, leases, compaction, the practical 1.5 GB ceiling, snapshot and restore, encryption at rest, and the 3-vs-5 master decision. Roughly 4,400 words. Pair it with the architecture sub-page for where etcd sits in the larger picture, and the read/write quorum simulator for the intuition behind why three is the smallest useful number.

Why etcd, when there were so many alternatives.

Step into the Kubernetes design discussions of 2014 and you will find that the choice of storage backend was not obvious. The team needed a place to put cluster state: a few hundred megabytes of small objects, mutated thousands of times per minute, watched by every component in the system. PostgreSQL was the most mature option. ZooKeeper had a decade of production at scale. Consul was new but credible. CockroachDB was just being designed. Cassandra existed and people were using it for similar workloads. The team picked etcd, which at the time was a young project from CoreOS with one user and a five-page design doc, and the choice has aged extremely well. It is worth understanding why.

The first filter is consistency. Kubernetes' control loop pattern depends on optimistic concurrency: a controller reads an object, modifies it, writes it back with the previous revision number, and trusts the storage layer to reject the write if anyone else got there first. This is unworkable on an eventually-consistent store. Cassandra is out. DynamoDB without transactions is out. Anything where two readers can see different versions of the same key for more than a few milliseconds is out. You need linearisable reads and writes (a single global ordering of operations that every client agrees on), and there are essentially three families that give you this: Paxos-derivative systems (Spanner, Chubby), Raft-based systems (etcd, Consul), and traditional single-master SQL (Postgres with synchronous replication).

The second filter is the watch model. Kubernetes is built on watches; every controller is a program that subscribes to a stream of state changes and reacts. PostgreSQL has logical replication and LISTEN/NOTIFY, but neither was designed for fan-out to thousands of clients each holding long-lived connections, and the LISTEN payload limits make it awkward for object-shaped events. ZooKeeper has watches but they are one-shot (fire once and you have to re-register), which forces the client to re-list on every event, a quadratic disaster at Kubernetes scale. etcd's v3 API has streaming watches with explicit ordering by revision number, exactly the primitive Kubernetes needed.

The third filter is operational simplicity. Kubernetes wanted a backend that could be packaged as a single static binary, run in a container, configured by flags, and operated by a small team. Postgres is a wonderful database but its operational footprint (backups, replication, failover, role management, extension management, vacuuming) is a full-time job. ZooKeeper is JVM and comes with all the tuning that implies. Consul bundles service discovery features Kubernetes did not need. etcd is a static Go binary. You start it with a few flags, point its peers at each other, and it runs. The bet was that operational simplicity in 2014 would still matter when clusters had grown a hundred-fold, and that bet has paid out.

The fourth filter, less talked about, is licensing and governance. CoreOS donated etcd to the CNCF early; it is Apache 2.0, no contributor agreement traps, no rug-pull risk. CockroachDB switched to a non-OSS licence in 2019. ZooKeeper is Apache and fine, but its development cadence had slowed. Long-term governance matters when you are about to make a project the universal dependency of an entire industry's infrastructure. etcd's choice locked in the right shape.

The trade you accept by choosing etcd: small total dataset (gigabytes, not terabytes), small value sizes (kilobytes, not megabytes), and a hard ceiling on QPS that scales with disk fsync latency, not with cores. Kubernetes lives well inside those limits. If your application does not, do not put it in etcd; put it in a real database and store a pointer in etcd.

Raft, in five minutes: leader, log, commit.

Raft is the consensus protocol etcd uses to keep three (or five, or seven) replicas of the same key-value store in agreement about every write. Diego Ongaro and John Ousterhout published the paper in 2014 with the explicit goal of being more understandable than Paxos, and the goal was achieved: a working Raft implementation fits in maybe two thousand lines of Go. You can read the whole protocol once and feel like you understand it. The full subtlety lives in the corner cases, but the headline ideas are three: there is always exactly one leader, the leader is the only one who appends to the log, and a log entry is committed once a quorum of peers have written it to their own log.

The protocol runs as a state machine where each node is either a Follower, a Candidate, or a Leader. Followers start a randomised election timeout (typically 150 to 300 ms) and if they do not hear from a leader within that window they promote themselves to Candidate, increment a monotonically-rising term number, vote for themselves, and ask every peer for a vote. A peer grants a vote at most once per term, and only if the candidate's log is at least as up-to-date as the peer's own. The first candidate to collect votes from a quorum becomes the Leader for that term. Leaders send heartbeats every 50 ms or so to suppress further elections. Lose the leader, the timeout fires, a new election runs, you are back to a stable leader within a few hundred milliseconds.

Once a leader exists, all writes go to it. Clients that hit a follower are redirected. The leader appends every write to its own log, replicates the entry to every follower over the peer connection (etcd uses port 2380 for this), and waits for a quorum of followers to acknowledge before marking the entry committed. Committed entries are then applied to the local state machine (for etcd, this means putting the value into the MVCC tree) and a response is returned to the client. The safety property Raft guarantees is that once an entry is committed, no future leader will ever overwrite it; a candidate whose log is missing committed entries cannot win an election, because peers refuse to vote for it.

The thing operators most often get wrong about Raft is the meaning of quorum. Quorum is the smallest majority of a static membership. In a 3-member cluster, quorum is 2. In a 5-member cluster, quorum is 3. The cluster cannot make progress (cannot accept writes, cannot elect a leader) without quorum. This is why even-numbered cluster sizes are pathological: a 4-member cluster needs 3 to make quorum, exactly the same as a 3-member cluster, but has a larger attack surface (any one of four can fail you), more network chatter, more write latency. There is no operational reason ever to run an even number of etcd members. Always 3. Sometimes 5. Rarely 7.

Raft's safety properties are sometimes summarised as five invariants: election safety (at most one leader per term), leader append-only (a leader never overwrites or deletes log entries), log matching (if two logs share an entry at index i with the same term, all entries up to i are identical), leader completeness (committed entries appear in the log of all future leaders), state machine safety (if any node has applied an entry at index i, no other node ever applies a different entry at i). The proof these all hold simultaneously is most of the original paper. The implication for you as an operator is that, given a healthy quorum, etcd will never lose an acknowledged write, and never serve stale data on a linearisable read.

Watch the protocol with your own eyes by setting --logger=zap --log-level=debug on a small etcd cluster and pulling the leader's plug. You will see the followers' election timers fire (roughly 150-300ms after the last heartbeat), a candidate emerge, the term number bump, votes flow, and a new leader assume the role within a second. The api-server, watching etcd, will see exactly one transient context deadline exceeded on the in-flight write, retry, and continue. Most of the time you do not even notice the failover happened.

Hard rule: Raft tolerates floor((N-1)/2) failures. With N=3 that is 1 failure. With N=5 that is 2. The read/write quorum simulator and the CAP theorem simulator let you see this graphically. Crash one of three, the cluster keeps running. Crash two of three, the api-server starts returning 5xx for writes within seconds.

MVCC: every key, every revision, indexed.

Underneath the consensus layer, etcd's storage is a multi-version concurrency-control tree. Every write is tagged with a globally-monotonic revision number, and the previous version of the key is not overwritten. It is kept around, alive, until compaction explicitly retires it. Reads can ask for the current state or for the state as of some past revision; watches can subscribe from any past revision and replay every change forward. Internally the storage is a B-tree indexed not by raw key but by the pair (key, mod_revision), which is what gives etcd both linearisable reads and a complete change-log surface.

Three revision-shaped numbers travel with every key. The create_revision is the revision at which the key was first written; it never changes for the lifetime of the key. The mod_revision is the revision of the most recent put; this is what Kubernetes surfaces as resourceVersion on every object. The version is a per-key counter that increments on every put and resets to zero when the key is deleted and recreated. Together they let you reason about causality: a controller that reads a key at mod_revision = 487293 and writes it back can include that number in an etcd transaction, and etcd will refuse the write if the key's current mod_revision is anything else.

That last sentence is the foundation of optimistic concurrency in Kubernetes. When kubectl tells you "the object has been modified; please apply your changes to the latest version and try again", what has happened is that the api-server's etcd transaction included a precondition mod_revision == X, etcd's MVCC layer saw the current value was actually X+1, the transaction failed, and the api-server returned a 409. The whole thing happens in one round trip and never holds a lock. The price you pay is that controllers must be ready to retry; the upside is that the system never deadlocks and never serialises everything through a single lock manager.

The MVCC tree is also what makes watches efficient. When the api-server opens a watch on /registry/pods/ with start_revision=487300, etcd does not need to re-list the entire pods prefix; it walks the MVCC tree from revision 487300 forward and streams every put or delete that touched a key under that prefix in order. The watch is essentially a tail of the change log, bounded only by what compaction has retired. This is why watches in Kubernetes are cheap and why every controller can afford to keep one open: the work etcd does per watcher per change is proportional to the change, not to the size of the dataset.

# Inspecting an object's revisions directly with etcdctl
$ ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/apiserver-etcd-client.crt \
    --key=/etc/kubernetes/pki/apiserver-etcd-client.key \
    get /registry/pods/prod/web-7d8 -w json | jq '.kvs[0]'
{
  "key": "L3JlZ2lzdHJ5L3BvZHMvcHJvZC93ZWItN2Q4",    # base64 of the path
  "create_revision": 487291,
  "mod_revision":    487293,    # this is the resourceVersion you see in kubectl
  "version":         3,
  "value": "\\x00\\x00k8s\\x00\\nv1\\x12Pod..."   # protobuf-encoded Pod
}

# Reading a key as it was 200 revisions ago — only works before compaction
$ etcdctl get /registry/pods/prod/web-7d8 --rev=487293 -w json
# Returns the value at that revision, with version=2.

Two operational consequences flow from MVCC. First, etcd's disk usage grows with the change history, not with the live key count. A cluster with 10,000 pods and one mutating controller per pod can produce a hundred million revisions in a week. Without compaction, the database file grows linearly. Second, expensive list-from-the-past queries (the kind a forensic analyst might want) are bounded by what compaction has not yet erased. By default Kubernetes' kube-apiserver is configured to auto-compact every five minutes, retaining roughly an hour of history.

Worth internalising: the resourceVersion Kubernetes shows you is not per-object. It is a global cluster-wide counter. A Pod with resourceVersion: 487293 and a Deployment with resourceVersion: 487290 tells you, definitively, that the Pod was written after the Deployment. This is why watch-from-resourceVersion works.

Watch streams: the change-log every controller subscribes to.

A watch in etcd is a long-lived gRPC stream. The client opens a single bidirectional stream and sends a WatchCreateRequest for one or more key ranges, optionally with a starting revision. etcd replies with a stream of WatchResponse messages, each carrying zero or more Event records (PUT or DELETE), in revision order, until the client cancels or the stream breaks. Multiple watches multiplex over the same TCP connection. A well-behaved client opens one connection to etcd and drives all its watches over it.

The api-server uses watches in two layered ways. The first is the obvious one: the api-server itself opens a watch on every resource type (pods, services, deployments, leases, and so on) to feed an in-memory cache of recent state. The second is more subtle. Each Kubernetes client that opens a watch against the api-server is not, in general, getting an etcd watch directly; it is getting a watch from the api-server's cache, which is itself sourced from etcd. The api-server fans out a single etcd subscription to thousands of client watches. This is why a cluster with 10,000 controllers does not put 10,000 watches on etcd; it puts roughly one watch-per-resource-type, and the api-server multiplexes.

Bookmarks are the device that keeps long watches resumable. By default, etcd does not send anything to a watcher when nothing relevant has changed. If your watch sits idle for an hour, the most recent revision number you saw might be far behind the cluster's current revision, and if the connection breaks, you do not know how to resume cleanly. The WatchProgressRequest primitive solves this by asking etcd to emit a revision-only checkpoint event periodically. Kubernetes surfaces these as BOOKMARK events on the wire, with a header revision but no object body. A controller that processes bookmarks updates its resourceVersion cursor and, on reconnect, resumes from a much more recent point, usually short enough that the api-server's watch cache still has the events.

When the cache does not have the events (the cluster is too busy, the cache window has rotated past your cursor, the network was out for too long) you get the 410 Gone response. The contract on 410 is unambiguous: throw away your local cache, do a full List with no resourceVersion argument to get the current global revision back, then open a new Watch from there. Every well-written controller is built around this relist-on-410 cycle. If your custom controller does not handle 410 correctly, it will silently drift after the first network blip; the symptom is "the controller eventually stops reconciling new changes" and the cause is almost always a swallowed 410.

# A raw watch from kubectl, showing wire-level events
$ kubectl get pods --watch -o json --output-watch-events=true \
    --resource-version=487291

{"type":"ADDED","object":{"kind":"Pod","metadata":{"name":"web-7d8","resourceVersion":"487292"...}}}
{"type":"MODIFIED","object":{"kind":"Pod","metadata":{"name":"web-7d8","resourceVersion":"487293"...}}}
# … several seconds idle, then a BOOKMARK
{"type":"BOOKMARK","object":{"kind":"Pod","metadata":{"resourceVersion":"487310"}}}
# Your client should record 487310 even though no Pod actually changed.
{"type":"DELETED","object":{"kind":"Pod","metadata":{"name":"web-7d8","resourceVersion":"487420"...}}}

Two performance subtleties bite controller authors. First: a watch on /registry/pods/ with no namespace selector at scale is expensive, because every pod change anywhere in the cluster fires the watch. Use field selectors (spec.nodeName=N3, metadata.namespace=ns) to scope down. The api-server's watch cache uses indexes for the common selectors and avoids scanning. Second: watches consume goroutines on the api-server. A noisy informer that opens and closes a watch every second piles up state. The remedy is the SharedInformer pattern from client-go, which holds exactly one watch per resource type per process and broadcasts events to all in-process consumers.

The watch cursor you keep is the cluster's resourceVersion at the moment of the most recent event you successfully processed — not at the moment you started processing it. If your handler crashes mid-processing, you re-deliver from the cursor you last committed. Idempotent reconciliation handles this naturally; non-idempotent code does not, and is therefore wrong.

Compaction, defragmentation, and the 1.5 GB ceiling.

etcd's MVCC tree never overwrites; every write is a new node tagged with a new revision. Without something to retire old revisions, the tree grows without bound, and a long-running cluster will eventually hit etcd's --quota-backend-bytes ceiling and start rejecting writes with mvcc: database space exceeded. That ceiling defaults to 2 GiB and is widely raised to 8 GiB in production, but the practical operating ceiling (the size at which your cluster is still healthy) is much lower. Most well-tuned Kubernetes etcd databases stay under 1.5 GB. Above that, compaction takes too long, defrag windows become disruptive, and snapshot transfers during member replacement creep over the keepalive timeout.

Two operations work in tandem to keep the database small. Compaction tells etcd it can throw away every revision older than some specific number, deleting the tombstones and the old key versions from the MVCC index. After compaction, those revisions can no longer be queried (a RANGE rev=X against a compacted X returns ErrCompacted) and the watch cache cannot serve resume requests from before the compaction point. The space, however, is not yet returned to the filesystem; the database file remains the same size, with the freed pages now marked free. Defragmentation is the second step: it walks the database file, copies the live pages into a new compact layout, and shrinks the file. Defrag is offline for the member being defragged (that member cannot serve reads or writes for the duration) so it must be run as a rolling operation, one member at a time, across a 3-or-5-node cluster.

Kubernetes configures the api-server to call etcd's Compact RPC automatically every five minutes, with a retention of roughly an hour by default (controlled by --etcd-compaction-interval). This auto-compaction handles the steady state. What it does not handle is defrag, which has never been auto-run because it is offline-per-member and an upgrade or operator misconfiguration could turn rolling defrag into a quorum loss. Operators run defrag on a schedule, often weekly, and most modern installers (kubeadm, kops, the major managed offerings) ship a CronJob or a systemd timer that does the rolling work safely. If you are running a self-built control plane, you must do this yourself; symptoms of skipping it include the database file growing past 4 GB while the live key count stays small.

The 1.5 GB practical limit comes from the interaction of three constraints. The boltdb backend etcd uses caches its B-tree pages in memory, and at sizes above two gigabytes the page cache becomes a meaningful fraction of the host's RAM, with corresponding GC pressure on the etcd process. Snapshot transfers during member replacement scale linearly with database size and default to a 5-minute deadline; a 4 GB database on a slow network will hit it. And, perhaps most importantly, recovery time after a leader crash is dominated by the new leader replaying its WAL into memory, which is also linear in size.

# Inspecting database health
$ etcdctl endpoint status -w table --cluster
+--------------------------+------------------+---------+---------+-----------+----+
|         ENDPOINT         |        ID        |  VERSION| DB SIZE |  IS LEADER| RAFT TERM
+--------------------------+------------------+---------+---------+-----------+----+
| https://10.0.1.10:2379   | 8e9e05c52164694d | 3.5.13  | 1.4 GB  |   true    | 18
| https://10.0.1.11:2379   | 91bc3c398fb3c146 | 3.5.13  | 1.4 GB  |   false   | 18
| https://10.0.1.12:2379   | fd422379fda50e48 | 3.5.13  | 1.4 GB  |   false   | 18
+--------------------------+------------------+---------+---------+-----------+----+

# Force a compaction to revision 487300 (everything older is purged)
$ etcdctl compact 487300

# Rolling defragmentation — one member at a time, never two
$ for ep in 10.0.1.10 10.0.1.11 10.0.1.12; do
    etcdctl --endpoints=https://$ep:2379 defrag
    sleep 30   # let the cluster re-stabilise before the next member
  done

# If you skipped compaction for too long, the database can be twice live size
# Defrag is the only way to shrink the file

Two operational gotchas to internalise. First, never run defrag on more than one member at a time. A 3-member cluster doing two simultaneous defrags has no quorum for the duration, and every write blocks; this has caused multi-minute control-plane outages in production. Second, if you ever see database space exceeded in etcd's logs, compaction is necessary but not sufficient. Defrag is required to actually free space, and the cluster is rejecting writes the entire time. The recovery is to manually compact to the current revision, then defrag each member in sequence, then verify the database file is back below the quota.

A lesser-known knob: --auto-compaction-mode=revision with --auto-compaction-retention=10000 tells etcd to compact every time the revision number advances by ten thousand, regardless of wall clock. On very write-heavy clusters this is more predictable than the time-based default, because it bounds the database size proportionally to writes-since-last-compaction.

Leases: TTLs that bind keys to liveness.

A lease in etcd is a primitive almost always misunderstood. It is not a lock. It is not a lock-with-timeout. It is a TTL handle that you can attach to one or more keys, such that when the lease expires (because the holder failed to keep it alive) every key bound to it is automatically deleted. You create a lease with a TTL, you put keys with --lease=ID, and you keep it alive with periodic KeepAlive RPCs. Lose the holder, lose the leases. Lose the leases, lose the keys. This is exactly the primitive Kubernetes uses to express liveness for several cluster-internal mechanisms.

The most visible use of leases in Kubernetes is leader election. Every leader-elected component (kube-controller-manager, kube-scheduler, the cloud-controller-manager, every operator that runs in HA mode) uses a Kubernetes Lease object (the coordination.k8s.io/v1 resource, distinct from but conceptually similar to the underlying etcd lease primitive). The candidate writes its identity into the Lease, sets a leaseDurationSeconds, and updates the renewTime field every RenewDeadline / 2. If it stops renewing, other replicas notice the renewal stopped and challenge for the lease after the duration elapses.

The second use, less visible but more pervasive, is node heartbeats. Pre-1.13 every kubelet wrote the entire Node status object to the api-server every ten seconds, kilobytes of YAML, parsed and persisted, just to say "I am still alive". The Lease-based heartbeat replaced this with a tiny Lease object per node in the kube-node-lease namespace, updated every ten seconds with a single timestamp field. On a 5,000-node cluster the difference was an order of magnitude in api-server write QPS; the upgrade is one of the largest scalability wins Kubernetes has ever shipped.

The third use is endpoint membership for headless Services and the Endpoints/EndpointSlice machinery, where leases tie the existence of an EndpointSlice entry to the liveness of the controller that produced it. If the EndpointSlice controller crashes and never comes back, its slices expire and stop directing traffic to dead pods. The same pattern applies to the ServiceAccountTokenRequest controller, the resource quota controller's reservation tracking, and a dozen smaller subsystems.

# Lease lifecycle from the etcdctl perspective
$ etcdctl lease grant 30
# lease 326963a02758b203 granted with TTL(30s)

$ etcdctl put /myapp/leader node-A --lease=326963a02758b203
# OK

$ etcdctl get /myapp/leader
/myapp/leader
node-A

# Keep the lease alive — runs forever, sending KeepAlive every TTL/3
$ etcdctl lease keep-alive 326963a02758b203
# lease 326963a02758b203 keepalived with TTL(30)
# lease 326963a02758b203 keepalived with TTL(30)
# …

# If the holder process dies and stops calling keep-alive,
# 30 seconds later etcd auto-deletes /myapp/leader.

# Inspecting the Kubernetes leader-election lease
$ kubectl -n kube-system get lease kube-controller-manager -o yaml
spec:
  holderIdentity:       controller-manager-7d8x9_8a3f...
  leaseDurationSeconds: 15
  acquireTime:          "2026-05-03T10:14:22.000Z"
  renewTime:            "2026-05-03T10:18:51.000Z"

The subtlety that bites people is the difference between a Kubernetes Lease (the API object) and an etcd lease (the storage primitive). The Kubernetes Lease object is just a regular API resource (it lives in etcd as protobuf, like every other object) but it does not actually use the underlying etcd lease primitive. The TTL semantics are implemented at the api-server layer: a controller updates renewTime periodically, and the lease-acquisition code on the other replicas inspects the timestamp and decides whether the lease has expired. This means the Kubernetes Lease's failure detection is bounded by the api-server, the network between candidates and the api-server, and the controllers' own wall-clock skew — not by etcd's TTL machinery directly.

The deep etcd lease primitive does get used in one place: by the api-server's ServiceAccountToken bound-token mechanism, where a token's validity is tied to the existence of the requesting Pod via an etcd lease. Delete the Pod, the lease releases, the token-binding reflection of it expires.

If your custom operator implements its own leader election, do not write your own keep-alive loop. Use k8s.io/client-go/tools/leaderelection with the Lease resource lock. The defaults (LeaseDuration 15s, RenewDeadline 10s, RetryPeriod 2s) encode years of production tuning. The most common bug in homegrown leader election is setting RenewDeadline too close to LeaseDuration and losing leadership while still believing you are the leader.

Operations: backup, restore, encryption, and the 3-vs-5 decision.

Operating etcd in production is not difficult, but it is unforgiving. The data it holds is irreplaceable (every API object, every Secret, every certificate, every config) and an etcd disaster makes the api-server immediately useless. Recovery from a backup is straightforward when you have practised it, and a horror movie when you have not. The operations work splits into four headings: sizing the cluster, taking snapshots, restoring from snapshots, and configuring encryption-at-rest.

The cluster sizing decision is the first thing you do and almost never the thing you change. The short answer is "always 3, sometimes 5, never even". The reasoning, in table form:

Members	Quorum	Tolerates	Verdict	Notes
1	1	0	dev only	no HA, single fsync hiccup stalls writes
2	2	0	never	one death = no quorum, worse than a single node
3	2	1	production default	survives one member; rolling upgrades stay at quorum
4	3	1	never	tolerates the same as 3, costs more, larger fault surface
5	3	2	large clusters	survives two; cross-AZ deployments; higher write latency
7	4	3	rare	extremely high availability targets, pays in write throughput

The 3-vs-5 trade-off is real and worth understanding. A 3-member cluster tolerates one failure and is the right answer for most production deployments. You can do rolling upgrades by taking one member out at a time without losing quorum, and the write latency is dominated by the slowest of two members rather than three. A 5-member cluster tolerates two failures, which matters for cross-AZ deployments where you want to survive an entire AZ going dark plus one stray failure on top of it. The cost is write latency: every write must wait for three acknowledgements rather than two, and the slowest-of-three is meaningfully worse than the slowest-of-two on tail-latency days. Seven members exist on paper but I have never met someone who runs that in production for Kubernetes; the marginal availability gain is dwarfed by the operational cost.

Backups are taken with etcdctl snapshot save, which produces a single binary file containing the entire database state at the snapshot revision. Snapshots can be taken from any member; etcd internally streams the consistent state at the current revision and writes it out. The snapshot is restored not into the running cluster but into a fresh, single-node cluster. You initialise that node from the snapshot, then add additional members one at a time to grow back to 3 or 5. The api-server is then re-pointed at the new cluster. The whole procedure can be done in under thirty minutes if you have practised it.

# Take a snapshot — run on any single etcd member, daily at minimum
$ ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/apiserver-etcd-client.crt \
    --key=/etc/kubernetes/pki/apiserver-etcd-client.key \
    snapshot save /backups/etcd-$(date +%Y%m%d-%H%M).db

# Snapshot saved at /backups/etcd-20260503-1430.db

# Verify the snapshot is internally consistent
$ etcdctl snapshot status /backups/etcd-20260503-1430.db -w table
+----------+----------+------------+------------+
|   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| a8e2f9b1 | 487293   | 142847     | 1.4 GB     |
+----------+----------+------------+------------+

# Restore — done on a fresh node, NOT into a running cluster
$ ETCDCTL_API=3 etcdctl snapshot restore /backups/etcd-20260503-1430.db \
    --name etcd-recovery \
    --initial-cluster etcd-recovery=https://10.0.1.20:2380 \
    --initial-advertise-peer-urls https://10.0.1.20:2380 \
    --data-dir /var/lib/etcd-recovery

# Now point your api-server at the new endpoint, restart it,
# and add additional members back with `etcdctl member add`.

Encryption-at-rest is the second mandatory piece. Out of the box, etcd stores values as plain protobuf on disk. Anyone who can read the database file (a stolen disk, a backup that leaked to S3, a compromised control-plane node) can read every Secret in your cluster. Kubernetes addresses this not with etcd's native encryption (which exists, but is awkwardly scoped) but with the api-server's EncryptionConfiguration mechanism. You configure the api-server with a list of providers (a KMS key from your cloud, an aescbc key from a file, an identity provider for unencrypted reads of legacy data) and the api-server transparently encrypts values on write and decrypts on read. etcd just sees ciphertext.

# /etc/kubernetes/encryption-config.yaml — referenced by --encryption-provider-config
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
      - configmaps
    providers:
      - kms:
          name: aws-kms
          endpoint: unix:///var/run/kmsplugin/socket.sock
          cachesize: 1000
          timeout: 3s
      - aescbc:
          keys:
            - name: key1
              secret: c2VjcmV0IGlzIHNlY3VyZQ==     # base64 of 32 random bytes
      - identity: {}                  # reads of legacy unencrypted values

The provider list is ordered. The first provider is used to encrypt new writes, but every provider is tried in turn during reads. This is what lets you migrate: add a new KMS provider as the first entry, run kubectl get secrets -A -o json | kubectl replace -f - to rewrite every Secret with the new key, then drop the older provider after the migration completes. The identity provider at the end is essential during initial migration from an unencrypted cluster. Without it, reads of pre-encryption data fail.

One more rule: test your restore. Take a snapshot, build a single-node etcd from it on a disposable VM, point a kubectl at the resulting api-server, and confirm you can list pods. Until you have done this once, you do not have a backup; you have a file. The first time someone runs the restore for real should not be the first time the procedure is exercised.

Keep going.

Architecture — eight processes, one storage primitive

Where etcd sits in the larger control plane, and why only the api-server is allowed to touch it.

The lifecycle of kubectl apply

From keystroke to running pod, including the etcd transaction in the middle.

Read ↗

Read/write quorum simulator

Crash one of three. Crash two of three. Watch what Raft permits and what it refuses.

Open ↑

Back to the internals index

All twelve sub-pages — five live, seven planned — and the system on one canvas.

Index

Found this useful?

One key-value store,three nodes, no compromises.

Why etcd, when there were so many alternatives.

Raft, in five minutes: leader, log, commit.

MVCC: every key, every revision, indexed.

Watch streams: the change-log every controller subscribes to.

Compaction, defragmentation, and the 1.5 GB ceiling.

Leases: TTLs that bind keys to liveness.

Operations: backup, restore, encryption, and the 3-vs-5 decision.

Further reading: etcd docs, Raft paper, encryption guide.

Keep going.

One key-value store,
three nodes, no compromises.