Sub-page 01 · for infra + operator authors

Kubernetes internals · Architecture

Eight processes,
one storage primitive.

Kubernetes is not a monolith. It is a set of small, single-purpose processes — five in the control plane, three on every node — bound together by exactly one shared dependency: a strongly-consistent key-value store called etcd, which only one of those processes is ever allowed to touch.

This page is a tour of the eight, in order of how data flows: where each process runs, on what port, with which protocol, and what failure mode you inherit by trusting it. Roughly 4,200 words. Pair it with the apply lifecycle sub-page for the request-trace view.

The control plane / data plane split.

Every distributed system that survives long enough grows a control plane: a small set of processes that decide what should happen, separate from the larger set of processes that actually do the work. Kubernetes is unusually disciplined about this division. The control plane is five processes — kube-apiserver, etcd, kube-scheduler, kube-controller-manager, and (in cloud installs) cloud-controller-manager — and they live on a small, odd-numbered set of dedicated nodes, classically three. The data plane is three processes per node — kubelet, kube-proxy, and a container runtime — replicated across however many machines you have to run workloads on, which in large fleets is thousands.

The split is enforced by the network and by credentials. Workload pods, by default, do not have credentials to talk to etcd or to bind to the api-server's privileged endpoints. Control-plane processes talk to each other over a private subnet with mTLS using a separate certificate authority from the one that signs node certificates. A pod that escapes to the host network on a worker node still cannot reach :2379 on a control-plane node unless an operator deliberately exposed it. The architecture is a series of one-way doors: the control plane reads the data plane through reports the kubelets push back, and writes the data plane only through the api-server.

This shape is not a Kubernetes invention. Borg, Omega, and the operating-system kernel that inspired all three follow the same pattern: a small kernel of authoritative state guarded by an access boundary, and a much larger fleet of stateless workers that obey instructions emitted from it. The reason it keeps reappearing is that it is the only way to make a distributed system reason about itself coherently. If two workers each had their own view of which container should run where, you would need a distributed agreement protocol between every worker; with a control plane, the agreement protocol runs once, in one place — etcd's Raft — and the workers simply follow.

The first hard rule of Kubernetes architecture, and the one most people get wrong on a whiteboard: the control plane is not the cluster. The cluster is the set of nodes running workloads. The control plane is a service that brokers their state. You can lose every control-plane node simultaneously and the workloads keep running — kubelet will keep its existing pods alive against its last-known spec, kube-proxy will keep its iptables rules in place, containers will keep serving traffic. What you lose is the ability to make any new decision: no new pods, no new rollouts, no rescheduling of evicted ones, no scaling. The control plane is the cluster's nervous system, not its body.

The corollary is that running the control plane on the same nodes as workloads is a category error. Most production installs taint control-plane nodes with node-role.kubernetes.io/control-plane:NoSchedule precisely so that the thing deciding where workloads go is not itself competing with workloads for CPU, memory, and file-descriptor pressure. Managed offerings (EKS, GKE, AKS) take this further and hide the control plane entirely; you only ever see the data plane.

Failure mode to internalise — if etcd loses quorum, the api-server returns 5xx for writes but kubelets keep running their pods. The cluster is "down" only in the sense that its mind has stopped; its body is still working.

The api-server is the only thing that matters.

If you remember one architectural fact about Kubernetes, make it this: everything goes through the api-server. Kubelet on a node does not read etcd. The scheduler does not read etcd. The controller-manager does not read etcd. They all talk to the api-server, and the api-server talks to etcd. This is enforced by the network — etcd's listener is bound to the control-plane subnet — and by the certificates — the etcd CA is separate from the api-server CA, and only api-server's client cert is signed by it.

The api-server is, in implementation, an HTTP server that exposes a REST surface backed by a generic storage layer. It is stateless. You can run three of them behind a TCP load balancer and clients can hit any one. Each instance keeps a watch cache in memory — a recent slice of the stream of changes from etcd — but the source of truth is etcd, and on a cold start it rebuilds the cache by issuing one big List from etcd at the current revision. Stateless replicas are why you can do rolling control-plane upgrades without taking the cluster down.

The api-server's job is six things in sequence, for every request: authenticate the caller, authorise the verb, run mutating admission webhooks, run validating admission webhooks, convert between API versions, and persist (or read) from etcd. The full pipeline is covered in the api-server sub-page; here the relevant piece is just that this is where every credential check, every quota check, every webhook, lives. You cannot bypass it. Your CI cannot bypass it. The cluster's own controllers cannot bypass it. There is no back door.
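As a rough illustration of "six things in sequence", the sketch below runs a request through an ordered chain of stages and stops at the first rejection. The Request type, the stage names, and the toy rules inside each stage are hypothetical; none of this is the api-server's real code.

```go
package main

import (
	"errors"
	"fmt"
)

// Request is a hypothetical, heavily simplified stand-in for an API request.
type Request struct {
	User       string
	Verb       string
	Labels     map[string]string
	Privileged bool
}

// Stage is one step of the pipeline; a non-nil error rejects the request.
type Stage struct {
	Name string
	Run  func(*Request) error
}

// stages returns the six steps in the order the text lists them.
// The checks inside each step are toy rules, not real Kubernetes policy.
func stages(store map[string]*Request) []Stage {
	return []Stage{
		{"authenticate", func(r *Request) error {
			if r.User == "" {
				return errors.New("no credentials")
			}
			return nil
		}},
		{"authorise", func(r *Request) error {
			if r.Verb != "create" && r.Verb != "get" {
				return errors.New("verb not allowed")
			}
			return nil
		}},
		{"mutating-admission", func(r *Request) error {
			if r.Labels == nil {
				r.Labels = map[string]string{}
			}
			r.Labels["injected"] = "true" // mutators may change the object
			return nil
		}},
		{"validating-admission", func(r *Request) error {
			if r.Privileged {
				return errors.New("privileged pods are forbidden")
			}
			return nil
		}},
		{"convert", func(r *Request) error { return nil }}, // version conversion elided
		{"persist", func(r *Request) error {
			store[r.User+"/"+r.Verb] = r // stand-in for the etcd write
			return nil
		}},
	}
}

// runPipeline executes the stages in order and stops at the first rejection.
// It returns the names of the stages that passed.
func runPipeline(ss []Stage, r *Request) ([]string, error) {
	var passed []string
	for _, s := range ss {
		if err := s.Run(r); err != nil {
			return passed, fmt.Errorf("%s: %w", s.Name, err)
		}
		passed = append(passed, s.Name)
	}
	return passed, nil
}

func main() {
	store := map[string]*Request{}
	passed, err := runPipeline(stages(store), &Request{User: "ci", Verb: "create"})
	fmt.Println(passed, err)

	passed, err = runPipeline(stages(store), &Request{User: "ci", Verb: "create", Privileged: true})
	fmt.Println(passed, err)
}
```

The second request is rejected at validating admission and never reaches the persist stage, which is the point of having a single front door.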

This single-front-door property is what makes the rest of the architecture possible. Because the api-server validates and serialises every change, every other component can subscribe to a well-defined stream of state changes (a watch) without re-implementing authentication, authorisation, or schema validation. Kubelet starts up, opens an HTTPS connection to :6443, sends a watch request scoped to its own spec.nodeName, and receives a long-lived stream of JSON events (a chunked response on HTTP/1.1, a single stream on HTTP/2): ADDED, MODIFIED, DELETED. That stream is the kubelet's entire input. It does not know what etcd is.

Concretely, the watch protocol looks like this on the wire. It is a bookmark-aware streaming request that keeps one connection open until it breaks or times out:

# GET /api/v1/pods?watch=true&allowWatchBookmarks=true&resourceVersion=487291 HTTP/2
{"type":"ADDED","object":{"kind":"Pod","metadata":{"name":"web-7d8","resourceVersion":"487292"...}}}
{"type":"MODIFIED","object":{"kind":"Pod","metadata":{"name":"web-7d8","resourceVersion":"487293"...}}}
# … connection idle for 90s, server sends a bookmark to checkpoint the cursor
{"type":"BOOKMARK","object":{"kind":"Pod","metadata":{"resourceVersion":"487310"}}}
# … MODIFIED for the next change, and so on, until the client disconnects

When a watch breaks (network blip, api-server rolling update, idle timeout), the client reconnects with the last resourceVersion it saw. The api-server replays from its watch cache if the version is still in scope, or returns 410 Gone if the cache has rotated past it, in which case the client does a fresh List + Watch and reconciles. This relist-on-410 pattern is why every well-written controller is idempotent: it has to be ready to be told the current state from scratch at any moment.
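The reconnect-or-relist behaviour can be sketched with an in-memory stand-in for the api-server. Everything here (fakeServer, its bounded event window, the client type) is hypothetical and only models the rule described above: resume from the last resourceVersion if it is still in the window, otherwise treat the answer as 410 Gone and rebuild from a fresh List.

```go
package main

import (
	"errors"
	"fmt"
)

var errGone = errors.New("410 Gone: resourceVersion too old")

// Event is a simplified watch event.
type Event struct {
	Type string // ADDED, MODIFIED or DELETED
	Name string
	RV   int
}

// fakeServer keeps current state plus a bounded window of recent events,
// loosely like the watch cache.
type fakeServer struct {
	rv      int
	objects map[string]int // name -> resourceVersion of last write
	window  []Event
	maxWin  int
}

func newFakeServer(maxWin int) *fakeServer {
	return &fakeServer{objects: map[string]int{}, maxWin: maxWin}
}

func (s *fakeServer) write(typ, name string) {
	s.rv++
	if typ == "DELETED" {
		delete(s.objects, name)
	} else {
		s.objects[name] = s.rv
	}
	s.window = append(s.window, Event{typ, name, s.rv})
	if len(s.window) > s.maxWin {
		s.window = s.window[len(s.window)-s.maxWin:]
	}
}

// list returns a snapshot and the revision it was taken at.
func (s *fakeServer) list() (map[string]int, int) {
	snap := make(map[string]int, len(s.objects))
	for k, v := range s.objects {
		snap[k] = v
	}
	return snap, s.rv
}

// watchSince replays events after rv, or fails if the window rotated past it.
func (s *fakeServer) watchSince(rv int) ([]Event, error) {
	if rv >= s.rv {
		return nil, nil
	}
	if len(s.window) == 0 || s.window[0].RV > rv+1 {
		return nil, errGone
	}
	var out []Event
	for _, e := range s.window {
		if e.RV > rv {
			out = append(out, e)
		}
	}
	return out, nil
}

type client struct {
	cache   map[string]int
	lastRV  int
	relists int
}

func newClient(s *fakeServer) *client {
	c := &client{}
	c.cache, c.lastRV = s.list()
	return c
}

// sync resumes the watch, falling back to a full relist on 410.
func (c *client) sync(s *fakeServer) {
	events, err := s.watchSince(c.lastRV)
	if errors.Is(err, errGone) {
		c.cache, c.lastRV = s.list()
		c.relists++
		return
	}
	for _, e := range events {
		if e.Type == "DELETED" {
			delete(c.cache, e.Name)
		} else {
			c.cache[e.Name] = e.RV
		}
		c.lastRV = e.RV
	}
}

func main() {
	s := newFakeServer(2) // window of two events, deliberately tiny
	c := newClient(s)
	s.write("ADDED", "web-a")
	s.write("ADDED", "web-b")
	c.sync(s) // both events still in the window: plain replay
	s.write("MODIFIED", "web-a")
	s.write("DELETED", "web-b")
	s.write("ADDED", "web-c")
	c.sync(s) // window rotated past the client's cursor: relist
	fmt.Println(c.cache, c.lastRV, c.relists)
}
```

Because the client's cache ends up correct whether it got there by replay or by relist, any handler driven from it has to tolerate being handed the full current state, which is the idempotency requirement in practice.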

Operational note: on very large clusters (5,000+ nodes, 100,000+ pods) you will see clients getting 410'd because the watch cache window is shorter than their reconnect time. Older releases sized the cache in objects per resource type via --watch-cache-sizes; in current releases the cache is sized dynamically and the only meaningful value for that flag is zero, which disables caching for a resource. The more durable fix is the streaming-list path (WatchList, KEP-3157), alongside consistent reads from cache (KEP-2340).

etcd persists everything. Nothing else persists anything.

etcd is a distributed, strongly-consistent key-value store implementing the Raft consensus protocol over a multi-version concurrency-control (MVCC) backend. In a Kubernetes cluster, etcd holds every API object (every Pod, every Deployment, every Secret, every ConfigMap, every Lease), serialised as protobuf for built-in types, keyed by a path that mirrors the URL of the resource. A pod called web-7d8 in namespace prod lives at /registry/pods/prod/web-7d8. The api-server's storage layer translates Get/List/Watch/Create/Update/Delete into etcd's Range, Watch, Put, Txn, and DeleteRange calls.
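The key layout is simple enough to sketch. The helper below is hypothetical and assumes the default /registry prefix and the resource/namespace/name shape used in the example above; keys for some API groups also embed the group name, which this ignores.

```go
package main

import (
	"fmt"
	"path"
)

// registryKey builds the etcd key for an object under the default
// "/registry" prefix. Pass an empty namespace for cluster-scoped resources.
// Simplification: keys for some API groups also embed the group name.
func registryKey(resource, namespace, name string) string {
	return path.Join("/registry", resource, namespace, name)
}

func main() {
	fmt.Println(registryKey("pods", "prod", "web-7d8"))
	fmt.Println(registryKey("namespaces", "", "prod"))
}
```

A List over one namespace then becomes a Range over the prefix /registry/pods/prod/, which is why wide Lists translate directly into wide etcd reads.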

This restriction, that only the api-server is allowed to talk to etcd, is the most consequential design decision in Kubernetes. It is not enforced by the storage layer; etcd would happily accept gRPC from anyone holding the right cert. It is enforced by certificate distribution and by the control plane's network topology. The api-server has a client certificate signed by the etcd CA; no other Kubernetes component does. Kubelet's certificate is signed by the cluster CA, which etcd will refuse. Even etcdctl on a control-plane node cannot talk to etcd unless it is explicitly handed the etcd CA and a client key pair signed by it, such as /etc/kubernetes/pki/etcd/ca.crt and apiserver-etcd-client.crt.

The reason for the restriction is twofold. First, schema and validation: etcd is a dumb byte store, and if every controller wrote directly to it, every controller would need its own copy of the validation logic, the conversion logic, the admission logic, and the audit logic. Second, and more important, observability and access control: a single front door means a single audit log, a single RBAC surface, a single quota enforcement point, a single rate-limit. There is exactly one place to add a webhook to forbid privileged pods; if anything could write to etcd directly, you would need to forbid it in N places and the next clever feature would forget one.

etcd's MVCC semantics are surfaced into Kubernetes as the resourceVersion field on every object. It is a global revision number, monotonically increasing across the entire cluster, that tags every successful write. Watch streams are ordered by it. Optimistic concurrency uses it: when a client does an Update, the api-server includes the object's resourceVersion as the expected-revision in an etcd transaction, and etcd refuses if anyone else has bumped the key in the meantime — surfaced to the client as 409 Conflict. Every time you have read about a controller "retrying on conflict", that is what is happening underneath.

# What etcd stores — protobuf-encoded Pod object at a registry-path key
$ ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/apiserver-etcd-client.crt \
    --key=/etc/kubernetes/pki/apiserver-etcd-client.key \
    get /registry/pods/prod/web-7d8 -w json | jq '.kvs[0].mod_revision'
# 487293
# That number is the same resourceVersion you see in `kubectl get pod -o yaml`.
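The compare-and-write that sits behind 409 Conflict can be sketched with a small in-memory store. The store type and retryOnConflict helper are hypothetical stand-ins, not etcd or client-go code; the only property they model is "the write succeeds only if the key's revision is still the one the caller read".

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errConflict = errors.New("409 Conflict: object was modified")

type entry struct {
	value  string
	modRev int64
}

// store is a toy MVCC-style map with one global, increasing revision.
type store struct {
	mu   sync.Mutex
	rev  int64
	data map[string]entry
}

func newStore() *store { return &store{data: map[string]entry{}} }

func (s *store) get(key string) entry {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.data[key]
}

// update writes value only if the key's modRev still equals expected.
// expected == 0 means "create": the key must not exist yet.
func (s *store) update(key, value string, expected int64) (int64, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.data[key].modRev != expected {
		return 0, errConflict
	}
	s.rev++
	s.data[key] = entry{value: value, modRev: s.rev}
	return s.rev, nil
}

// retryOnConflict re-reads and re-applies mutate until the write lands
// or the attempts run out.
func retryOnConflict(s *store, key string, attempts int, mutate func(string) string) error {
	for i := 0; i < attempts; i++ {
		cur := s.get(key)
		_, err := s.update(key, mutate(cur.value), cur.modRev)
		if err == nil {
			return nil
		}
		if !errors.Is(err, errConflict) {
			return err
		}
	}
	return errConflict
}

func main() {
	s := newStore()
	key := "/registry/pods/prod/web-7d8"
	rv, _ := s.update(key, "v1", 0)     // create
	_, err := s.update(key, "stale", 0) // writer holding an old revision
	fmt.Println(rv, err)

	err = retryOnConflict(s, key, 3, func(v string) string { return v + "+label" })
	fmt.Println(s.get(key).value, err)
}
```

The retry helper always re-reads before writing again; retrying the same stale write would simply conflict forever.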

Operationally, etcd is the most fragile component in the cluster. It needs low-latency disk (p99 fsync under 10ms; SSDs basically required), it dislikes co-tenancy with anything that generates IO contention, it refuses writes if its database grows past --quota-backend-bytes (default 2 GiB, often raised to 8 GiB), and it falls over on certain pathological workloads — large Secrets, very wide Lists, controllers that reconcile in tight loops. The Kubernetes failure modes that look like "the cluster is sluggish" are usually etcd disk latency.

A canonical control-plane disaster looks like this: a noisy controller (let us say a misbehaving operator that writes a 2KB annotation to every Pod every five seconds across 30,000 pods) generates 6,000 writes per second against etcd. That is 12 MB/s of annotation payload alone, and more once you count that every update rewrites the whole Pod object and is replicated to every member. etcd's disk fsync latency climbs from 2ms to 200ms. Raft heartbeats start missing their 100ms interval. A leader election fires. Other etcd writes queue. The api-server's writes time out. Webhook calls back to the api-server time out. The scheduler can no longer bind Pods. The cluster does not crash; it just stops making progress, and this is hard to distinguish from a network outage. The remediation is always the same shape: throttle the bad writer, force-defrag etcd, expand the disk.
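A back-of-envelope check on the scenario's numbers, assuming 2KB means 2,000 bytes. The function counts only the annotation payload; the bytes that actually hit etcd's disk are higher, because each update rewrites the whole object, passes through the write-ahead log, and is replicated to every member.

```go
package main

import "fmt"

// writeLoad returns writes per second and payload bytes per second when
// each of `objects` is rewritten once every intervalSeconds.
func writeLoad(objects int, payloadBytes int, intervalSeconds float64) (float64, float64) {
	writesPerSec := float64(objects) / intervalSeconds
	return writesPerSec, writesPerSec * float64(payloadBytes)
}

func main() {
	w, b := writeLoad(30_000, 2_000, 5)
	fmt.Printf("%.0f writes/s, %.1f MB/s of annotation payload\n", w, b/1e6)
	// prints "6000 writes/s, 12.0 MB/s of annotation payload"
}
```

The write rate, more than the byte rate, is what hurts: every one of those 6,000 writes per second needs a quorum round trip and an fsync.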

Backups are mandatory and a common gap. etcdctl snapshot save on a schedule, restored to a fresh quorum-sized cluster, is the only thing standing between you and a rebuild from kubectl manifests. Test the restore before you need it.

Scheduler and controller-manager — process boundaries.

The kube-scheduler and the kube-controller-manager are, in some sense, the cluster's two halves of decision-making. The scheduler answers exactly one question: which node should this pod run on? The controller-manager answers all the others — should a Deployment have more replicas? Should a Job retry? Should a Node be marked unhealthy? Should an endpoint be added to a Service? Splitting them into two binaries is a deliberate choice: scheduling is the most latency-sensitive operation in the cluster, and isolating it from the noisier reconciliation work means a hot loop of ReplicaSet adoption cannot starve pod placement.

Both are architecturally identical at the top level. Each runs as a leader-elected process that opens watches on the api-server, populates an in-memory cache (via an informer), filters events for the resources it cares about, and runs reconciliation loops. Neither writes to etcd; both write to the api-server. Both can be replaced wholesale. kube-scheduler is replaceable per-pod (set spec.schedulerName to point at a different scheduler binary, run multiple in parallel); the controller-manager is replaceable per-controller (you can disable the built-in HPA and ship your own).

The controller-manager is, internally, about thirty controllers stitched into one binary purely to save on memory and on duplicate informer caches. The list as of 1.30: Deployment, ReplicaSet, DaemonSet, StatefulSet, Job, CronJob, EndpointSlice, Service, Node, NodeLifecycle, ServiceAccount, ServiceAccountToken, ResourceQuota, LimitRange, Namespace, GarbageCollector, HorizontalPodAutoscaler, TTLAfterFinished, PV, PVC, PVProtection, PVCProtection, PersistentVolumeBinder, AttachDetach, Bootstrap, CertificateApproval, CertificateSigning, CSIDriver, CSINode, Cluster Role Aggregation. Each is a structural copy of the same pattern: informer + work queue + reconciler. The controller pattern sub-page goes deep on this; here the architectural observation is just that they share one process boundary.
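A minimal sketch of the informer + work queue + reconciler shape, with the informer reduced to a map of desired state. All types here are hypothetical toys; the real work queue in client-go also handles rate limiting and in-flight keys, which this omits. The point it illustrates is deduplication: three rapid changes to one object cost one reconcile.

```go
package main

import "fmt"

// workQueue is a toy deduplicating FIFO of object keys.
type workQueue struct {
	order  []string
	queued map[string]bool
}

func newWorkQueue() *workQueue { return &workQueue{queued: map[string]bool{}} }

// add enqueues key unless it is already waiting.
func (q *workQueue) add(key string) {
	if q.queued[key] {
		return
	}
	q.queued[key] = true
	q.order = append(q.order, key)
}

// get pops the oldest key, reporting false when the queue is empty.
func (q *workQueue) get() (string, bool) {
	if len(q.order) == 0 {
		return "", false
	}
	key := q.order[0]
	q.order = q.order[1:]
	delete(q.queued, key)
	return key, true
}

// cluster holds desired state (what the informer cache says) and actual
// state (what exists), both as key -> replica count.
type cluster struct {
	desired map[string]int
	actual  map[string]int
}

// reconcile is level-triggered: it ignores which event fired and only
// compares current desired state with actual state.
func (c *cluster) reconcile(key string) {
	want, ok := c.desired[key]
	if !ok {
		delete(c.actual, key)
		return
	}
	c.actual[key] = want
}

// drain processes the queue until empty and returns the reconcile count.
func drain(q *workQueue, c *cluster) int {
	n := 0
	for {
		key, ok := q.get()
		if !ok {
			return n
		}
		c.reconcile(key)
		n++
	}
}

func main() {
	c := &cluster{desired: map[string]int{}, actual: map[string]int{}}
	q := newWorkQueue()
	for _, replicas := range []int{3, 5, 4} { // three quick edits to one object
		c.desired["prod/web"] = replicas
		q.add("prod/web")
	}
	fmt.Println(drain(q, c), c.actual["prod/web"]) // prints "1 4"
}
```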

The cloud-controller-manager is the same idea, factored out for cloud-specific work. It runs the Node controller (cloud-provider variant — talks to the cloud API to look up instance metadata), the Service controller (creates the cloud LoadBalancer when you set type: LoadBalancer), and the Route controller (programs the cloud's VPC route table for pod CIDRs). The split exists so that the rest of Kubernetes can stay cloud-agnostic; the kube-controller-manager binary contains no AWS, GCP, or Azure code.

The diagram traces the entire decision: the user posts a Pod with no nodeName; api-server persists it; scheduler sees the watch event; scheduler runs its plugin chain to pick a node; scheduler issues a Bind subresource update against the api-server, which is the only way to set spec.nodeName on an already-existing Pod; api-server persists; the kubelet on the chosen node sees the watch event; the rest is data-plane machinery. Note that the scheduler never tells the kubelet anything directly. They have no connection to each other. They communicate exclusively through state changes mediated by the api-server.

This pattern — read state, decide, write state, let the watch propagate — is the only allowed control-flow shape inside the cluster. Direct RPC between control-plane components is forbidden by convention. If you ever build a custom controller and find yourself wanting to call another controller, you are about to make a mistake; what you actually want is a CRD that both controllers reconcile.

The data plane — kubelet, kube-proxy, CRI, CNI.

On every worker node, three processes run, plus one more component that is plumbed in but is not strictly Kubernetes. The kubelet is the supervisor: it watches the api-server for pods assigned to its node, reconciles them with the local container runtime, performs probes, mounts volumes, and reports node health back. kube-proxy is the load balancer: it watches Services and EndpointSlices and programs the kernel (usually via iptables, increasingly via nftables or eBPF) so that a packet to a Service ClusterIP gets DNAT'd to a pod IP. The container runtime (containerd, CRI-O) is the thing that actually pulls images, configures cgroups, and starts processes. The CNI plugin (Cilium, Calico, Flannel) is the thing that gives each pod an IP and wires its veth into the node's network namespace.

Of the four, only the first two are Kubernetes binaries. The container runtime and the CNI plugin are deliberately external, gated by stable plug-in APIs. CRI is a gRPC interface served on a Unix socket — typically /run/containerd/containerd.sock — that kubelet calls when it needs to create a Pod sandbox or start a container. CNI is even simpler: a binary in /opt/cni/bin/ that kubelet (technically, the runtime on kubelet's behalf) execs with a JSON config on stdin, which prints the assigned IP on stdout. This sub-process design is why you can swap your CNI in production by changing the CNI config file and rolling the nodes; nothing in Kubernetes itself has to change.

The kubelet's main loop is sometimes called the SyncLoop. It is event-driven rather than fixed-rate: it wakes whenever one of its sources fires, with a periodic sync tick on the order of a second as the backstop. Each iteration it consumes events from four sources (the api-server watch, a file-watch on /etc/kubernetes/manifests/ for static pods, an HTTP-pull source for legacy installations, and a periodic re-sync timer) and produces the desired set of pods the node should be running. It then walks the actual set of pods (queried from the runtime), diffs the two, and emits SyncPod operations for everything that differs. SyncPod is itself a small state machine: ensure sandbox, ensure init containers in order, ensure regular containers, start probes, report status. The kubelet sub-page goes deep.
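The desired-versus-actual diff at the heart of that loop can be sketched as a pure function. The syncPlan type and the idea of keying pods by a spec hash are illustrative assumptions, not the kubelet's real data structures.

```go
package main

import (
	"fmt"
	"sort"
)

// syncPlan lists what one loop iteration would have to act on.
type syncPlan struct {
	Start   []string // desired but not running
	Restart []string // running with an out-of-date spec
	Stop    []string // running but no longer desired
}

// diffPods compares desired pods (name -> spec hash) with what the
// runtime reports (name -> spec hash of the running pod).
func diffPods(desired, actual map[string]string) syncPlan {
	var p syncPlan
	for name, want := range desired {
		got, running := actual[name]
		switch {
		case !running:
			p.Start = append(p.Start, name)
		case got != want:
			p.Restart = append(p.Restart, name)
		}
	}
	for name := range actual {
		if _, wanted := desired[name]; !wanted {
			p.Stop = append(p.Stop, name)
		}
	}
	// Map iteration order is random; sort for stable output.
	sort.Strings(p.Start)
	sort.Strings(p.Restart)
	sort.Strings(p.Stop)
	return p
}

func main() {
	desired := map[string]string{"web-a": "h1", "web-b": "h2", "web-c": "h3"}
	actual := map[string]string{"web-b": "h2", "web-c": "old", "web-d": "h4"}
	fmt.Printf("%+v\n", diffPods(desired, actual))
}
```

Pods that appear in neither list (web-b here) need no work, which is why a periodic re-sync over an unchanged node is cheap.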

kube-proxy is conceptually the simplest of the four. It maintains an iptables / nftables ruleset that captures every packet destined for a Service ClusterIP (which is not a real IP; there is no interface bound to it) and rewrites it to a randomly-chosen backend pod IP. On large clusters the iptables ruleset becomes pathological: a single Service with 500 endpoints produces a 500-rule linear scan for the first packet of each new connection (conntrack handles the rest), and clusters with 10,000 services become CPU-bound. The remediation is IPVS (kernel hashtable), nftables (faster sets), or eBPF (socket-LB, Cilium). kube-proxy itself does not see the packets; it only programs the kernel. If kube-proxy crashes, the rules stay in place and traffic continues; what breaks is the next EndpointSlice update.
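In iptables mode the "randomly-chosen backend" comes from a chain of rules evaluated top to bottom, each matching with some probability. For the choice to be uniform, the i-th of n rules has to match with probability 1/(n-i), ending in a rule that always matches. The sketch below computes those per-rule probabilities and checks that every backend ends up equally likely; the function names are illustrative.

```go
package main

import "fmt"

// ruleProbabilities returns the match probability for each rule in a chain
// of n backends, evaluated top to bottom: 1/n, 1/(n-1), ..., 1.
func ruleProbabilities(n int) []float64 {
	probs := make([]float64, n)
	for i := 0; i < n; i++ {
		probs[i] = 1.0 / float64(n-i)
	}
	return probs
}

// overall returns the chance that each backend is the one finally chosen,
// given that a rule is only evaluated if all earlier rules did not match.
func overall(probs []float64) []float64 {
	out := make([]float64, len(probs))
	reach := 1.0 // probability that evaluation reaches this rule
	for i, p := range probs {
		out[i] = reach * p
		reach *= 1 - p
	}
	return out
}

func main() {
	probs := ruleProbabilities(4)
	fmt.Println(probs)
	fmt.Println(overall(probs))
}
```

The same structure explains the scaling problem: the last backend in the chain is only reached after every earlier rule has been evaluated and skipped.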

A subtlety worth holding: kubelet does not call kube-proxy, and kube-proxy does not call kubelet. They are independent watchers of the api-server, running on the same machine for convenience but not coordinated. Kubelet learns about pods; kube-proxy learns about Services and EndpointSlices; they happen to converge on a working network because the EndpointSlice controller (in the controller-manager) generates EndpointSlices from Pods, and kube-proxy reads those. The data flow is: pod created → api-server → endpointslice-controller → api-server → kube-proxy → iptables. Everything goes through the api-server, even between processes on the same machine.

Operational note: when a node goes NotReady, the api-server can no longer deliver watch updates to it (it is unreachable), but the kubelet on that node keeps doing what it last saw. If the partition heals, the kubelet picks up where it left off. The cluster does not have a "fence" primitive; what it has is the node-lifecycle controller evicting the node's Pods through the api-server after a grace period, and a kubelet on the partitioned node eventually noticing the Pod is gone when it reconnects.

Leader election — the lease object, deeply.

If you run three replicas of the controller-manager for HA, you do not want all three trying to reconcile the same Deployment in parallel — they would race, conflict, and produce nondeterministic results. What you want is exactly one of them to be the active reconciler at any moment, with the other two in hot-standby ready to take over within seconds if the leader dies. Kubernetes solves this with a primitive called a Lease in the coordination.k8s.io/v1 API group, and a small client-go library called leaderelection that turns a Lease into a distributed lock.

A Lease is just a Kubernetes object — it lives in etcd like any other — with three interesting fields: holderIdentity (the unique ID of the current leader), leaseDurationSeconds (how long the lease is valid), and renewTime (the last timestamp the holder refreshed it). To acquire, every replica races to UPDATE the same Lease object with its own identity, using the api-server's optimistic-concurrency primitive: the request includes the expected resourceVersion, and etcd's transaction will accept exactly one of the parallel updates. The losers see 409 Conflict and back off. The winner is the leader.

To stay leader, the holder updates renewTime roughly every RetryPeriod, and must succeed within RenewDeadline. If the holder dies (or its kubelet does, or its node partitions), it stops updating renewTime. The other replicas are polling the Lease, see the renewal stop, wait for leaseDuration to elapse since the last renewal they observed, and then race again. The winning challenger sets a new holderIdentity and starts reconciling. Failover takes, in the default configuration, somewhere between fifteen and forty seconds: long enough that a brief network blip does not flap the leader, short enough that a dead process does not stall the cluster.

// client-go/tools/leaderelection: the lease pattern, abridged.
// https://github.com/kubernetes/client-go/blob/master/tools/leaderelection/leaderelection.go
// rl (the Lease resource lock), ctx and controller are built elsewhere.

import (
    "context"
    "time"

    "k8s.io/client-go/tools/leaderelection"
    "k8s.io/klog/v2"
)

leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
    Lock:          rl,                  // the Lease resource lock
    LeaseDuration: 15 * time.Second,    // followers wait this long before challenging
    RenewDeadline: 10 * time.Second,    // leader gives up if it cannot renew within this
    RetryPeriod:    2 * time.Second,    // how often to attempt acquire/renew
    Callbacks: leaderelection.LeaderCallbacks{
        OnStartedLeading: func(ctx context.Context) {
            // I am the leader. Start informers, run the reconcile loop.
            controller.Run(ctx)
        },
        OnStoppedLeading: func() {
            // Lost the lease. Exit; the process supervisor will restart us.
            klog.Fatal("lost leader lease")
        },
    },
})
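To make the acquire/renew rule concrete, here is a single-goroutine sketch of one attempt against a Lease. The Lease struct and helpers are hypothetical, and the resourceVersion guard that makes the real update atomic is deliberately left out, so this shows only the decision logic, not the locking.

```go
package main

import (
	"fmt"
	"time"
)

// Lease mirrors the three fields named in the text; everything else is omitted.
type Lease struct {
	HolderIdentity string
	LeaseDuration  time.Duration
	RenewTime      time.Time
}

// newLease is a helper that builds an unheld lease with the given duration.
func newLease(seconds int64) *Lease {
	return &Lease{LeaseDuration: time.Duration(seconds) * time.Second}
}

// at is a helper that turns "seconds since start" into a timestamp.
func at(sec int64) time.Time { return time.Unix(sec, 0) }

// tryAcquireOrRenew is one attempt by candidate id at time now.
// It reports whether id holds the lease afterwards.
func tryAcquireOrRenew(l *Lease, id string, now time.Time) bool {
	expired := now.Sub(l.RenewTime) > l.LeaseDuration
	switch {
	case l.HolderIdentity == id:
		// We already hold it: fall through to the renewal below.
	case l.HolderIdentity == "" || expired:
		// Free or stale: take over.
		l.HolderIdentity = id
	default:
		// Someone else holds a live lease: back off.
		return false
	}
	l.RenewTime = now
	return true
}

func main() {
	l := newLease(15)
	fmt.Println(tryAcquireOrRenew(l, "a", at(0)))  // a acquires
	fmt.Println(tryAcquireOrRenew(l, "b", at(5)))  // b is refused
	fmt.Println(tryAcquireOrRenew(l, "a", at(10))) // a renews, then dies
	fmt.Println(tryAcquireOrRenew(l, "b", at(20))) // only 10s since renewal
	fmt.Println(tryAcquireOrRenew(l, "b", at(26))) // 16s since renewal: b takes over
	fmt.Println(l.HolderIdentity)
}
```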

The pattern's elegance is that it is built on top of the same optimistic-concurrency primitive that ordinary controllers use to update any Pod or Deployment. There is no special "leaderelection" gRPC; it is just an UPDATE on a Lease object with the resourceVersion guard, and etcd's Raft happens to give that update exactly the linearisability you need. The cluster's distributed-lock primitive is reused from its distributed-storage primitive, with no new code in the api-server.

There are roughly forty Lease objects in a typical cluster, mostly in kube-system and kube-node-lease. The latter is its own special use: one Lease per Node, updated every ten seconds by the kubelet, as a low-cost heartbeat replacement for the older mechanism of writing the entire Node status. Because a Lease is far smaller than a full Node status, that alone removes most of the heartbeat write volume on large clusters compared to the pre-1.13 heartbeat. You can list them with kubectl get leases -A; the controller-manager's leader is at kube-system/kube-controller-manager, the scheduler's at kube-system/kube-scheduler, and so on for every leader-elected controller you run.

If you write your own controller and want HA, use the same library. The controller-runtime framework wraps it: set LeaderElection: true and a LeaderElectionID in manager.Options; the underlying call is the same. Choose a LeaderElectionID and LeaderElectionNamespace that are unique per cluster-instance of your controller. Operator deployments commonly forget this, and two installations that share a Kubernetes API end up contending for the same Lease, so one of them silently never runs.

Tuning rule of thumb — RenewDeadline must be less than LeaseDuration minus one RetryPeriod, or you can lose leadership while you still believe you hold it. Defaults of (15, 10, 2) satisfy this; do not invent your own without thinking.
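The rule of thumb is easy to encode as a startup check. The function below implements exactly the inequality stated above; it is a hypothetical helper, not the validation that client-go itself performs.

```go
package main

import (
	"fmt"
	"time"
)

// validateTimings enforces: RenewDeadline < LeaseDuration - RetryPeriod.
func validateTimings(leaseDuration, renewDeadline, retryPeriod time.Duration) error {
	if leaseDuration <= 0 || renewDeadline <= 0 || retryPeriod <= 0 {
		return fmt.Errorf("all durations must be positive")
	}
	if renewDeadline >= leaseDuration-retryPeriod {
		return fmt.Errorf("RenewDeadline %v must be less than LeaseDuration %v minus RetryPeriod %v",
			renewDeadline, leaseDuration, retryPeriod)
	}
	return nil
}

// seconds is a small helper for readable call sites.
func seconds(n int) time.Duration { return time.Duration(n) * time.Second }

func main() {
	fmt.Println(validateTimings(seconds(15), seconds(10), seconds(2))) // the defaults
	fmt.Println(validateTimings(seconds(15), seconds(14), seconds(2))) // too tight
}
```

Running a check like this before calling into the election library turns a subtle split-brain risk into an immediate configuration error.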

Ports, TLS, and the firewall rules that have to be true.

A working Kubernetes cluster is, from a network-policy perspective, a small and very specific set of allowed flows. Every other flow should be denied; many production incidents trace back to a flow that "just happened to work" because the network was permissive, and broke when someone tightened it. The table below is the canonical set. Memorise the four important ones — 6443, 2379, 10250, 10256 — and you can debug 90% of control-plane connectivity issues from first principles.

Component	Port	Protocol	Inbound from	Notes
kube-apiserver	6443	HTTPS (TLS 1.2+)	every client	mTLS for control-plane peers; bearer-token + cert for clients
etcd (client)	2379	gRPC over TLS	kube-apiserver only	mTLS, peer cert pinned
etcd (peer)	2380	gRPC over TLS	other etcd members	Raft replication traffic
kube-scheduler	10259	HTTPS	metrics + healthz	no inbound RPC, watches api-server
kube-controller-manager	10257	HTTPS	metrics + healthz	leader-elected, watches api-server
cloud-controller-manager	10258	HTTPS	metrics + healthz	splits cloud-specific loops out
kubelet	10250	HTTPS	api-server, metrics-server	authn: webhook to api-server; serves logs, exec, stats
kubelet (read-only)	10255	HTTP	historical	disabled in modern installs
kube-proxy	10256	HTTP	health probes	/healthz, no API surface
NodePort range	30000–32767	TCP/UDP	external clients	kube-proxy programs the redirect

A few things in that table are worth pulling out. First, the api-server's 6443 is the only port any external client should ever talk to. If you are running a private cluster, this is the one port that gets exposed via a load balancer or a private endpoint. Everything else is internal traffic. Public exposure of 10250 (kubelet) is a known catastrophic misconfiguration: the kubelet exposes /exec and /run subresources, and prior to mandatory webhook authentication it could be hit anonymously to spawn root shells in pods. Modern installs require client-cert or webhook auth; do not regress.

Second, etcd's 2380 (peer Raft) is in some ways the most fragile port in the cluster. It needs low-latency, high-bandwidth connectivity between control-plane nodes, because every write must be replicated to a quorum before it is acknowledged. A control-plane spread across two regions is almost always a mistake — not for availability, because etcd is fine with cross-region peers in principle, but because the write latency dominates everything that touches the api-server, and a 50ms inter-region RTT becomes a 50ms floor on every write. Run etcd in a single AZ-pair if your provider's failure modes allow it.
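The latency floor follows from quorum arithmetic: the leader counts as one vote and has to wait for enough followers to reach a majority. The sketch below is a simplified model that looks only at network round trips from the leader (disk fsync and the api-server-to-leader hop are ignored); the function names and RTT values are illustrative assumptions.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// quorum is the number of members that must persist a write.
func quorum(members int) int { return members/2 + 1 }

// commitFloor returns the network floor on write latency as seen by the
// leader: the round trip to the slowest follower it still has to wait for.
func commitFloor(followerRTTs ...time.Duration) time.Duration {
	need := quorum(len(followerRTTs)+1) - 1 // the leader is one vote already
	if need == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), followerRTTs...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	return sorted[need-1]
}

// ms is a small helper for readable call sites.
func ms(n int) time.Duration { return time.Duration(n) * time.Millisecond }

func main() {
	// Three members, leader alone in its region: every write crosses regions.
	fmt.Println(commitFloor(ms(50), ms(50)))
	// Three members, leader has one local peer: quorum is reachable locally.
	fmt.Println(commitFloor(ms(1), ms(50)))
	// Five members, leader has only one local peer: needs a remote ack.
	fmt.Println(commitFloor(ms(1), ms(50), ms(50), ms(50)))
}
```

In this model the cross-region penalty applies whenever a majority cannot be assembled near the leader, and since leadership moves on every election, a stretched cluster will spend at least some of its life in that state.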

Third, kubelet's 10250 is bidirectional in spirit if not in port: kubelet receives commands from the api-server (exec into a pod, port-forward, fetch logs, stream metrics), and reports state back by writing Pod and Node status to the api-server over its own outbound connection. The connection model is "api-server initiates an HTTPS connection to the kubelet on demand". This means in installs where workers are behind NAT or inside private VPCs, the api-server has to have routable access back to every node. The Konnectivity service is the standard solution: nodes initiate a long-lived tunnel to the api-server, which reverse-proxies kubelet calls through it.

One TLS subtlety: the api-server's serving certificate has to satisfy at least two audiences, external clients (SANs include the apiserver-LB DNS name) and in-cluster service traffic (SAN kubernetes.default.svc), whether that is done with one certificate or with several selected by SNI. Get the SAN list wrong on a cert renewal and half the cluster's pods cannot talk to the api-server. The diagnostic is always openssl s_client -connect ... -showcerts.
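A SAN check is also easy to automate before a renewed certificate is rolled out. The sketch below builds a throwaway self-signed certificate purely so the example is self-contained, then reports which required names it fails to cover. The hostname api.example.internal and both helper functions are hypothetical; in practice the certificate would be parsed from the renewed PEM file.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// selfSigned builds a throwaway certificate with the given DNS SANs.
// It exists only to give the check below something to inspect.
func selfSigned(dnsNames []string) (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "kube-apiserver"},
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(time.Hour),
		DNSNames:     dnsNames,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

// missingSANs returns the required names the certificate does not cover.
func missingSANs(cert *x509.Certificate, required []string) []string {
	var missing []string
	for _, name := range required {
		if err := cert.VerifyHostname(name); err != nil {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	cert, err := selfSigned([]string{"api.example.internal", "kubernetes.default.svc"})
	if err != nil {
		panic(err)
	}
	required := []string{
		"api.example.internal",                 // hypothetical load-balancer name
		"kubernetes.default.svc",               // in-cluster short name
		"kubernetes.default.svc.cluster.local", // in-cluster fully qualified name
	}
	fmt.Println("missing SANs:", missingSANs(cert, required))
}
```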

Keep going.

The lifecycle of kubectl apply

Twelve hops from the keystroke to the running pod, named, timed, explained.

Pod scheduling, end to end

From pending Pod to running container, through the scheduler framework and the kubelet SyncLoop.

The controller pattern

Informers, listers, work queues, reconciliation. Pseudocode you can ship.

Back to the internals index

All twelve sub-pages — four live, eight planned — and the system on one canvas.
