Sub-page 10 · for practitioners + operators

Kubernetes internals · kubelet

One binary, one node, one truth.

On every machine that runs a workload, exactly one Go binary is responsible for translating the api-server's idea of what should be running into actual Linux processes. It is called kubelet, it is roughly 200,000 lines of code under pkg/kubelet/, and almost everything that goes wrong on a Kubernetes node ultimately goes wrong inside it.

This page walks the kubelet from the moment systemd execs it to the moment it kills its last pod: the SyncLoop, the three pod sources, the CRI gRPC handshake, the CSI mount dance, the probe worker pool, the eviction decision tree, and the on-node HTTP API. Pair it with the pod lifecycle sub-page for the cluster-side view; this page is strictly about the kubelet binary.

One binary per node, and what it isn't.

The kubelet is the most privileged Kubernetes process on a node. It runs as root, it talks directly to the kernel for cgroups and namespaces, it owns the connection to the container runtime's Unix socket, and it holds the node's client certificate signed by the cluster CA. Everything else on the node (the runtime, the CNI plugin, kube-proxy, and any DaemonSet you have ever written) is either a container started on the kubelet's instructions, a service the kubelet calls into, or a peer that watches the same api-server the kubelet does. There is exactly one kubelet per node, typically supervised by systemd as kubelet.service, and if it dies the node goes NotReady within about forty seconds.

It is worth being precise about what the kubelet is not. It is not a scheduler. It does not pick which pods to run; the scheduler upstream has already written spec.nodeName on the Pod object before the kubelet ever sees it. The kubelet's filter is dumb: any Pod whose spec.nodeName equals its own node name is mine, anything else is not. It is also not a container runtime. It does not pull images, it does not start processes, and it does not set up per-container cgroups or namespaces itself; it creates only the pod-level and QoS-level cgroups that containers are later placed under. Everything that touches a process is delegated over the Container Runtime Interface (CRI) to containerd or CRI-O, which in turn delegates to runc. Finally, it is not a network programmer. It does not write iptables rules; that is kube-proxy. It does not assign pod IPs; that is the CNI plugin invoked by the runtime. The kubelet's job is supervision and reconciliation, and almost every operation it performs is a thin wrapper around a call to something else.

Why one binary per node and not, say, one cluster-wide controller managing all pods remotely? The answer is partition tolerance. If a network partition severs a node from the api-server, you want the workloads on that node to keep running on their last-known spec, with their probes still firing and their resources still bounded, until either the partition heals or the eviction controller upstream removes them from etcd and the kubelet eventually stops them. A node-local supervisor can hold the line; a remote one cannot. The Borg paper makes the same argument under the name "Borglet"; the kubelet inherits both the design and (roughly) the word count of its predecessor.

Architecturally, the kubelet is a single Go process with one main goroutine, the SyncLoop, and a constellation of named worker goroutines around it. The SyncLoop reads from channels populated by three pod-source goroutines and a periodic timer, dispatches reconciliation work to a per-pod worker pool, and exposes a small HTTPS server on :10250 for the api-server to call back into for logs, exec, and port-forward. Almost every other named subsystem you will read about (the volumeManager, the probeManager, the plegManager or Pod Lifecycle Event Generator, the imageManager, the evictionManager, the deviceManager, the statusManager) is a goroutine cluster that the SyncLoop consults during each reconciliation pass. The whole thing is roughly fifty live goroutines on a quiet node and several hundred on a busy one.

The kubelet is configured by two artifacts. The first is a YAML file, conventionally /var/lib/kubelet/config.yaml, of kind KubeletConfiguration in API group kubelet.config.k8s.io/v1beta1. This is where you set eviction thresholds, the cluster DNS server, the CRI socket path, and the cgroup driver. The second is a kubeconfig at /etc/kubernetes/kubelet.conf with the node's client certificate and the api-server URL. Anything else (--register-node, --hostname-override) goes on the systemd unit's command line. Most kubeadm-installed clusters keep flags minimal and put everything in the YAML.

Failure mode to internalise. The kubelet itself is intentionally restartable. systemd restarts it on crash, and on restart it re-syncs the pod list against the runtime without killing any running containers. A kubelet flap of a few seconds is invisible to workloads. A kubelet flap of more than the node-monitor-grace-period (default 40s) marks the node NotReady and starts the eviction clock.

Three pod sources, merged in priority order.

The kubelet's input is a multiplexed channel of Pod specs from three sources, merged inside pkg/kubelet/config/ by a small fan-in goroutine. The dominant source is the api-server watch: a long-lived HTTP/2 connection to :6443 with a field selector of spec.nodeName=<hostname>, scoped to that one node so the kubelet never sees pods that aren't its problem. Events on this stream (ADDED, MODIFIED, DELETED) translate into pod-config updates, with the api-server's resourceVersion as the watermark. This is how 99% of pods get to a node, and the only source most operators ever think about.

The second source is the file source: an inotify watch on /etc/kubernetes/manifests/ (path is configurable via staticPodPath in KubeletConfiguration). Any YAML or JSON file dropped here is parsed as a Pod manifest and treated as a "static pod": a pod the kubelet runs autonomously, without the api-server's permission, simply because a file on disk says so. The kubelet then helpfully mirrors each static pod by creating a corresponding read-only Pod object in the api-server (named with the suffix -<nodename>), so kubectl users can see it; the mirror is a one-way reflection, kubectl edits to it do nothing, and deleting the mirror gets it recreated within a SyncLoop tick. This is how the control plane bootstraps itself: kube-apiserver, kube-controller-manager, kube-scheduler, and etcd on a control-plane node are usually static pods, because the api-server cannot schedule itself. (kube-proxy, by contrast, normally runs as an ordinary DaemonSet.)

The third source is the HTTP source: if the kubelet is started with --manifest-url=https://... it polls that URL every twenty seconds and treats the returned JSON or YAML as a list of Pod manifests, the same as the file source. It is a relic from before the api-server existed in its current form, used in a handful of edge deployments where a node needs a manifest but cannot mount a filesystem path. It is mostly dead code now and most production clusters disable it; if you have ever set --manifest-url by accident, you have probably already been told.

All three sources funnel into a structure called PodConfig which holds, per-source, the latest set of Pods that source has produced. The kubelet's view of the world at any moment is the union of those three sets, merged on Pod UID. If two sources produce the same UID — which should never happen — the file source wins, then HTTP, then api-server. The priority is hard-coded and matters because a static pod accidentally named the same as an api-server-scheduled pod will simply ignore the api-server's instructions; this is occasionally a subtle source of pain in clusters that mix static and managed pods on the same node.
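The merge rule described above can be sketched in a few lines of Go. This is a minimal illustration, not the code in pkg/kubelet/config/: the Pod struct, the source names, and the mergeSources and priorityOf helpers are all hypothetical, and the priority order is simply the one stated in the paragraph above.

```go
package main

import (
	"fmt"
	"sort"
)

// Pod is a stripped-down stand-in for v1.Pod (hypothetical, for illustration).
type Pod struct {
	UID    string
	Name   string
	Source string // filled in by mergeSources: "file", "http", or "api"
}

// sourcePriority encodes the order stated in the text: file, then http, then api.
// Lower number wins. These are illustrative values, not kubelet constants.
var sourcePriority = map[string]int{"file": 0, "http": 1, "api": 2}

// priorityOf ranks unknown sources below every known one.
func priorityOf(source string) int {
	if p, ok := sourcePriority[source]; ok {
		return p
	}
	return len(sourcePriority)
}

// mergeSources takes the latest pod set from each source and keeps one pod
// per UID, preferring the higher-priority source on a collision.
func mergeSources(perSource map[string][]Pod) []Pod {
	winners := map[string]Pod{}
	for source, pods := range perSource {
		for _, p := range pods {
			p.Source = source
			cur, seen := winners[p.UID]
			if !seen || priorityOf(source) < priorityOf(cur.Source) {
				winners[p.UID] = p
			}
		}
	}
	out := make([]Pod, 0, len(winners))
	for _, p := range winners {
		out = append(out, p)
	}
	// Sort by UID so the result is stable regardless of map iteration order.
	sort.Slice(out, func(i, j int) bool { return out[i].UID < out[j].UID })
	return out
}

func main() {
	merged := mergeSources(map[string][]Pod{
		"api":  {{UID: "u1", Name: "web"}, {UID: "u2", Name: "etcd-from-api"}},
		"file": {{UID: "u2", Name: "etcd-static"}},
	})
	for _, p := range merged {
		fmt.Printf("%s %s (%s)\n", p.UID, p.Name, p.Source)
	}
}
```

In the example the colliding UID u2 resolves to the file-sourced pod, which is the "static pod silently wins" behaviour the paragraph warns about.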

# A static pod manifest lives at /etc/kubernetes/manifests/etcd.yaml on a control-plane node.
# It is run by the kubelet directly. The api-server has nothing to do with its lifecycle.
apiVersion: v1
kind: Pod
metadata:
  name: etcd
  namespace: kube-system
  labels:
    component: etcd
    tier: control-plane
spec:
  hostNetwork: true
  priorityClassName: system-node-critical
  containers:
  - name: etcd
    image: registry.k8s.io/etcd:3.5.13-0
    command:
    - etcd
    - --advertise-client-urls=https://192.0.2.10:2379
    - --listen-client-urls=https://0.0.0.0:2379
    - --data-dir=/var/lib/etcd
    volumeMounts:
    - name: etcd-data
      mountPath: /var/lib/etcd
    - name: etcd-certs
      mountPath: /etc/kubernetes/pki/etcd
  volumes:
  - name: etcd-data
    hostPath: { path: /var/lib/etcd, type: DirectoryOrCreate }
  - name: etcd-certs
    hostPath: { path: /etc/kubernetes/pki/etcd, type: DirectoryOrCreate }

Two things to notice. First, there is no spec.nodeName — static pods inherit the node they sit on by definition. Second, the kubelet will create a mirror Pod in the api-server called etcd-cp-1 (where cp-1 is the hostname); kubectl will see and report on it, but cannot delete or modify it. The only way to remove a static pod is to remove the file. This asymmetry is occasionally surprising during cluster upgrades; kubeadm's upgrade logic always rewrites the manifest files in /etc/kubernetes/manifests/ rather than using kubectl.

Edge case worth knowing. The kubelet checksums each static pod manifest and only restarts the pod if the checksum changes. Editing the file in place with a no-op (touch) does nothing; you have to actually change a byte. This is what kubeadm upgrade node exploits to roll the control-plane components without flapping unrelated containers.
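The checksum behaviour is easy to picture with a small sketch. The hash function here (SHA-256) and the helper names manifestHash and needsRestart are assumptions chosen for illustration; the text above only says that the kubelet compares a checksum, not which one it uses.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// manifestHash returns a hex digest of the raw manifest bytes.
// SHA-256 is an assumption for this sketch, not necessarily the kubelet's choice.
func manifestHash(content []byte) string {
	sum := sha256.Sum256(content)
	return hex.EncodeToString(sum[:])
}

// needsRestart reports whether the manifest content differs from the content
// that produced lastHash. File metadata such as mtime never enters the hash,
// which is why touch alone changes nothing.
func needsRestart(lastHash string, content []byte) bool {
	return manifestHash(content) != lastHash
}

func main() {
	original := []byte("command: [etcd, --data-dir=/var/lib/etcd]\n")
	last := manifestHash(original)

	// Same bytes, for example after `touch etcd.yaml`.
	fmt.Println(needsRestart(last, original))

	// One flag added (hypothetical edit), so the bytes differ.
	edited := []byte("command: [etcd, --data-dir=/var/lib/etcd, --debug]\n")
	fmt.Println(needsRestart(last, edited))
}
```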

SyncPod, SyncTerminatingPod, SyncTerminatedPod.

The kubelet's main goroutine is syncLoop() in pkg/kubelet/kubelet.go, and the function it calls every iteration is syncLoopIteration(). It is a select statement over four channels: updates from the merged pod-config source, events from the PLEG (the runtime poller that emits container-state-change events), a periodic ten-second housekeeping tick, and a faster one-second sync timer for pods that are mid-transition. Each event ultimately enqueues work onto a per-pod goroutine, the pod worker, so that two pods never block each other and one stuck container cannot wedge the loop.
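The shape of that select statement can be sketched as follows. This is a simplified stand-in, not the real syncLoopIteration: the function name nextEvent, the string payloads, and the returned descriptions are all hypothetical, and the real loop dispatches to handlers rather than returning a string.

```go
package main

import (
	"fmt"
	"time"
)

// nextEvent blocks until one of the four inputs is ready and reports which
// one fired. A nil channel is never ready, so callers can disable an input
// by passing nil.
func nextEvent(configCh, plegCh <-chan string, syncTick, housekeepingTick <-chan time.Time) string {
	select {
	case u, ok := <-configCh:
		if !ok {
			return "config channel closed"
		}
		return "config update: " + u
	case e := <-plegCh:
		return "pleg event: " + e
	case <-syncTick:
		return "sync tick"
	case <-housekeepingTick:
		return "housekeeping tick"
	}
}

func main() {
	configCh := make(chan string, 1)
	plegCh := make(chan string, 1)

	configCh <- "ADD web-7d8"
	fmt.Println(nextEvent(configCh, plegCh, nil, nil))

	plegCh <- "ContainerDied web-7d8/web"
	fmt.Println(nextEvent(configCh, plegCh, nil, nil))

	// With both queues empty, only the timer can wake the loop.
	tick := time.NewTicker(10 * time.Millisecond)
	defer tick.Stop()
	fmt.Println(nextEvent(configCh, plegCh, nil, tick.C))
}
```

When several inputs are ready at once, Go's select picks one at random, which is one reason each handler has to be idempotent.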

Per pod, the worker dispatches into one of three reconciliation methods, all on the same PodWorkers interface. SyncPod is the steady-state path: ensure the sandbox exists, ensure init containers have run in order, ensure regular containers are running, kick the probes, update status. It runs whenever a pod is in phase Pending or Running and not yet being deleted. SyncTerminatingPod is the shutdown path: it runs the preStop hooks, sends SIGTERM to each container, waits up to terminationGracePeriodSeconds, and then SIGKILLs anything still alive. It runs once per pod when deletionTimestamp is set. SyncTerminatedPod is the cleanup path: it tears down the sandbox, unmounts the volumes, releases the IP back to CNI, and finally allows the api-server to garbage-collect the Pod object by removing the kubelet's finalizer. The three are strictly ordered: SyncTerminatingPod may not run before SyncPod has finished its current pass, and SyncTerminatedPod may not run before SyncTerminatingPod has signalled completion. This three-state machine replaced the older two-state design in 1.22 (KEP-3675) and fixed a long-standing class of bugs around volume detach during pod deletion.

The ten-second housekeeping tick deserves its own paragraph because it is the source of much kubelet folklore. Every ten seconds, the SyncLoop walks every pod the kubelet currently knows about, asks the runtime for the actual container set, diffs them, and emits sync events for anything that drifted. This is the kubelet's idempotency guarantee: even if a watch is missed, even if the PLEG fires late, even if a container exits unexpectedly, the housekeeping tick will notice within ten seconds and dispatch the appropriate SyncPod. The flip side is that ten seconds is also the lower bound on a lot of operations: the time to notice a CNI failure, the time to restart a CrashLoopBackOff container that just hit its backoff window, the time to update a status condition. You cannot make the kubelet faster than its tick, and the tick itself is hard-coded. What you can tune is the interval of the full periodic resync of every pod, via --sync-frequency (default 60s); the kubelet also has internal subloops that run faster.

Concretely, the SyncPod state machine for a fresh pod hitting a node looks like this. Watch event arrives with spec.nodeName=N. The kubelet computes the desired container set, notes there is no sandbox, calls the volume manager to mount the declared volumes (this can take seconds for CSI), waits for the volume manager's WaitForAttachAndMount to return, then calls the runtime's RunPodSandbox to create the pause container and the network namespace. Once the sandbox is up, meaning the runtime has reported a sandbox ID and an IP, the kubelet calls PullImage for each container's image (one pull at a time by default, because serializeImagePulls is on unless you turn it off), then CreateContainer + StartContainer for each init container in order, then the same for the regular containers. Probes start firing. The status manager updates the Pod's status via the api-server. End of SyncPod.

// pkg/kubelet/pod_workers.go — abridged signature of the three sync paths.
// https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/pod_workers.go

SyncPod(ctx, updateType, pod, mirrorPod, podStatus) (isTerminal bool, err error)
    // 1. ensure pod-level cgroup (Burstable/Guaranteed/BestEffort QoS class)
    // 2. volumeManager.WaitForAttachAndMount(pod)
    // 3. computePodActions: which sandbox/containers need start, kill, restart?
    // 4. runtime.SyncPod(...): RunPodSandbox if needed, then init then regular
    // 5. probeManager.AddPod(pod) starts liveness/readiness/startup workers
    // 6. statusManager.SetPodStatus → api-server PATCH

SyncTerminatingPod(ctx, pod, podStatus, gracePeriod, podStatusFn) error
    // runs once when deletionTimestamp is observed
    // 1. probeManager.RemovePod(pod)  — stop firing probes
    // 2. for each container: preStop hook → SIGTERM → wait → SIGKILL
    // 3. runtime.KillPod(...) drops the sandbox last

SyncTerminatedPod(ctx, pod, podStatus) error
    // runs once after SyncTerminatingPod returns
    // 1. volumeManager.WaitForUnmount(pod)
    // 2. cgroupManager.Destroy(pod)
    // 3. statusManager.TerminatePod → api-server (final status)
    // 4. remove the "kubelet" finalizer → api-server GC can delete the Pod

Two operational implications fall out of this design. First, idempotence: every SyncPod call must produce the same end state regardless of how many times it runs against the same input. The kubelet relies on this so heavily that the runtime interface is built around "ensure" semantics rather than "create": RunPodSandbox is a no-op if the sandbox already exists, CreateContainer is rejected with a specific gRPC code if the container is already created. Second, per-pod isolation: because each pod has its own goroutine, a pod whose volume mount hangs for two minutes blocks only itself. Other pods on the node continue to reconcile. This is why a single misbehaving CSI driver does not take down a node, only the pods that depend on it.
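The per-pod isolation property can be demonstrated with a toy worker pool. This is a sketch under stated assumptions: the podWorkers type below shares only its name with the kubelet's, and UpdatePod, Shutdown, and the string work items are hypothetical simplifications of the real interface.

```go
package main

import (
	"fmt"
	"sync"
)

// podWorkers gives each pod UID its own queue and goroutine, so a slow sync
// for one pod never delays another.
type podWorkers struct {
	mu     sync.Mutex
	queues map[string]chan string // pod UID -> pending work items
	wg     sync.WaitGroup
	syncFn func(uid, work string)
}

func newPodWorkers(syncFn func(uid, work string)) *podWorkers {
	return &podWorkers{queues: map[string]chan string{}, syncFn: syncFn}
}

// UpdatePod enqueues work for a pod, starting its worker on first use.
func (w *podWorkers) UpdatePod(uid, work string) {
	w.mu.Lock()
	q, ok := w.queues[uid]
	if !ok {
		q = make(chan string, 16)
		w.queues[uid] = q
		w.wg.Add(1)
		go func() {
			defer w.wg.Done()
			for item := range q {
				w.syncFn(uid, item)
			}
		}()
	}
	w.mu.Unlock()
	q <- work
}

// Shutdown closes every queue and waits for the workers to drain.
func (w *podWorkers) Shutdown() {
	w.mu.Lock()
	for _, q := range w.queues {
		close(q)
	}
	w.mu.Unlock()
	w.wg.Wait()
}

func main() {
	release := make(chan struct{})
	done := make(chan string, 2)
	w := newPodWorkers(func(uid, work string) {
		if uid == "slow" {
			<-release // simulate a volume mount that hangs
		}
		done <- uid
	})
	w.UpdatePod("slow", "sync")
	w.UpdatePod("fast", "sync")
	fmt.Println("finished first:", <-done)
	close(release)
	fmt.Println("finished second:", <-done)
	w.Shutdown()
}
```

The pod enqueued second finishes first because the hung pod blocks only its own goroutine.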

Debug tip. When a pod is "stuck" in Terminating, the question to ask is which of the three sync paths is blocked. kubectl describe tells you the phase; journalctl -u kubelet tells you which sub-step. Almost always it is volume unmount on a CSI driver that lost its connection, or a preStop hook that sleeps longer than terminationGracePeriodSeconds. The kubelet does not lie; it is just very quiet about waiting.

CRI. The gRPC contract with the runtime.

Everything the kubelet does that touches a process or an image goes over the Container Runtime Interface, a gRPC service defined in staging/src/k8s.io/cri-api/pkg/apis/runtime/v1/api.proto. The kubelet dials a Unix domain socket (typically unix:///run/containerd/containerd.sock for containerd or unix:///run/crio/crio.sock for CRI-O) and speaks gRPC over it. CRI splits the surface into two services: RuntimeService (sandboxes, containers, exec, port-forward) and ImageService (pull, list, remove). They are deliberately split so that an image-only daemon could implement just the latter; in practice every modern runtime ships both.

The runtime side of the socket (containerd, in the dominant case) is a daemon that itself delegates per-container work to a per-pod sub-process called a shim. When the kubelet calls RunPodSandbox, containerd forks a containerd-shim-runc-v2 for the new pod, which forks runc to actually create the cgroup, set up the namespaces, and exec the pause container. The shim then sticks around for the lifetime of the pod, supervising the containers in it. If containerd dies and is restarted, the shims survive. They hold their containers' lifecycle state independently. This is why a containerd restart is non-disruptive to running workloads.

The handshake for a single pod runs as follows. The kubelet calls RunPodSandbox with the pod's namespace, name, UID, and a LinuxPodSandboxConfig that includes the cgroup parent path. The runtime calls the CNI plugin (via /opt/cni/bin/..., with config from /etc/cni/net.d/) to allocate a pod IP and wire the veth pair. The runtime returns the sandbox ID and the IP. The kubelet then calls PullImage for each image (skipped if the image is already cached with the right pull policy), CreateContainer with the per-container config (image, command, env, mounts, resources), and finally StartContainer.

Image pull is its own subsystem. The kubelet maintains an in-memory cache of pulled images and a disk-backed cache provided by the runtime; the policy imagePullPolicy determines whether to consult the registry. Modern clusters offload registry credentials to a credential provider plugin: a binary in /etc/kubernetes/image-credential-provider/ that the kubelet execs, writing a JSON request that names the image to its stdin and reading JSON credentials from its stdout. This is how EKS, GKE, and AKS delegate to IAM-backed registry auth without putting cloud SDKs in the kubelet binary; the cloud-specific credential provider is a tiny external program and the kubelet stays cloud-agnostic. KEP-2133 standardised this in 1.26.
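To make the stdin/stdout contract concrete, here is a minimal sketch of the plugin side. The JSON field names follow my reading of the credentialprovider.kubelet.k8s.io/v1 types and should be checked against your kubelet version; the lookup function, the registry name, the ten-minute cache duration, and the credentials are hypothetical placeholders. A real plugin would wire handle to os.Stdin and os.Stdout.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// request and response mirror the plugin's JSON envelope (assumed field names).
type request struct {
	APIVersion string `json:"apiVersion"`
	Kind       string `json:"kind"`
	Image      string `json:"image"`
}

type authConfig struct {
	Username string `json:"username"`
	Password string `json:"password"`
}

type response struct {
	APIVersion    string                `json:"apiVersion"`
	Kind          string                `json:"kind"`
	CacheKeyType  string                `json:"cacheKeyType"`
	CacheDuration string                `json:"cacheDuration"`
	Auth          map[string]authConfig `json:"auth"`
}

// lookupFunc resolves credentials for a registry host (hypothetical hook).
type lookupFunc func(registry string) (user, pass string, ok bool)

// handle reads one request, resolves credentials, and writes one response.
func handle(in io.Reader, out io.Writer, lookup lookupFunc) error {
	var req request
	if err := json.NewDecoder(in).Decode(&req); err != nil {
		return fmt.Errorf("decode request: %w", err)
	}
	registry := strings.SplitN(req.Image, "/", 2)[0]
	user, pass, ok := lookup(registry)
	if !ok {
		return fmt.Errorf("no credentials for registry %q", registry)
	}
	return json.NewEncoder(out).Encode(response{
		APIVersion:    req.APIVersion,
		Kind:          "CredentialProviderResponse",
		CacheKeyType:  "Registry",
		CacheDuration: "10m",
		Auth:          map[string]authConfig{registry: {Username: user, Password: pass}},
	})
}

// runOnce feeds a request string through handle and parses the reply.
func runOnce(reqJSON string, lookup lookupFunc) (response, error) {
	var out bytes.Buffer
	if err := handle(strings.NewReader(reqJSON), &out, lookup); err != nil {
		return response{}, err
	}
	var resp response
	err := json.Unmarshal(out.Bytes(), &resp)
	return resp, err
}

func main() {
	lookup := func(registry string) (string, string, bool) {
		return "ci-bot", "example-token", registry == "registry.example.com"
	}
	resp, err := runOnce(`{"apiVersion":"credentialprovider.kubelet.k8s.io/v1","kind":"CredentialProviderRequest","image":"registry.example.com/team/api:1.4.2"}`, lookup)
	fmt.Println(resp.Kind, resp.Auth["registry.example.com"].Username, err)
}
```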

// runtime/v1/api.proto: RuntimeService excerpt (abridged)
service RuntimeService {
  rpc Version(VersionRequest) returns (VersionResponse) {};
  rpc RunPodSandbox(RunPodSandboxRequest) returns (RunPodSandboxResponse) {};
  rpc StopPodSandbox(StopPodSandboxRequest) returns (StopPodSandboxResponse) {};
  rpc RemovePodSandbox(RemovePodSandboxRequest) returns (RemovePodSandboxResponse) {};
  rpc ListPodSandbox(ListPodSandboxRequest) returns (ListPodSandboxResponse) {};

  rpc CreateContainer(CreateContainerRequest) returns (CreateContainerResponse) {};
  rpc StartContainer(StartContainerRequest) returns (StartContainerResponse) {};
  rpc StopContainer(StopContainerRequest) returns (StopContainerResponse) {};
  rpc RemoveContainer(RemoveContainerRequest) returns (RemoveContainerResponse) {};
  rpc ListContainers(ListContainersRequest) returns (ListContainersResponse) {};

  rpc Exec(ExecRequest) returns (ExecResponse) {};
  rpc Attach(AttachRequest) returns (AttachResponse) {};
  rpc PortForward(PortForwardRequest) returns (PortForwardResponse) {};

  rpc ContainerStats(ContainerStatsRequest) returns (ContainerStatsResponse) {};
}

Once you understand the CRI surface, debugging a node becomes much easier, because the same surface is available to humans through crictl, a CLI maintained alongside the kubelet that talks the same gRPC. When kubelet logs are unhelpful, crictl lets you ask the runtime directly.

# Read the runtime's view of the world. crictl bypasses the kubelet entirely.
$ sudo crictl --runtime-endpoint unix:///run/containerd/containerd.sock pods
POD ID         CREATED        STATE    NAME                NAMESPACE   ATTEMPT
3f2a1b9c8d7e   2 hours ago    Ready    web-7d8             prod        0
9e1f5c4a8b6d   3 hours ago    Ready    kube-proxy-9k4xt    kube-system 0
2b8d4e7f1c5a   3 hours ago    Ready    cilium-w7r2p        kube-system 0

$ sudo crictl ps -a
CONTAINER      IMAGE                              STATE     NAME       POD
ab12cd34ef56   nginx:1.27.0                       Running   web        web-7d8
fe65dc43ba21   registry.k8s.io/kube-proxy:v1.30.2 Running   kube-proxy kube-proxy-9k4xt
# If a pod is "stuck" in Pending, this is where to look first — does the sandbox even exist?

Production gotcha. The CRI socket is also how the kubelet does kubectl exec: the api-server forwards the exec stream to the kubelet's :10250, the kubelet calls Exec on the runtime, the runtime calls into the shim, the shim attaches to the container's TTY. Five hops, three protocols. If exec hangs, the question is which hop. Almost always it is the runtime; crictl exec from the node skips the first two hops and tells you immediately.

CSI. NodePublishVolume and the staging area.

Volumes used to be linked into the kubelet binary itself: there were "in-tree" volume plugins for AWS EBS, GCE PD, Azure Disk, NFS, iSCSI, Ceph, and so on, each implemented as a Go package inside pkg/volume/. Every storage vendor needed code merged upstream; every kubelet release was an integration testing nightmare. The CSI migration effort (KEP-625 for the core machinery, with per-driver KEPs such as KEP-1490 for Azure Disk) moved the cloud-provider plugins to the Container Storage Interface over the 1.2x releases, and as of 1.30 those in-tree implementations are stubs that proxy to CSI drivers. Today, almost every volume on a Kubernetes node is mounted by a per-driver DaemonSet pod that the kubelet talks to over a Unix socket.

CSI splits the work between two services. The controller plugin usually runs as a Deployment or StatefulSet with a single active replica somewhere in the cluster and handles cluster-wide operations: provision the underlying volume in the cloud, attach it to a node (in the AWS sense, where attach is an EC2 API call). The node plugin runs as a DaemonSet pod on every node and handles per-node operations. The kubelet only ever talks to the node plugin, on a socket the plugin advertises to the kubelet via the plugin registration protocol, a separate gRPC service at /var/lib/kubelet/plugins_registry/<driver>/. This dance is orchestrated by the kubelet's volumeManager.

A pod with a CSI volume goes through three steps inside the volumeManager. Step one, WaitForAttach: the kubelet waits for the cluster's external-attacher controller (in the CSI controller plugin) to mark the volume as attached to this node, by setting VolumeAttachment.status.attached=true in the api-server. Step two, NodeStageVolume: the kubelet calls the node plugin's NodeStageVolume RPC with a staging path on the host, /var/lib/kubelet/plugins/kubernetes.io/csi/volumes/staging/<volume-id>/. The plugin formats the device if necessary and mounts it at this path. Crucially, this is done once per node per volume, even if multiple pods on the node mount the same PV (read-write-many or read-only-many). Step three, NodePublishVolume: the kubelet calls NodePublishVolume with a target path inside the pod's directory, /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<volume-name>/mount, and the plugin bind-mounts the staging path onto the target. The pod now sees the volume.

The staging-then-publish split exists because mounting a filesystem from a block device (with mkfs on first use) is comparatively expensive, while bind-mounting an already-mounted directory is cheap. Without the staging step, every pod that uses the volume would have to mount the block device again; with it, the device is mounted once per node and each pod gets only a bind mount. The staging path also gives a stable filesystem identity to a volume across pod restarts on the same node, which the kubelet uses for crash recovery: if the kubelet restarts, it scans the staging directory, finds the existing mounts, and avoids re-staging.

Unmounting reverses the dance. SyncTerminatedPod calls the volumeManager's WaitForUnmount, which calls NodeUnpublishVolume (releases the per-pod target). Once the last pod on the node releases the volume, the volumeManager calls NodeUnstageVolume (releases the per-node staging). Once that returns, the external-attacher in the CSI controller plugin detaches the volume from the node. The kubelet blocks pod deletion on this: the pod's finalizer is not removed until unmount completes, which is the most common cause of pods stuck in Terminating. The diagnostic is mount | grep <pod-uid> on the node, plus the CSI driver pod's logs.
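The stage-once, publish-per-pod bookkeeping amounts to reference counting. The sketch below models only the ordering of the four node RPCs; the nodeVolumes type and its Publish and Unpublish methods are hypothetical, and the calls are recorded in a log rather than sent to a driver.

```go
package main

import "fmt"

// nodeVolumes tracks, per volume ID, which pods currently have it published.
type nodeVolumes struct {
	published map[string]map[string]bool // volume ID -> set of pod UIDs
	log       []string                   // RPCs that would have been issued, in order
}

func newNodeVolumes() *nodeVolumes {
	return &nodeVolumes{published: map[string]map[string]bool{}}
}

// Publish stages the volume on first use on this node, then publishes it
// for the given pod. Repeating a call for the same pod is a no-op.
func (n *nodeVolumes) Publish(volumeID, podUID string) {
	pods, staged := n.published[volumeID]
	if !staged {
		pods = map[string]bool{}
		n.published[volumeID] = pods
		n.log = append(n.log, "NodeStageVolume "+volumeID)
	}
	if !pods[podUID] {
		pods[podUID] = true
		n.log = append(n.log, "NodePublishVolume "+volumeID+" for "+podUID)
	}
}

// Unpublish releases the pod's target and unstages the volume once the
// last pod on the node has let go of it.
func (n *nodeVolumes) Unpublish(volumeID, podUID string) {
	pods := n.published[volumeID]
	if !pods[podUID] {
		return
	}
	delete(pods, podUID)
	n.log = append(n.log, "NodeUnpublishVolume "+volumeID+" for "+podUID)
	if len(pods) == 0 {
		delete(n.published, volumeID)
		n.log = append(n.log, "NodeUnstageVolume "+volumeID)
	}
}

func main() {
	n := newNodeVolumes()
	n.Publish("pvc-7b4c", "pod-a")
	n.Publish("pvc-7b4c", "pod-b")
	n.Unpublish("pvc-7b4c", "pod-a")
	n.Unpublish("pvc-7b4c", "pod-b")
	for _, entry := range n.log {
		fmt.Println(entry)
	}
}
```

The log shows one stage, two publishes, two unpublishes, and a single unstage at the very end, which is the ordering the paragraph describes.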

# What a CSI mount looks like in the output of mount on a node, for one pod with one PV.
/dev/nvme1n1 on /var/lib/kubelet/plugins/kubernetes.io/csi/volumes/staging/pvc-7b4c (rw,relatime)
/var/lib/kubelet/plugins/kubernetes.io/csi/volumes/staging/pvc-7b4c
  on /var/lib/kubelet/pods/3a2b1c-...-uid/volumes/kubernetes.io~csi/pvc-7b4c/mount (rw,relatime,bind)
# The first is from NodeStageVolume. The second is from NodePublishVolume.
# Two pods on the same node mounting the same PV add a second bind-mount; staging is shared.

Operational note. Most CSI drivers run their node plugin as a privileged DaemonSet because bind-mount and mkfs require CAP_SYS_ADMIN. If your CSI driver pod is failing to start, the volumeManager will block all pods that need its volumes, which cascades into NotReady nodes after a few minutes. Watch DaemonSet pod readiness on every upgrade.

Probes. Three kinds, one worker pool.

The kubelet runs three kinds of probe per container, all with identical structure but different semantics. Liveness answers "is the container in a state where it should be killed and restarted?". Failure kills the container (SIGTERM first, then SIGKILL once the grace period runs out) and, restart policy permitting, restarts it and increments restartCount. Readiness answers "is the container ready to receive traffic?". Failure removes the pod's IP from EndpointSlices, which causes kube-proxy to stop routing to it. Startup answers "has the container finished booting?". Until it succeeds, liveness and readiness probes are suspended; the practical effect is that slow-starting Java applications can use a long startup probe and then a tight liveness probe without false positives during boot. Startup is the newest of the three, GA in 1.20, and is the right answer to almost every "my probe keeps killing my pod during startup" support ticket.

Each probe can use one of three handlers. httpGet opens an HTTP connection to the container's pod IP on a configured port and checks the response code (2xx and 3xx are success). tcpSocket opens a TCP connection and checks that it completes the handshake. exec runs a command inside the container and checks the exit code. There is also grpc as of 1.24 (KEP-2727), which speaks the standard gRPC health-check protocol and is the right choice for gRPC services that already implement it.

Inside the kubelet, probes are managed by prober.Manager in pkg/kubelet/prober/. When SyncPod adds a pod, the prober manager starts one goroutine per probe per container, so a pod with two containers and all three probes on each gets six goroutines. Each goroutine runs an inner loop: time.Sleep(periodSeconds), run the handler, record the result, push to the status manager if the result changed. The pool is unbounded; on a node with many pods and tight probe periods, you can easily have a thousand probe goroutines running in parallel. This is fine, since they are mostly blocked on network I/O, but it means probe handlers should be cheap, because whatever a single probe costs in CPU is multiplied by the number of probed containers on the node.
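A single probe worker can be sketched as a small state machine plus a timed loop. This is an illustration, not the code in pkg/kubelet/prober/: the probeWorker type, its fields, and the callbacks are hypothetical, and the sketch uses a time.Ticker where the paragraph above describes a sleep. The thresholds correspond to failureThreshold and successThreshold in the pod spec.

```go
package main

import (
	"fmt"
	"time"
)

// probeWorker tracks one probe for one container.
type probeWorker struct {
	failureThreshold int
	successThreshold int
	healthy          bool
	streak           int // consecutive results that disagree with healthy
}

// observe folds one probe result into the state and reports whether the
// externally visible result flipped.
func (w *probeWorker) observe(ok bool) (changed bool) {
	if ok == w.healthy {
		w.streak = 0
		return false
	}
	w.streak++
	threshold := w.failureThreshold
	if ok {
		threshold = w.successThreshold
	}
	if w.streak >= threshold {
		w.healthy = ok
		w.streak = 0
		return true
	}
	return false
}

// run probes once per period until stop is closed, calling onChange only
// when the result flips (the "push to the status manager" step).
func (w *probeWorker) run(period time.Duration, probe func() bool, stop <-chan struct{}, onChange func(healthy bool)) {
	t := time.NewTicker(period)
	defer t.Stop()
	for {
		select {
		case <-stop:
			return
		case <-t.C:
			if w.observe(probe()) {
				onChange(w.healthy)
			}
		}
	}
}

func main() {
	w := &probeWorker{failureThreshold: 3, successThreshold: 1, healthy: true}
	stop := make(chan struct{})
	changed := make(chan bool, 1)
	alwaysFail := func() bool { return false }
	go w.run(5*time.Millisecond, alwaysFail, stop, func(h bool) { changed <- h })
	fmt.Println("healthy after three failed probes:", <-changed)
	close(stop)
}
```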

The semantics around httpGet have a wrinkle worth knowing. The kubelet follows redirects only within the same host (a redirect to a different hostname is not followed and counts as success, with a warning event), does not send a body, does not set Host headers in any clever way, and does not honour Transfer-Encoding: chunked the way browsers do. The User-Agent is kube-probe/<major>.<minor>. If your application's health endpoint returns 200 to curl but the kubelet records failures, suspect the host header (the kubelet uses the pod IP, not a DNS name) or a TLS misconfiguration (kubelet skips TLS verification on probes by default but does not skip handshake errors). The probe timeout is a hard timeout on the entire request, including TCP setup; for a slow httpGet handler, the right answer is to increase timeoutSeconds, not to add retries.

# A pod spec with all three probe types declared. Most production deployments have all three.
spec:
  containers:
  - name: api
    image: my-org/api:1.4.2
    ports:
    - containerPort: 8080
    startupProbe:
      httpGet: { path: /healthz/startup, port: 8080 }
      failureThreshold: 30                  # 30 × 10s = up to 5 minutes to boot
      periodSeconds: 10
    readinessProbe:
      httpGet: { path: /healthz/ready, port: 8080 }
      periodSeconds: 5
      timeoutSeconds: 2
      failureThreshold: 2
    livenessProbe:
      httpGet: { path: /healthz/live, port: 8080 }
      periodSeconds: 10
      timeoutSeconds: 2
      failureThreshold: 3                   # 3 × 10s = ~30s before the container is killed

The reason readiness and liveness should be different endpoints (and they almost always should) is that they answer different questions. A backend whose database connection has died is not ready (do not send it traffic) but also not necessarily dead (restarting will not help if the database itself is down). Tying the two together produces a CrashLoopBackOff cascade where every replica restarts in unison the moment the database hiccups. The right answer is a readiness probe that checks "can I serve a request right now" and a liveness probe that checks "am I still alive in the sense of having an event loop", typically a much simpler check that rarely fails unless the process is wedged. The probe generator walks through the trade-offs.

Probe gotcha. Exec probes fork a process inside the container on every period. A liveness probe with exec: pgrep myapp at periodSeconds: 1 across 100 pods on a node creates 100 short-lived processes per second on top of your workload, which on small nodes is enough to dominate scheduling latency. Prefer httpGet or tcpSocket when possible; reserve exec for cases where there is no socket-level signal.

Eviction. Six signals, two thresholds, one ordering.

The kubelet is the last line of defence between a node and total resource exhaustion. The Linux kernel's OOM killer is brutal and node-blind: under memory pressure it picks a victim by oom_score, kills it, and walks on, with no notion of pod priority or workload importance. The kubelet's eviction manager is a softer, smarter substitute: it watches a small set of eviction signals derived from cgroup metrics and filesystem stats, compares them against operator-defined thresholds, and proactively kills pods to bring the node back into a healthy state, picking victims by resource overage and priority (which in practice tracks QoS class) rather than by kernel heuristics.

There are six signals the kubelet monitors. Memory and PID exhaustion produce killable pods directly. Filesystem signals, split between the node's root filesystem and the runtime's image filesystem, first try to reclaim space by garbage-collecting unused images and stopped containers, and only escalate to evicting pods if reclamation fails. Each signal has a soft threshold (must persist for a grace period before triggering, allowing pods to exit with their normal termination grace) and a hard threshold (immediate, no grace period). The defaults in modern kubelet builds are conservative; many production operators tighten them.

Signal	Source	Example soft (opt-in, no built-in default)	Default hard	What kubelet does
memory.available	cgroup memory.current vs node total	< 500Mi for 2m	< 100Mi	rank pods by usage above request; BestEffort pods tend to go first
nodefs.available	statfs of /var/lib/kubelet	< 15%	< 10%	evict pods with emptyDir or large logs; trigger image-GC
nodefs.inodesFree	statfs inodes on rootfs	< 10%	< 5%	same ranking as nodefs.available; inode exhaustion is rarer
imagefs.available	statfs of containerd image dir	< 15%	< 15%	image GC first; evict only if container layers are still wedged
imagefs.inodesFree	statfs inodes on imagefs	< 10%	< 5%	image GC; rarely fires alone
pid.available	/proc/sys/kernel/pid_max minus current	< 10%	none (opt-in)	evict the largest fork-bomb pod; node condition PIDPressure=True

When a threshold is crossed, the eviction manager runs a small ranking algorithm. First, it excludes critical pods (priority of at least 2 billion, or annotation scheduler.alpha.kubernetes.io/critical-pod on legacy installs). Second, among the remainder, pods whose usage of the starved resource exceeds their request rank ahead of pods that stay within it. Third, within each of those groups, lower-priority pods go before higher-priority ones. Fourth, ties are broken by how far usage exceeds the request, largest overage first. QoS class is not a sort key in its own right; the familiar order (BestEffort first, Burstable next, Guaranteed last) falls out of the request comparison, because BestEffort pods have no requests to stay within and Guaranteed pods cannot exceed theirs. The kubelet evicts pods one at a time, re-checks the signal, and stops as soon as the threshold clears. The whole loop runs once every ten seconds, the same housekeeping cadence as everything else.
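
One way to sketch the ranking for memory pressure is a single sort with a composite key. This follows the order given in the upstream node-pressure eviction documentation (over-request first, then priority, then size of the overage), under which QoS class is not a separate sort key but falls out of the request comparison. The PodStats type and its field names are hypothetical, and the sketch ignores static and mirror pods.

```python
from dataclasses import dataclass
from typing import List

# Pods at or above this priority are treated as critical and never ranked.
SYSTEM_CRITICAL_PRIORITY = 2_000_000_000


@dataclass
class PodStats:
    """Hypothetical per-pod summary; not a kubelet type."""
    name: str
    priority: int
    request_mib: int   # memory request; 0 for a BestEffort pod
    usage_mib: int     # current memory working set


def rank_for_memory_eviction(pods: List[PodStats]) -> List[PodStats]:
    """Return eviction candidates, first victim first."""
    candidates = [p for p in pods if p.priority < SYSTEM_CRITICAL_PRIORITY]
    return sorted(
        candidates,
        key=lambda p: (
            p.usage_mib <= p.request_mib,     # False sorts first: over-request pods lead
            p.priority,                       # then lowest priority first
            -(p.usage_mib - p.request_mib),   # then largest overage first
        ),
    )
```

Because a BestEffort pod has a request of zero, any usage at all puts it in the over-request group, which is why it tends to be evicted before a Guaranteed pod of the same priority.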

All of this is configurable through the KubeletConfiguration's evictionHard, evictionSoft, evictionSoftGracePeriod, and evictionMaxPodGracePeriod fields. The defaults are deliberately conservative: they target steady-state production, not bursty batch, and most operators running batch workloads tighten them. The pod eviction simulator lets you walk through different threshold settings against a synthetic node.

# /var/lib/kubelet/config.yaml — eviction excerpt of a real KubeletConfiguration.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration

evictionHard:
  memory.available: "200Mi"
  nodefs.available: "10%"
  nodefs.inodesFree: "5%"
  imagefs.available: "10%"
  pid.available: "5%"

evictionSoft:
  memory.available: "500Mi"
  nodefs.available: "15%"
  imagefs.available: "15%"

evictionSoftGracePeriod:
  memory.available: "2m"
  nodefs.available: "2m"
  imagefs.available: "2m"

evictionMaxPodGracePeriod: 60          # cap pod's terminationGracePeriod under soft eviction

# See also /tools/k8s/resource-calculator/ for picking sensible thresholds vs node size.

A subtlety: memory.available is computed not from /proc/meminfo-style tools such as free but from cgroup accounting: node capacity minus the working set, where the working set is usage minus reclaimable inactive file pages. The budget for pods as a group is a separate number, tracked on the kubepods cgroup: node capacity minus the configured systemReserved + kubeReserved, less the hard eviction threshold. This is intentional. The kubelet does not care if the host kernel is running close to the line for its own caches; it cares whether the pods it is running together have collectively used up the budget you allocated to them. Sizing systemReserved too large shrinks that budget, so the kubelet evicts pods while the host still has memory, which feels weird; sizing it too small lets pods crowd out the host's own daemons, and the kernel OOM killer fires before the kubelet acts. The resource calculator tool walks through reasonable defaults by node size.
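
The arithmetic is small enough to write down. This is a sketch under stated assumptions: all quantities are integers in one unit (MiB in the usage note), the function names are made up for illustration, and the working-set definition (usage minus inactive file pages, floored at zero) follows the upstream node-pressure eviction documentation.

```python
def working_set(usage: int, inactive_file: int) -> int:
    """Memory that cannot be cheaply reclaimed: usage minus inactive page cache."""
    return max(usage - inactive_file, 0)


def memory_available(capacity: int, usage: int, inactive_file: int) -> int:
    """The value compared against the memory.available thresholds."""
    return capacity - working_set(usage, inactive_file)


def pod_budget(capacity: int, kube_reserved: int, system_reserved: int,
               eviction_hard: int) -> int:
    """Memory the pods on the node may collectively use (node allocatable)."""
    return capacity - kube_reserved - system_reserved - eviction_hard
```

On a 16384 MiB node with 512 MiB each for kubeReserved and systemReserved and a 200 MiB hard threshold, pods share a budget of 15160 MiB. Raising systemReserved by 1 GiB takes exactly 1 GiB from that budget, whether or not the host ever uses it.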

Eviction war story. A node with a stuck containerd sometimes shows imagefs.available dropping while the pods themselves are healthy. The eviction manager will start killing pods to make space, the killed pods will not actually release disk because their layers are still mounted, and the node spirals into evicting every pod on it. The remediation is always to fix the runtime first; the eviction manager will never recover a node whose runtime is wedged.

cgroups, the :10250 API, device plugins, and where to read more.

A few last subsystems deserve a paragraph each before the source pointers. cgroup setup is the kubelet's job, even though the runtime does the actual cgroup-write. The kubelet computes a per-pod cgroup path under /sys/fs/cgroup/kubepods.slice/ based on the pod's QoS class: kubepods-besteffort.slice/ for BestEffort, kubepods-burstable.slice/ for Burstable, and the root kubepods.slice/ for Guaranteed. It then passes this path to the runtime in the cgroup_parent field of RunPodSandbox; the runtime creates the per-container leaves under it. The two cgroup drivers, cgroupfs and systemd, must match between the kubelet and the runtime, or neither can find the other's cgroups; this is the most common kubeadm-bootstrap failure mode. cgroup v2 (the default on most modern distros) consolidates memory, CPU, and IO into one unified hierarchy and makes a few of the older accounting weirdnesses go away.
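
The QoS-to-cgroup mapping can be sketched in a few lines. Two things here are assumptions rather than statements from this page: the pod-level directory name (kubepods-<tier>-pod<uid>.slice, with dashes in the UID replaced by underscores) reflects the systemd driver's naming convention, and the function names and the dict-based container shape are hypothetical. Quantities are compared as literal strings, so "500m" and "0.5" would not be treated as equal.

```python
from typing import Dict, List

RESOURCES = ("cpu", "memory")


def qos_class(containers: List[Dict[str, Dict[str, str]]]) -> str:
    """Classify a pod from its containers' requests and limits."""
    any_set = False
    guaranteed = True
    for c in containers:
        requests = c.get("requests", {})
        limits = c.get("limits", {})
        for r in RESOURCES:
            if r in requests or r in limits:
                any_set = True
            limit = limits.get(r)
            # An unset request defaults to the limit; Guaranteed needs both
            # cpu and memory limits set, with requests equal to them.
            if limit is None or requests.get(r, limit) != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


def pod_cgroup_dir(qos: str, pod_uid: str) -> str:
    """Filesystem path of the pod-level cgroup (systemd driver naming assumed)."""
    uid = pod_uid.replace("-", "_")
    root = "/sys/fs/cgroup/kubepods.slice"
    if qos == "Guaranteed":
        return f"{root}/kubepods-pod{uid}.slice"
    tier = qos.lower()
    return f"{root}/kubepods-{tier}.slice/kubepods-{tier}-pod{uid}.slice"
```

A pod with only a memory request lands under kubepods-burstable.slice/, while one with equal cpu and memory requests and limits sits directly under kubepods.slice/.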

The :10250 HTTPS API is the kubelet's only inbound surface. It serves /healthz for kubelet self-checks, /metrics for Prometheus, /stats/summary for the metrics-server pipeline, /containerLogs/<namespace>/<pod>/<container> for kubectl logs, /exec/<namespace>/<pod>/<container> for kubectl exec, /portForward/<namespace>/<pod> for kubectl port-forward, and a few internal endpoints. Authentication is via TLS client cert (the api-server presents its kubelet client certificate) or webhook (TokenReview back to the api-server for service-account tokens). Authorisation is via SubjectAccessReview to the api-server, which means the kubelet defers to RBAC for who can exec into a pod; you do not configure this on the kubelet itself, you configure it via Roles and ClusterRoles bound to whoever is calling kubectl. Public exposure of :10250 is one of the most catastrophic possible misconfigurations of a Kubernetes cluster. Pre-1.16 it was anonymous, and even today if RBAC is permissive it is a remote-code-execution gateway.
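
To see what the kubelet actually asks the api-server in a SubjectAccessReview, it helps to map an incoming request to a verb and a node subresource. The sketch below follows the mapping described in the upstream kubelet authorization documentation (stats, metrics, log, and spec subresources, with everything else treated as proxy); the function and constant names are hypothetical, and newer fine-grained subresources are not modelled.

```python
from typing import Tuple

# HTTP method -> RBAC verb
VERBS = {
    "POST": "create",
    "GET": "get",
    "HEAD": "get",
    "PUT": "update",
    "PATCH": "patch",
    "DELETE": "delete",
}

# URL prefix -> subresource of "nodes"
SUBRESOURCES = (
    ("/stats", "stats"),
    ("/metrics", "metrics"),
    ("/logs", "log"),
    ("/spec", "spec"),
)


def _is_subpath(path: str, prefix: str) -> bool:
    """True for the prefix itself or anything below it, but not '/statsfoo'."""
    return path == prefix or path.startswith(prefix + "/")


def kubelet_authz_attributes(method: str, path: str) -> Tuple[str, str]:
    """Return (verb, resource) for a request to the kubelet API.

    Raises KeyError for an HTTP method that has no RBAC verb in VERBS.
    """
    verb = VERBS[method.upper()]
    for prefix, subresource in SUBRESOURCES:
        if _is_subpath(path, prefix):
            return verb, f"nodes/{subresource}"
    return verb, "nodes/proxy"   # exec, portForward, containerLogs, ...
```

The practical consequence is that anything able to create nodes/proxy can reach the exec endpoint on every node, which is why that permission deserves the same scrutiny as cluster-admin.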

The device plugin framework is how the kubelet learns about per-node resources beyond CPU and memory: GPUs, FPGAs, RDMA NICs, dedicated cores, anything vendor-specific. A device plugin is a process running on the node, usually as a privileged DaemonSet pod, that registers itself with the kubelet via gRPC over a Unix socket at /var/lib/kubelet/device-plugins/<driver>.sock. The plugin announces a resource name (e.g., nvidia.com/gpu) and reports the list of devices it manages. When a pod requests one (resources.limits["nvidia.com/gpu"]: 1), the kubelet calls the plugin's Allocate RPC for a device, gets back a list of device node paths, and passes them to the runtime as host devices to bind into the container. Topology Manager builds on the same data for NUMA awareness: it is implemented inside the kubelet and merges placement hints from the device manager and CPU Manager before a pod is admitted. NVIDIA's nvidia-device-plugin is the canonical example.
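
The bookkeeping side of this flow can be modelled without any gRPC at all. The following toy is an illustration only: DeviceBook and its methods are hypothetical names, it stands in for the kubelet's device accounting rather than reproducing it, and it leaves out health reporting, topology hints, and the actual Allocate call to the plugin.

```python
from typing import Dict, List, Set


class DeviceBook:
    """Toy ledger of devices advertised under an extended resource name."""

    def __init__(self) -> None:
        self._devices: Dict[str, Set[str]] = {}        # resource -> advertised IDs
        self._in_use: Dict[str, Dict[str, str]] = {}   # resource -> device ID -> pod

    def advertise(self, resource: str, device_ids: List[str]) -> None:
        """What a plugin does after registering: report the devices it manages."""
        self._devices[resource] = set(device_ids)
        self._in_use.setdefault(resource, {})

    def allocate(self, resource: str, pod: str, count: int) -> List[str]:
        """Pick free devices for a pod, or fail if the node cannot satisfy it."""
        in_use = self._in_use.get(resource, {})
        free = sorted(self._devices.get(resource, set()) - set(in_use))
        if len(free) < count:
            raise RuntimeError(f"{pod}: wanted {count} x {resource}, {len(free)} free")
        chosen = free[:count]
        for device_id in chosen:
            in_use[device_id] = pod
        return chosen

    def release(self, resource: str, pod: str) -> None:
        """Return a terminated pod's devices to the free pool."""
        in_use = self._in_use.get(resource, {})
        for device_id in [d for d, owner in in_use.items() if owner == pod]:
            del in_use[device_id]
```

Devices are whole units in this model, which matches how extended resources are requested: a pod asks for an integer count of nvidia.com/gpu, never a fraction.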

If you want to read further, the source tree is unusually approachable: the kubelet binary is one main package and most subsystems are well-named subdirectories. pkg/kubelet/kubelet.go is the SyncLoop; pkg/kubelet/pod_workers.go is the per-pod state machine; pkg/kubelet/kuberuntime/ is the CRI client wrapper; pkg/kubelet/volumemanager/ is the volume reconciliation; pkg/kubelet/prober/ is the probe pool; pkg/kubelet/eviction/ is the eviction manager; pkg/kubelet/cm/ is the cgroup, CPU, memory, and topology managers. A weekend with a tags file and the test suite will get you further than any documentation.

Pair this page with the pod lifecycle sub-page (the cluster-side of everything described here), architecture (why the kubelet sits where it does), and scheduler (what runs upstream of the kubelet). For hands-on understanding, the pod eviction simulator lets you set thresholds and watch the ranking algorithm pick victims; the probe generator walks through readiness/liveness/startup combinations; and the resource calculator picks systemReserved and kubeReserved values for a given node size.

One closing observation. The kubelet is the most "operational" piece of Kubernetes — the part you find yourself debugging at three in the morning when a node is on fire. Almost every production incident eventually surfaces in journalctl -u kubelet, and almost every fix involves understanding which subsystem is unhappy. Read the SyncLoop, then read the volumeManager, then read the eviction manager, in that order; the first two are where most slow problems live and the third is where most fast ones do. The codebase is large but surprisingly readable, and a habit of pulling up the source rather than guessing will pay off more often than any other single skill in operating Kubernetes at scale.
