Sub-page 03 · for infra + operator authors

Kubernetes internals · Pod lifecycle

One Pod, eleven plugins, a thousand syscalls.

The complete trace of a single Pod from the moment it lands in etcd as a Pending object with no nodeName to the moment its first container is serving traffic. The scheduler picks a node, the kubelet on that node observes the watch, the container runtime executes the plan, and probes start firing.

Roughly 4,500 words. Pair it with the architecture sub-page for the eight-process map, and the eviction simulator for the decision tree at the end.

From etcd insertion to "Pending".

A Pod's life begins as a write to etcd, mediated by the api-server, with no node assigned and no containers running. When you run kubectl run, when a ReplicaSet controller decides it needs an extra replica, when the Job controller spawns the next attempt: in every case, the request that hits the api-server is a POST /api/v1/namespaces/{ns}/pods with a Pod body that has spec.nodeName empty. The api-server defaults the unset fields (the default restartPolicy, the default terminationGracePeriodSeconds), runs mutating admission webhooks, validates the resulting object, runs validating admission webhooks, and writes the object to etcd at /registry/pods/{ns}/{name}. At that moment the Pod's status.phase is Pending and nothing has yet set its PodScheduled condition to True.

Pending is not a single state; it is a basin of attraction. It covers everything from "no node has been picked yet" to "node picked, image pulling" to "image pulled, sandbox starting". The phase only flips to Running when at least one container reports a non-empty state.running. This is why kubectl get pod can show Pending for ten seconds during normal start-up and ten minutes during a stuck image pull, and the user has to look at the conditions and events to tell the difference. The Pod's status is a tape of what every participant has reported, and the participants only report what they directly observe.

The first participant after the api-server is the scheduler. It watches Pods on a long-lived HTTP/2 stream and queues every Pod whose spec.nodeName is still empty. The watch event arrives within milliseconds of the etcd commit; in a healthy cluster the round trip from kubectl run to "scheduler has noticed" is under fifty milliseconds, almost all of it network. The scheduler does not act on the event synchronously, though. It enqueues the Pod into an internal priority queue called the activeQ, ordered by Pod priority (then creation timestamp), and picks the next item only when its single scheduling worker is free. Scheduling cycles are, by design, serial for a given scheduler instance: concurrent cycles would race on the node-resource accounting cache.

The second participant is whichever controller cares about that Pod's owner. If the Pod was created by a ReplicaSet, the ReplicaSet controller is watching its owned Pods and will notice the new addition. If the Pod was created by a Job, the Job controller will. These owner-controllers are not what schedule the Pod; their concern is replica-count bookkeeping. They will, however, react to the Pod's eventual transitions, most importantly to Failed or Succeeded, when they may create a replacement.

Inside etcd, the Pod is a single key with a binary protobuf value. Its resourceVersion is the global etcd revision at which it was written, and that number is the cursor every watcher uses to resume after a disconnect. From here on, every state change to this Pod (the scheduler binding it, the kubelet reporting container statuses, a controller adding a label) is a separate Update request to the api-server, each of which becomes a new etcd revision. By the time a Pod is Running, it has typically accumulated twenty to forty distinct revisions of its object, all of which streamed past every watcher of the Pods resource.

Subtlety: status.phase is the legacy field and is intentionally lossy. The truth is in status.conditions, whose types are PodScheduled, Initialized, ContainersReady, and Ready. A Pod can be Running but not Ready (probe failing); a controller that implements "ready" as "phase == Running" will route traffic to a broken backend.
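The difference between phase and conditions is easy to see in a few lines. The sketch below uses trimmed, hypothetical stand-ins for the real API types (PodStatus and PodCondition here are local structs, not the ones from k8s.io/api), and shows a readiness check that looks only at the Ready condition.

```go
package main

import "fmt"

// PodCondition is a hypothetical, trimmed stand-in for a Pod status condition.
type PodCondition struct {
	Type   string
	Status string
}

// PodStatus is a hypothetical, trimmed stand-in for a Pod's status.
type PodStatus struct {
	Phase      string
	Conditions []PodCondition
}

// isReady reports whether the Ready condition is present and "True".
// It deliberately ignores Phase.
func isReady(s PodStatus) bool {
	for _, c := range s.Conditions {
		if c.Type == "Ready" {
			return c.Status == "True"
		}
	}
	return false
}

func main() {
	// A Pod whose containers run but whose readiness probe is failing.
	failing := PodStatus{
		Phase: "Running",
		Conditions: []PodCondition{
			{Type: "PodScheduled", Status: "True"},
			{Type: "Ready", Status: "False"},
		},
	}
	fmt.Println("phase is Running:", failing.Phase == "Running")
	fmt.Println("ready for traffic:", isReady(failing))
}
```

A check written as failing.Phase == "Running" would accept this Pod; the condition-based check rejects it.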

The scheduling framework: eleven extension points, in order.

The kube-scheduler binary used to be a hard-coded pipeline of "predicates then priorities". As of 1.19 it is a plugin host. The scheduling framework (KEP-624) defines eleven extension points; every built-in scheduler behaviour (node affinity, taint toleration, resource fit, topology spread) is a plugin that registers at one or more of those points. The host is small and stupid; the plugins do the work. This is the same architectural shift Linux did with the Netfilter framework, and it has the same benefit: behaviours can be added, removed, and reordered by configuration without forking the binary.

A scheduling cycle for one Pod walks the points in a fixed order: QueueSort, PreFilter, Filter, PostFilter (only if Filter found nothing), PreScore, Score (with its NormalizeScore step), Reserve, Permit, PreBind, Bind, PostBind. That is eleven points once NormalizeScore is counted as part of Score. The first six only read cluster state, with the exception of PostFilter when it preempts; the last five emit writes (Reserve mutates the cache, Bind mutates the cluster, PostBind emits side-effects). Each point has a registered list of plugins; each plugin returns a status that is either Success, Skip, Wait, Unschedulable, or Error. Success and Skip move on; Unschedulable short-circuits the cycle; Error fails the Pod with a retry; Wait parks the Pod until a Permit plugin releases it.

The framework's central invariant is that the cycle is purely functional up to Reserve. The scheduler can run hundreds of cycles per second, computing hypothetical placements, with no visible side-effect on the cluster. Reserve is the first commit point: it claims the resources on the node in the scheduler's local cache so that the next pod in the queue cannot double-book them. PreBind can do real-world work like provisioning a volume; Bind is the last step where the decision becomes visible to the rest of the cluster, by writing spec.nodeName via the Bind subresource. Until Bind succeeds, every previous step is reversible: if a later step fails, the reservation is released, the Pod stays Pending, and it goes back to the queue for another cycle.
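The reserve-then-bind discipline can be sketched in miniature. This is a toy model, not the framework's API: Node, Scheduler, scheduleOne, bindOK and bindFail are all hypothetical names, CPU is the only resource, and the bind function stands in for the API call.

```go
package main

import (
	"errors"
	"fmt"
)

// Node is a hypothetical, trimmed view of a node: name plus allocatable milli-CPU.
type Node struct {
	Name             string
	AllocatableMilli int64
}

// Scheduler holds the local cache that Reserve mutates: node name -> reserved milli-CPU.
type Scheduler struct {
	reserved map[string]int64
}

var errBind = errors.New("bind conflict")

// bindOK and bindFail stand in for a successful and a failed Bind call.
func bindOK(string) error   { return nil }
func bindFail(string) error { return errBind }

// scheduleOne runs Filter, Score, Reserve and Bind for one Pod.
// If bind fails, the reservation is rolled back and the Pod would be requeued.
func (s *Scheduler) scheduleOne(nodes []Node, podMilli int64, bind func(node string) error) (string, error) {
	best, bestFree := "", int64(-1)
	for _, n := range nodes {
		free := n.AllocatableMilli - s.reserved[n.Name]
		if free < podMilli { // Filter: node is infeasible
			continue
		}
		if free > bestFree { // Score: prefer the node with the most free CPU
			best, bestFree = n.Name, free
		}
	}
	if best == "" {
		return "", errors.New("0 nodes available")
	}
	s.reserved[best] += podMilli // Reserve: first side effect, local cache only
	if err := bind(best); err != nil {
		s.reserved[best] -= podMilli // Unreserve: undo the claim
		return "", err
	}
	return best, nil
}

func main() {
	s := &Scheduler{reserved: map[string]int64{}}
	nodes := []Node{{"node-1", 1000}, {"node-2", 2000}}

	node, err := s.scheduleOne(nodes, 500, bindFail)
	fmt.Printf("failed bind:  node=%q err=%v reserved=%d\n", node, err, s.reserved["node-2"])

	node, err = s.scheduleOne(nodes, 500, bindOK)
	fmt.Printf("clean bind:   node=%q err=%v reserved=%d\n", node, err, s.reserved["node-2"])
}
```

The point of the sketch is the ordering: nothing outside the local cache changes until bind returns, so a failed attempt leaves no trace.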

Custom schedulers and operator authors interact with the framework in two ways. The lightweight way is a KubeSchedulerConfiguration file, which lets you reorder built-in plugins, disable some, and configure their parameters per profile. A profile is named, and Pods opt into a profile by setting spec.schedulerName; you can run several profiles in the same binary. The heavyweight way is to compile your own scheduler binary that imports k8s.io/kubernetes/pkg/scheduler/framework, registers your custom plugin, and runs as an additional scheduler alongside the default. The scheduler-plugins project ships several useful out-of-tree examples: coscheduling, capacity-scheduling, node-resource-topology.

# A KubeSchedulerConfiguration with two profiles: the default one with
# NodeResourcesBalancedAllocation disabled, and a batch profile that adds the
# coscheduling plugin from scheduler-plugins.
# All profiles in one binary must share the same queueSort plugin, so
# Coscheduling replaces the default sorter in both.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    plugins:
      queueSort:
        enabled:
          - name: Coscheduling
        disabled:
          - name: "*"
      score:
        disabled:
          - name: NodeResourcesBalancedAllocation
  - schedulerName: batch-scheduler
    plugins:
      queueSort:
        enabled:
          - name: Coscheduling
        disabled:
          - name: "*"
      preFilter:
        enabled:
          - name: Coscheduling
      permit:
        enabled:
          - name: Coscheduling
    pluginConfig:
      - name: Coscheduling
        args:
          permitWaitingTimeSeconds: 60

Operational rule: never write a Score plugin that calls the api-server inline. Score runs once per feasible node, so a 200-node cluster means up to 200 round trips per Pod, and every other Pod is blocked behind it. All data the plugin needs must be in informer-backed caches; the framework's Handle exposes SnapshotSharedLister for exactly this.

Predicates: the five filters every Pod walks through.

The Filter extension point in modern Kubernetes runs roughly fifteen built-in plugins, but five of them carry the load on a typical cluster: NodeAffinity, NodeUnschedulable, NodeResourcesFit, PodTopologySpread, and TaintToleration. If a Pod fails any of them on every node, it stays Pending forever (or until something about the cluster or the Pod changes), and the events log records "0/N nodes available". A working understanding of what each one rejects is the difference between debugging a stuck deployment in three minutes and three hours.

NodeAffinity implements both spec.nodeSelector (the legacy form, equality-only) and spec.affinity.nodeAffinity (the modern form, with operators In, NotIn, Exists, DoesNotExist, Gt, Lt). The plugin reads the Pod's required-during-scheduling rules and matches them against each Node's labels; if no required term matches, the node is filtered out. The preferred-during-scheduling rules live at the Score point, not Filter. They bias placement but never veto it.

NodeUnschedulable is the simplest plugin: if the Node has spec.unschedulable: true, the node is filtered out unless the Pod tolerates the node.kubernetes.io/unschedulable taint. This is the flag kubectl cordon sets. It is also why a cordoned node can still keep its existing pods: Filter only runs at scheduling time, never on already-bound pods. TaintToleration generalises the idea: a Node can carry one or more taints (key, value, effect), each with effect NoSchedule, PreferNoSchedule, or NoExecute; the plugin filters out nodes that carry a NoSchedule or NoExecute taint the Pod does not tolerate. PreferNoSchedule taints only lower the node's score.
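The matching rule behind TaintToleration fits in a short function. The sketch below uses hypothetical local types rather than the real API structs, and it treats both NoSchedule and NoExecute taints as filtering, which is how I understand the upstream plugin to behave.

```go
package main

import "fmt"

// Taint and Toleration are hypothetical, trimmed stand-ins for the API types.
type Taint struct{ Key, Value, Effect string }
type Toleration struct{ Key, Operator, Value, Effect string }

// tolerates reports whether one toleration matches one taint.
// An empty Effect or Key on the toleration acts as a wildcard.
func tolerates(tol Toleration, t Taint) bool {
	if tol.Effect != "" && tol.Effect != t.Effect {
		return false
	}
	if tol.Key != "" && tol.Key != t.Key {
		return false
	}
	switch tol.Operator {
	case "Exists":
		return true
	case "", "Equal": // Equal is the default operator
		return tol.Value == t.Value
	}
	return false
}

// nodeFeasible returns false if the node carries a filtering taint
// that no toleration matches. PreferNoSchedule taints never filter.
func nodeFeasible(taints []Taint, tols []Toleration) bool {
	for _, t := range taints {
		if t.Effect != "NoSchedule" && t.Effect != "NoExecute" {
			continue
		}
		tolerated := false
		for _, tol := range tols {
			if tolerates(tol, t) {
				tolerated = true
				break
			}
		}
		if !tolerated {
			return false
		}
	}
	return true
}

func main() {
	tols := []Toleration{{Key: "dedicated", Operator: "Equal", Value: "web", Effect: "NoSchedule"}}
	fmt.Println(nodeFeasible([]Taint{{"dedicated", "web", "NoSchedule"}}, tols))
	fmt.Println(nodeFeasible([]Taint{{"gpu", "true", "NoSchedule"}}, tols))
	fmt.Println(nodeFeasible([]Taint{{"gpu", "true", "PreferNoSchedule"}}, tols))
}
```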

NodeResourcesFit is the plugin that reads requests on the Pod's containers and checks that the sum, plus what is already requested by other Pods on the candidate Node, fits within the Node's allocatable. Allocatable is not the Node's raw capacity; the kubelet reserves a slice for the system (--system-reserved) and the kubelet itself (--kube-reserved) and exposes the rest. The plugin is what makes resource requests load-bearing: a Pod with no requests can be scheduled to a node that is already at 100% memory because the predicate has no signal to reject it. This is the most common cause of OOM cascades in undersized clusters.
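The fit check is plain arithmetic on requests. A minimal sketch, assuming only CPU and memory matter and using a hypothetical Resources struct, shows why a Pod with no requests sails through even on a fully committed node.

```go
package main

import "fmt"

// Resources is a hypothetical pair of quantities: milli-CPU and MiB of memory.
type Resources struct {
	MilliCPU  int64
	MemoryMiB int64
}

// insufficient returns the reasons a Pod does not fit on a node, or nil if it fits.
// pod is the sum of the Pod's container requests, requested is what other Pods
// on the node already request, and allocatable is the node's allocatable.
// A zero request for a resource is never rejected: there is nothing to compare.
func insufficient(pod, requested, allocatable Resources) []string {
	var reasons []string
	if pod.MilliCPU > 0 && requested.MilliCPU+pod.MilliCPU > allocatable.MilliCPU {
		reasons = append(reasons, "Insufficient cpu")
	}
	if pod.MemoryMiB > 0 && requested.MemoryMiB+pod.MemoryMiB > allocatable.MemoryMiB {
		reasons = append(reasons, "Insufficient memory")
	}
	return reasons
}

func main() {
	allocatable := Resources{MilliCPU: 4000, MemoryMiB: 16384}
	requested := Resources{MilliCPU: 3900, MemoryMiB: 16384} // memory fully requested

	withRequests := Resources{MilliCPU: 250, MemoryMiB: 256}
	fmt.Println("pod with requests:", insufficient(withRequests, requested, allocatable))

	noRequests := Resources{}
	fmt.Println("pod without requests:", insufficient(noRequests, requested, allocatable))
}
```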

PodTopologySpread implements spec.topologySpreadConstraints, the modern, declarative version of zone spreading. A constraint says "across the topology key topology.kubernetes.io/zone, the maximum skew between zones for Pods matching this label selector should be 1". The plugin counts existing matching Pods per zone, computes the skew if a new Pod were added to each candidate node, and filters out any node where the skew would exceed maxSkew. This is what stops a ReplicaSet of 3 from putting all 3 replicas in one zone, even when the resource fit allows it.
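The skew test can be written down directly. In this sketch violatesSpread is a hypothetical helper, and the counts map is assumed to list every eligible zone, including zones that currently hold zero matching Pods.

```go
package main

import "fmt"

// violatesSpread reports whether placing one more matching Pod in candidateZone
// would push the skew above maxSkew. counts maps zone -> existing matching Pods.
// Skew is the candidate zone's count after placement minus the smallest count.
func violatesSpread(counts map[string]int, candidateZone string, maxSkew int) bool {
	minCount := -1
	for _, n := range counts {
		if minCount == -1 || n < minCount {
			minCount = n
		}
	}
	if minCount == -1 { // no zones known yet
		minCount = 0
	}
	skew := counts[candidateZone] + 1 - minCount
	return skew > maxSkew
}

func main() {
	// One replica already runs in zone-a; zones b and c are empty.
	counts := map[string]int{"zone-a": 1, "zone-b": 0, "zone-c": 0}
	for _, zone := range []string{"zone-a", "zone-b", "zone-c"} {
		fmt.Printf("%s filtered out: %v\n", zone, violatesSpread(counts, zone, 1))
	}
}
```

With maxSkew 1, the second replica cannot land in zone-a, which is the mechanism that keeps a three-replica ReplicaSet from collapsing into one zone.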

# Pod with affinity, toleration, and topology spread — exercises four of the five predicates.
apiVersion: v1
kind: Pod
metadata:
  name: web-7d8
  namespace: prod
  labels:
    app: web
spec:
  containers:
    - name: web
      image: ghcr.io/acme/web:1.4.2
      resources:
        requests: { cpu: 250m, memory: 256Mi }
        limits: { cpu: 1, memory: 512Mi }
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - { key: kubernetes.io/arch, operator: In, values: [amd64] }
              - { key: node.acme.io/pool, operator: In, values: [general] }
  tolerations:
    - { key: dedicated, operator: Equal, value: web, effect: NoSchedule }
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels: { app: web }

When a Pod cannot fit, the scheduler emits an Event describing the per-plugin reasons, and you can read it with kubectl describe pod. The format is "0/12 nodes are available: 4 Insufficient cpu, 3 node(s) didn't match Pod's node affinity/selector, 5 node(s) had untolerated taint." That string is the predicate-by-predicate breakdown, and it is the single most useful diagnostic in the cluster. If you ever see "0/N nodes available" with no further detail, your scheduler is mis-configured to suppress the detail; un-suppress it.

Production gotcha: NodeResourcesFit does not consider limits, only requests. A Node where the sum of requests is 3 of 4 cores has 1 core "available" to the scheduler, even if every existing Pod has limit equal to capacity and is currently saturating the box. Use the resource calculator to plan requests against your real workload P95s, not nominal averages.

Scoring, and the "stop after N feasible" optimisation.

After Filter has produced a list of feasible nodes, Score runs every registered Score plugin against every feasible node and accumulates a per-node total. The default scoring plugins are NodeResourcesFit (least-allocated or most-allocated, by policy), NodeResourcesBalancedAllocation (prefer nodes whose CPU and memory utilisation would stay balanced after the Pod is placed), ImageLocality (prefer nodes that already have the Pod's image), InterPodAffinity (the preferred form of affinity, applied as a soft signal), and the soft side of NodeAffinity and PodTopologySpread. Each plugin returns 0–100 per node; NormalizeScore can rescale; the final score is a weighted sum, weights configurable per profile.
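The weighted sum is simple enough to work through by hand. The sketch below is illustrative only: the node names, plugin scores, and the exampleScores helper are made up, and ties are broken by node name purely to keep the example deterministic.

```go
package main

import (
	"fmt"
	"sort"
)

// PluginScore is one plugin's normalised score (0-100) for one node, plus its weight.
type PluginScore struct {
	Plugin string
	Score  int64
	Weight int64
}

// weightedTotal sums score * weight over all plugins for a node.
func weightedTotal(scores []PluginScore) int64 {
	var sum int64
	for _, s := range scores {
		sum += s.Score * s.Weight
	}
	return sum
}

// pickNode returns the node with the highest weighted total.
func pickNode(perNode map[string][]PluginScore) (string, int64) {
	names := make([]string, 0, len(perNode))
	for name := range perNode {
		names = append(names, name)
	}
	sort.Strings(names) // deterministic iteration order for the example
	best, bestTotal := "", int64(-1)
	for _, name := range names {
		if t := weightedTotal(perNode[name]); t > bestTotal {
			best, bestTotal = name, t
		}
	}
	return best, bestTotal
}

// exampleScores builds made-up scores for three nodes; spreadWeight is the
// weight given to PodTopologySpread, all other plugins have weight 1.
func exampleScores(spreadWeight int64) map[string][]PluginScore {
	return map[string][]PluginScore{
		"node-3":  {{"NodeResourcesFit", 72, 1}, {"ImageLocality", 85, 1}, {"PodTopologySpread", 30, spreadWeight}},
		"node-7":  {{"NodeResourcesFit", 78, 1}, {"ImageLocality", 0, 1}, {"PodTopologySpread", 86, spreadWeight}},
		"node-12": {{"NodeResourcesFit", 68, 1}, {"ImageLocality", 0, 1}, {"PodTopologySpread", 90, spreadWeight}},
	}
}

func main() {
	node, total := pickNode(exampleScores(1))
	fmt.Println("equal weights:", node, total)
	node, total = pickNode(exampleScores(2))
	fmt.Println("spread weighted x2:", node, total)
}
```

Doubling one plugin's weight is enough to flip the winner here, which is why weights belong in a per-profile configuration rather than in plugin code.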

The naive Score loop is O(plugins · feasible nodes), which is fine for clusters of dozens. At 5,000 nodes it is ruinous: a single Pod's score pass can take seconds, and scheduling cycles run one at a time. The percentageOfNodesToScore setting is the optimisation that makes large clusters tractable. Clusters under 100 nodes are always evaluated in full; above that, the default is adaptive, starting near 50% and falling as the cluster grows to a floor of 5% (the formula is max(5, 50 − clusterSize / 125)). It is a Filter-time setting: the scheduler stops walking nodes as soon as it has found that many feasible candidates, and Score runs only on those.
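The formula is easier to trust once it is run against a few cluster sizes. This sketch is modelled on my reading of the upstream scheduler logic rather than copied from it: feasibleNodesToFind is a hypothetical name, and the 100-node minimum (small clusters are evaluated in full, and the search never stops below 100 feasible nodes) is an assumption drawn from that reading.

```go
package main

import "fmt"

const (
	minFeasibleNodes      = 100 // assumed floor on the number of feasible nodes to find
	minFeasiblePercentage = 5   // floor for the adaptive percentage
)

// feasibleNodesToFind returns how many feasible nodes Filter looks for before
// it stops. percentage == 0 means "use the adaptive default".
func feasibleNodesToFind(percentage, clusterSize int) int {
	if clusterSize < minFeasibleNodes || percentage >= 100 {
		return clusterSize
	}
	if percentage == 0 {
		percentage = 50 - clusterSize/125 // integer division, as in the formula
		if percentage < minFeasiblePercentage {
			percentage = minFeasiblePercentage
		}
	}
	n := clusterSize * percentage / 100
	if n < minFeasibleNodes {
		return minFeasibleNodes
	}
	return n
}

func main() {
	for _, size := range []int{50, 300, 1000, 5000, 10000} {
		fmt.Printf("cluster=%5d adaptive default -> find %d feasible nodes\n",
			size, feasibleNodesToFind(0, size))
	}
	fmt.Println("cluster=300 at explicit 5% ->", feasibleNodesToFind(5, 300))
}
```

Under these assumptions an explicit 5% on a 300-node cluster still searches for 100 feasible nodes, because the minimum dominates the percentage.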

The trade-off is sampling bias. The scheduler does not restart its walk from the top of the node list on each cycle; it resumes where the previous cycle stopped, so over many cycles every node is considered about equally often, but for a single Pod you might miss a node with an ImageLocality bonus that would have flipped the placement. In practice this is acceptable because the difference between a good and a great placement is usually in the noise of container start time. If you have a workload where placement quality matters more than latency (long-running batch jobs, ML training) you can tune percentageOfNodesToScore upward in a dedicated profile, accepting slower scheduling for that schedulerName.

# scheduler logs at -v=4, in chronological order; the "Score result" lines tell you what Score actually computed
I0501 09:14:22.811  4 schedule_one.go:148] "Scheduling pod"
    pod="prod/web-7d8"
I0501 09:14:22.812  4 generic_scheduler.go:480] "Filter passed"
    pod="prod/web-7d8" feasibleNodes=14 totalNodes=312 sampledFraction=0.05
I0501 09:14:22.815  4 generic_scheduler.go:540] "Score result" pod="prod/web-7d8"
    node="node-3"  total=187 NodeResourcesFit=72 ImageLocality=85 PodTopologySpread=30
    node="node-7"  total=164 NodeResourcesFit=78 ImageLocality=0  PodTopologySpread=86
    node="node-12" total=158 NodeResourcesFit=68 ImageLocality=0  PodTopologySpread=90
I0501 09:14:22.816  4 generic_scheduler.go:565] "Selected node"
    pod="prod/web-7d8" node="node-3" reason="highest score"
I0501 09:14:22.881  4 schedule_one.go:225] "Successfully bound pod to node"
    pod="prod/web-7d8" node="node-3"

The log output above reads like a small scoreboard, and that is exactly what it is. In a healthy cluster you can answer "why was this Pod scheduled here" in one grep over the scheduler log. If the scheduler is run with -v=2, only the success line is logged; bumping to -v=4 gives the per-plugin breakdown, and the cost is roughly doubled log volume but no measurable scheduling latency increase. Most production teams default to -v=2 and bump to 4 only when investigating.

A subtlety worth holding for capacity planners: the scheduler's scoring is local, not global. It picks the best node it has seen, not the best node in the cluster. With percentageOfNodesToScore at 5% on a 5,000-node cluster, it scores 250 nodes and binds. There is no cross-Pod optimisation, no "what if I had scheduled the next ten Pods together" planner. If you want bin-packing-quality placement, you need a custom scheduler that batches across pods, like capacity-scheduling or a third-party scheduler like Volcano. The default scheduler is a greedy single-shot placer, and that is by design: it has to make every decision in tens of milliseconds.

Tuning rule: if you are seeing scheduling latency above 200ms p99, look at three things in order: scheduler queue depth (saturated?), percentageOfNodesToScore (too high for cluster size?), and informer cache freshness (stale Node objects mean re-fetch). The scheduler's metrics surface all three at :10259/metrics.

Bind: writing nodeName via the /binding subresource.

The output of every successful scheduling cycle is a single API call: a POST to /api/v1/namespaces/{ns}/pods/{name}/binding. The Bind subresource is unusual: it accepts a Binding object with one meaningful field, target.name, the chosen node, and the api-server's handler for it does exactly one thing: UPDATE the Pod, setting spec.nodeName to that target. It cannot be used to set anything else; there is no "schedule this Pod with this annotation" composite verb. Scheduling and editing are separate.

The reason Bind is a subresource and not a normal Pod update is access control. The default RBAC role for the kube-scheduler identity grants create on pods/binding, but not update on pods. This means the scheduler can place a Pod onto a node, but cannot edit its spec: not its image, not its environment variables, not its annotations. (It does hold delete on pods, which preemption needs, and update on pods/status.) The blast radius of a compromised scheduler is therefore mostly "wrong placement", which the cluster recovers from by rescheduling. That is a much smaller compromise than "the scheduler can mutate any Pod".

# What the scheduler sends. A real wire capture, simplified.
POST /api/v1/namespaces/prod/pods/web-7d8/binding HTTP/2
Authorization: Bearer eyJhbGciOiJSUzI1NiIs...
Content-Type: application/json

{
  "apiVersion": "v1",
  "kind": "Binding",
  "metadata": { "name": "web-7d8", "namespace": "prod" },
  "target": {
    "apiVersion": "v1",
    "kind": "Node",
    "name": "node-3"
  }
}

# Server response: 201 Created, with a small Status object (status: Success) as the body.
# Internally, the api-server has issued an etcd Txn that updates /registry/pods/prod/web-7d8
# with spec.nodeName = "node-3" and bumped resourceVersion.

Once Bind succeeds, the scheduler considers the Pod placed and removes it from its internal queue. The api-server, having committed the update, fans the change out via watch. Every watcher of Pods receives a MODIFIED event for web-7d8 with spec.nodeName: node-3 set. The scheduler itself ignores the event (it has already moved on). The kubelet on node-3, whose watch is filtered to spec.nodeName == node-3, sees a new pod arrive in its responsibility set. Other kubelets receive nothing, because their watch filters do not match.

This is also where preemption interacts with Bind. If Filter cannot find any feasible node for a high-priority Pod, the PostFilter extension point runs the DefaultPreemption plugin. It walks lower-priority Pods on infeasible nodes, tries to find a victim set whose deletion would make at least one node feasible, and then issues two writes: a delete of the victims and an update that records the chosen node in the preemptor's status.nominatedNodeName. The Bind does not happen in that cycle: the Pod goes back to the queue and is bound by a later cycle, once the victims have actually gone. Victims get their normal graceful termination, which is why preemption can take a noticeable second or more of wall time.

Practical hint: you can simulate the scheduler's Bind manually with kubectl create against the binding subresource, which is how you pin a Pending Pod to a node when the scheduler is broken. Useful for incident response; never use it routinely. The Pod is still owned by whatever ReplicaSet created it; you are just doing the scheduler's job by hand.

The kubelet sync loop: observe, diff, act.

When the kubelet on the chosen node sees the watch event with its own name in spec.nodeName, the Pod has officially been delivered. From this point the scheduler is irrelevant; everything that happens next is local to the node and driven by the kubelet's main loop, called the SyncLoop. The loop runs continuously, waking at least once a second, and processes events from four sources: the api-server watch, a file-watch on /etc/kubernetes/manifests/ (for static pods), an HTTP-pull source (legacy), and a periodic re-sync timer. All four feed into a single channel of PodUpdate events.

Each iteration of the loop reconciles one Pod. The kubelet computes the desired state (the Pod spec it last saw) and the actual state, what the container runtime says is running, queried via CRI. It diffs them, generating a list of operations: containers to create, containers to start, containers to kill, volumes to mount, probes to start. The operations are then executed sequentially in SyncPod, which is the heart of the kubelet. Static pods (those in /etc/kubernetes/manifests/, used to bootstrap the control plane on the kubeadm path) follow the same code path, with the file-watch substituting for the api-server watch.

The SyncLoop is intentionally pull-based and idempotent. The kubelet does not maintain a queue of "things to do next"; on every iteration, it asks "what is the current state of this Pod, and what does its spec say it should be", and acts on the diff. If the kubelet is killed mid-operation and restarted, the next iteration will see whatever progress was made, diff against the spec, and continue. Crash recovery is automatic. This is the same control-loop pattern every Kubernetes controller follows, and it is the reason kubelet crashes are usually invisible to the workloads on the node: the containers were running before the kubelet died, and the runtime kept them running while the kubelet was down.
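The diff-and-act pattern is small enough to demonstrate without a runtime. The sketch below is a toy: container state is a map of name to "is running", and diff and apply are hypothetical helpers, not kubelet functions. Running diff a second time after apply yields nothing to do, which is the idempotence the paragraph above relies on.

```go
package main

import (
	"fmt"
	"sort"
)

// diff compares the desired container names against the observed runtime
// state and returns the containers to start and the containers to kill.
func diff(desired []string, running map[string]bool) (start, kill []string) {
	want := make(map[string]bool, len(desired))
	for _, name := range desired {
		want[name] = true
		if !running[name] {
			start = append(start, name)
		}
	}
	for name, up := range running {
		if up && !want[name] {
			kill = append(kill, name)
		}
	}
	sort.Strings(start)
	sort.Strings(kill)
	return start, kill
}

// apply performs the operations against the fake runtime state.
func apply(running map[string]bool, start, kill []string) {
	for _, name := range start {
		running[name] = true
	}
	for _, name := range kill {
		delete(running, name)
	}
}

func main() {
	desired := []string{"web", "sidecar"}
	running := map[string]bool{"web": true, "old-agent": true}

	start, kill := diff(desired, running)
	fmt.Println("first pass: start", start, "kill", kill)

	apply(running, start, kill)
	start, kill = diff(desired, running)
	fmt.Println("second pass: operations left =", len(start)+len(kill))
}
```

Because each pass starts from observed state rather than from a remembered to-do list, a crash between diff and apply loses nothing: the next pass recomputes the same operations.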

// pseudocode of the kubelet sync loop, simplified from pkg/kubelet/kubelet.go
// real source: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet.go#L2168

func (kl *Kubelet) syncLoop(ctx context.Context, updates <-chan kubetypes.PodUpdate) {
    plegCh := kl.pleg.Watch()                // container lifecycle events from CRI
    syncTicker := time.NewTicker(1 * time.Second)
    defer syncTicker.Stop()
    housekeepTicker := time.NewTicker(2 * time.Second)
    defer housekeepTicker.Stop()

    for {
        select {
        case u := <-updates:
            // api-server watch fired (or static-pod file changed);
            // dispatch on the operation instead of calling every handler
            switch u.Op {
            case kubetypes.ADD:
                kl.HandlePodAdditions(u.Pods)
            case kubetypes.UPDATE:
                kl.HandlePodUpdates(u.Pods)
            case kubetypes.REMOVE:
                kl.HandlePodRemoves(u.Pods)
            }
        case e := <-plegCh:
            // container died/started: re-sync that pod against its spec
            if pod, ok := kl.podManager.GetPodByUID(e.ID); ok {
                kl.HandlePodSyncs([]*v1.Pod{pod})
            }
        case <-syncTicker.C:
            // every 1s, re-sync any pod whose status is stale
            kl.HandlePodSyncs(kl.getPodsToSync())
        case <-housekeepTicker.C:
            // every 2s, GC dead containers, prune images, clean orphan volumes
            kl.HandlePodCleanups(ctx)
        case <-ctx.Done():
            return
        }
    }
}

// SyncPod is reached from every Handle*: the same idempotent diff-and-act path.
// The signature is simplified; the real one also reports whether the pod is terminal.
func (kl *Kubelet) SyncPod(ctx context.Context, updateType kubetypes.SyncPodType,
    pod, mirrorPod *v1.Pod, podStatus *kubecontainer.PodStatus) error {
    // 1. compute desired state from spec
    // 2. ensure pod sandbox exists (CRI: RunPodSandbox)
    // 3. ensure init containers in order (CRI: CreateContainer + StartContainer)
    // 4. ensure regular containers (CRI: same)
    // 5. start probes; report status back to api-server
    return nil
}

Operational pattern: the kubelet's heartbeat (renewal of its Lease object in the kube-node-lease namespace, backed by periodic Node status updates) is the canonical "is this kubelet still alive" signal for the controller-manager's NodeLifecycle controller. If it stops, after the node-monitor-grace-period (default 40s) the Node is marked NotReady, and after tolerationSeconds for node.kubernetes.io/not-ready (default 300s) the Pods on it are deleted by taint-based eviction, which is distinct from the kubelet-level eviction in Part 09.

CRI calls: RunPodSandbox, CreateContainer, StartContainer.

The Container Runtime Interface (CRI) is a gRPC interface that the kubelet uses to talk to whichever container runtime is installed on the node, almost always containerd in modern installs, occasionally CRI-O. The runtime exposes the interface on a Unix socket, typically /run/containerd/containerd.sock, and the kubelet connects with a gRPC client at start-up. CRI was introduced in 1.5 to break a hard dependency on Docker; today, no Kubernetes binary contains code that knows specifically how to start a container. Everything is delegated.

CRI splits the world into two services. RuntimeService handles execution: pod sandboxes, containers, exec, attach, port-forward, stats. ImageService handles images: pull, list, status, remove. The kubelet calls both, but the order matters. To start a Pod, the kubelet first calls RunPodSandbox on RuntimeService, which creates the network namespace, the cgroup, and a "pause" container that holds the namespace open. Then for each container it calls PullImage on ImageService (skipped if the image is already present), CreateContainer on RuntimeService (which prepares the container's filesystem and config), and finally StartContainer (which actually execs the binary).

The pod sandbox is the architectural unit that makes a Pod a Pod. It is one network namespace, one PID namespace (optional), and one cgroup hierarchy that all the pod's containers share. The pause container is a tiny C binary whose only job is to call pause(2) in a loop and hold those namespaces open while the other containers come and go. When you see a process called /pause on a node, that is what it is. The pause image is an ordinary image: with containerd it is named by the sandbox image setting in the CRI plugin configuration and is pulled like any other image the first time a sandbox starts on the node, so air-gapped clusters must mirror it.

// CRI v1, simplified from k8s.io/cri-api/pkg/apis/runtime/v1
service RuntimeService {
    rpc RunPodSandbox(RunPodSandboxRequest) returns (RunPodSandboxResponse);
    rpc CreateContainer(CreateContainerRequest) returns (CreateContainerResponse);
    rpc StartContainer(StartContainerRequest) returns (StartContainerResponse);
    rpc StopContainer(StopContainerRequest) returns (StopContainerResponse);
    rpc RemoveContainer(RemoveContainerRequest) returns (RemoveContainerResponse);
    rpc Exec(ExecRequest) returns (ExecResponse);
    // ... ~30 RPCs total
}

service ImageService {
    rpc PullImage(PullImageRequest) returns (PullImageResponse);
    rpc ListImages(ListImagesRequest) returns (ListImagesResponse);
    rpc ImageStatus(ImageStatusRequest) returns (ImageStatusResponse);
    rpc RemoveImage(RemoveImageRequest) returns (RemoveImageResponse);
}

# A real RunPodSandbox request, captured with crictl --runtime-endpoint=... --debug
RunPodSandboxRequest {
    config {
        metadata { name: "web-7d8" namespace: "prod" uid: "5af1..." attempt: 0 }
        hostname: "web-7d8"
        log_directory: "/var/log/pods/prod_web-7d8_5af1..."
        dns_config { servers: ["10.96.0.10"] searches: ["prod.svc.cluster.local"] }
        linux {
            cgroup_parent: "/kubepods/burstable/pod5af1..."
            security_context { namespace_options: { network: POD pid: CONTAINER } }
        }
    }
}
# containerd response: { pod_sandbox_id: "9e3a..." } — a 64-char hex string
# at this point the network namespace exists, CNI has been called, the pod has its IP

CNI is called inside RunPodSandbox, not separately. The runtime (containerd) reads the CNI config from /etc/cni/net.d/, picks the configured plugin, and exec's the binary in /opt/cni/bin/ with a JSON payload on stdin describing the new sandbox's network namespace. The plugin (Cilium, Calico, Flannel) does its work (allocates an IP, creates a veth pair, plumbs routes) and prints the result on stdout. Containerd parses the result, attaches it to the sandbox, and returns. From the kubelet's perspective, this is all hidden behind one gRPC call.

Image pulls are the most common slow step in pod start-up, and the most operationally visible. The kubelet emits Pulling and Pulled events for each image, which is why kubectl describe pod on a Pending pod often shows "Pulling image": the pod is in Pending because the runtime is still streaming layers from the registry. Image pull policy is Always, IfNotPresent (default for non-:latest tags), or Never. Pull credentials come from imagePullSecrets on the Pod or from node-level credential providers configured in the kubelet.

Diagnostic: crictl ps, crictl pods, crictl images, crictl logs are the kubelet's CRI client surface, dropped into a CLI. When kubelet and api-server disagree about what is on a node, crictl is the source of truth: it asks the runtime directly. Keep it on every node alongside kubectl.

Probes: liveness, readiness, startup.

Once a container is started, the kubelet starts running its probes. There are three kinds, with different meanings, and they fire on independent timers. A livenessProbe answers "is this container's process healthy". A readinessProbe answers "is this container ready to take traffic". A startupProbe answers "has this container finished its slow start-up sequence yet". They share a wire format (HTTP GET, TCP connect, gRPC health, or exec command) but the kubelet acts on each one differently, and getting that difference wrong is the most common cause of self-induced production outages in Kubernetes.

Liveness probes restart containers. If a liveness probe fails for failureThreshold consecutive iterations, the kubelet sends SIGTERM to the container, waits terminationGracePeriodSeconds, then SIGKILL, and asks CRI to remove and recreate the container. The Pod stays in place; the Pod is not rescheduled. Liveness is a within-Pod restart loop. The classic misuse is to put liveness on a slow endpoint that hits the database: when the database is loaded, liveness fails, the container restarts, the cold start hits the database harder, and you have built a denial-of-service against yourself.

Readiness probes flip the PodReady condition. When readiness passes, PodReady=True and the EndpointSlice controller adds the Pod's IP to the Service's EndpointSlice; kube-proxy reads that and starts forwarding traffic. When readiness fails, the EndpointSlice removes the IP and traffic stops within one EndpointSlice update cycle (typically under a second). Readiness does not restart the container. It is the right place to express "I am alive, but currently overwhelmed" — during a heavy GC pause, while a cache warms, while a Kafka rebalance completes. Failed readiness sheds load without flapping the container.

Startup probes were added in 1.16 to disentangle "long initial start-up" from "ongoing health". Before startup probes, you had to set livenessProbe.initialDelaySeconds high enough to cover the worst-case start, which made a hung container take that long to be killed. With a startup probe, liveness and readiness are suspended until startup passes; once it does, liveness takes over with its tighter timing. Use startup for any container that takes more than five seconds to come up — JVMs, large Python imports, database engines.

# Probe configuration that gets the trade-offs right.
# The startup probe gives the JVM 90s; once started, liveness restarts on hang, readiness sheds load.
spec:
  containers:
    - name: app
      image: ghcr.io/acme/app:1.4.2
      startupProbe:
        httpGet: { path: /health/started, port: 8080 }
        periodSeconds: 5
        failureThreshold: 18    # 18 × 5s = 90s budget
      livenessProbe:
        httpGet: { path: /health/live, port: 8080 }
        periodSeconds: 10
        timeoutSeconds: 2
        failureThreshold: 3
      readinessProbe:
        httpGet: { path: /health/ready, port: 8080 }
        periodSeconds: 3
        timeoutSeconds: 1
        failureThreshold: 2

Two patterns worth holding. First, liveness and readiness should hit different endpoints. Live should report only "is the process able to serve at all". Ready should report "is the process able to serve right now". A common bug is a single /health endpoint used for both, which means a transient downstream timeout that fails readiness also fails liveness, restarting the container when it should have been left alone. Second, every probe timeout should sit comfortably above the slowest expected response under load. Liveness with a 1-second timeout against a service whose P99 is 800ms will, sooner or later, restart your fleet during a traffic spike.
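
The arithmetic behind those budgets fits in a few lines. This is a sketch, not the probe generator's code: the Probe class and both helpers are hypothetical, and the window is the same periodSeconds × failureThreshold approximation used in the YAML comment above, which ignores per-probe timeouts and scheduling jitter.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """The three probe fields the budget math needs (names are illustrative)."""
    period_seconds: int
    failure_threshold: int
    timeout_seconds: int = 1


def failure_window_seconds(probe: Probe) -> int:
    """Approximate time a container can keep failing before the probe gives up."""
    return probe.period_seconds * probe.failure_threshold


def timeout_headroom(probe: Probe, p99_seconds: float) -> float:
    """Ratio of probe timeout to P99 latency; close to 1.0 means almost no slack."""
    return probe.timeout_seconds / p99_seconds


startup = Probe(period_seconds=5, failure_threshold=18)
liveness = Probe(period_seconds=10, failure_threshold=3, timeout_seconds=2)
readiness = Probe(period_seconds=3, failure_threshold=2, timeout_seconds=1)
```

With the values from the example block, startup allows a 90-second window, liveness 30 seconds, and readiness 6 seconds. A 1-second timeout against an 800ms P99 leaves a headroom ratio of only about 1.25, which is the fleet-restart scenario described above.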

Generator: the probe generator takes a P95/P99 SLO and a desired restart sensitivity and emits a probe block whose thresholds satisfy both, including the math for periodSeconds × failureThreshold.

Eviction: the kubelet's decision tree under pressure.

Eviction is the last act in many Pods' lifecycle, and it is where the kubelet stops being a reconciler and starts being a resource arbiter. Two distinct mechanisms share the name. API-initiated eviction is a POST to /api/v1/namespaces/{ns}/pods/{name}/eviction; it is what kubectl drain uses, and it respects PodDisruptionBudgets. Node-pressure eviction is the kubelet, on its own, killing pods because the node is running out of memory or disk. These two are unrelated in code and in policy; this section is about the second.

The kubelet samples a set of eviction signals on its housekeeping tick; the four that matter most are memory.available, nodefs.available, nodefs.inodesFree, and imagefs.available. Each can be given a soft threshold (with a grace period) and a hard threshold (immediate). When a soft threshold is breached for longer than its grace period, or a hard threshold is breached at all, the kubelet enters eviction mode. It picks one or more victim Pods, asks the runtime to kill them, and records an Event. The user sees a Pod with status reason Evicted.
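
The soft-versus-hard rule can be sketched as a pure function over one sample. This is an illustration under stated assumptions, not the kubelet's implementation: Threshold and evaluate are hypothetical names, values are in MiB, and the caller is assumed to carry the "soft breach first seen at" timestamp from one housekeeping tick to the next.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Threshold:
    """One signal's thresholds, in the signal's own unit (MiB in this sketch)."""
    signal: str
    hard: Optional[float] = None      # a breach triggers eviction immediately
    soft: Optional[float] = None      # a breach must persist for the grace period
    soft_grace_seconds: float = 0.0


def evaluate(th: Threshold, available: float, now: float,
             soft_since: Optional[float]) -> Tuple[bool, Optional[float]]:
    """Return (evict_now, soft_since) for one housekeeping sample.

    soft_since is when the soft threshold was first observed breached,
    or None while it is not breached.
    """
    soft_breached = th.soft is not None and available < th.soft
    if soft_breached:
        started = now if soft_since is None else soft_since
    else:
        started = None  # recovery clears the grace-period clock
    if th.hard is not None and available < th.hard:
        return True, started
    if soft_breached:
        return (now - started) >= th.soft_grace_seconds, started
    return False, None
```

A node that dips below the soft threshold and recovers before the grace period runs out never evicts anything; a node that crosses the hard threshold evicts on the very next sample.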

Victim selection is the interesting part. The kubelet sorts pods by, in order: whether their usage of the pressed resource exceeds their request (in practice BestEffort pods and Burstable pods over their request go first; Guaranteed pods go last), pod priority (lower goes first), and how far their usage of the pressed resource exceeds the request. So a BestEffort pod using 500MiB of memory on a memory-pressed node will be killed before a Guaranteed pod using 5GiB. This is the QoS class system in action: requests and limits define the QoS class, and they are exactly what gives the kubelet its ordering under pressure.
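
The ordering can be expressed as a three-part sort key. This is a simplified sketch for memory pressure only: PodUsage and eviction_order are hypothetical names, and the third key assumes usage is measured relative to the request, which is how the upstream node-pressure eviction documentation describes it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PodUsage:
    """Per-pod inputs to the ranking (memory only, in MiB)."""
    name: str
    priority: int
    request_mib: int   # 0 for a BestEffort pod
    usage_mib: int


def eviction_order(pods: List[PodUsage]) -> List[PodUsage]:
    """Return pods sorted so that the first element is evicted first."""
    def key(p: PodUsage):
        exceeds_request = p.usage_mib > p.request_mib
        return (
            not exceeds_request,               # pods over their request sort first
            p.priority,                        # then lower priority first
            -(p.usage_mib - p.request_mib),    # then the largest overshoot first
        )
    return sorted(pods, key=key)


pods = [
    PodUsage("guaranteed-db", priority=0, request_mib=5120, usage_mib=5120),
    PodUsage("burstable-api", priority=1000, request_mib=256, usage_mib=1024),
    PodUsage("besteffort-a", priority=0, request_mib=0, usage_mib=500),
    PodUsage("besteffort-b", priority=0, request_mib=0, usage_mib=800),
]
ranked = [p.name for p in eviction_order(pods)]
```

Here the two BestEffort pods go first (the larger consumer ahead of the smaller), the high-priority Burstable pod follows even though its overshoot is large, and the Guaranteed pod using 5GiB is last.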

# /var/lib/kubelet/config.yaml — what eviction thresholds actually look like
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available:   "100Mi"
  nodefs.available:   "10%"
  nodefs.inodesFree:  "5%"
  imagefs.available:  "15%"
evictionSoft:
  memory.available:   "500Mi"
  nodefs.available:   "15%"
evictionSoftGracePeriod:
  memory.available:   "1m30s"
  nodefs.available:   "2m"
evictionMaxPodGracePeriod: 60
evictionMinimumReclaim:
  memory.available:   "200Mi"
  nodefs.available:   "500Mi"
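
evictionMinimumReclaim is the least obvious field in that block: once a threshold trips, the kubelet keeps reclaiming until the signal is back above the threshold plus the minimum reclaim, which stops it from oscillating around the line. A minimal sketch of that arithmetic follows. The helper names are hypothetical, and the quantity parser is deliberately simplified to the %, Mi and Gi forms; real Kubernetes quantities accept more suffixes.

```python
def resolve_mib(value: str, capacity_mib: float) -> float:
    """Turn a threshold such as "10%" or "100Mi" into MiB (simplified parser)."""
    if value.endswith("%"):
        return capacity_mib * float(value[:-1]) / 100.0
    if value.endswith("Mi"):
        return float(value[:-2])
    if value.endswith("Gi"):
        return float(value[:-2]) * 1024.0
    raise ValueError(f"unsupported quantity: {value!r}")


def reclaim_target_mib(threshold: str, minimum_reclaim: str,
                       capacity_mib: float) -> float:
    """Level the signal must climb back to before eviction stops."""
    return (resolve_mib(threshold, capacity_mib)
            + resolve_mib(minimum_reclaim, capacity_mib))


def amount_to_free_mib(available_mib: float, threshold: str,
                       minimum_reclaim: str, capacity_mib: float) -> float:
    """How much still has to be reclaimed, never negative."""
    target = reclaim_target_mib(threshold, minimum_reclaim, capacity_mib)
    return max(0.0, target - available_mib)
```

With the hard memory threshold of 100Mi and a minimum reclaim of 200Mi, a node sitting at 80Mi available has to free 220Mi, not 20Mi, before the kubelet stops evicting.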

The interaction with the OOM killer is worth pulling out. The kubelet's eviction is userspace; it polls every ten seconds and reacts to thresholds. The kernel's OOM killer is kernel-space; it fires when allocations actually fail. They both kill containers, but the kubelet kills pods proactively (before allocations fail) while the OOM killer kills processes reactively. On a sharp memory spike, the kernel wins the race; you will see a container with exit code 137 and reason OOMKilled, no kubelet eviction event, and the container restarted by the kubelet under the Pod's normal restart policy. The eviction thresholds exist so the kubelet wins more races, by freeing memory before the kernel runs out.
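
Telling the two apart after the fact comes down to where the evidence lands in the Pod status. The sketch below works on a Pod object as a plain dict, in the shape `kubectl get pod -o json` prints; classify_termination is a hypothetical helper, and it only distinguishes the two cases discussed here.

```python
SIGKILL_EXIT_CODE = 128 + 9  # 137: the process was terminated by signal 9


def classify_termination(pod: dict) -> str:
    """Return 'kubelet-eviction', 'kernel-oom-kill', or 'other' for a Pod dict."""
    status = pod.get("status") or {}
    # Node-pressure eviction marks the whole Pod as Failed with reason Evicted.
    if status.get("phase") == "Failed" and status.get("reason") == "Evicted":
        return "kubelet-eviction"
    # A kernel OOM kill shows up per container, in the current or last state.
    for cs in status.get("containerStatuses") or []:
        for field in ("state", "lastState"):
            terminated = (cs.get(field) or {}).get("terminated")
            if terminated and terminated.get("reason") == "OOMKilled":
                return "kernel-oom-kill"
    return "other"
```

An OOM-killed container typically leaves the Pod itself Running with an incremented restart count, which is why the check has to look inside containerStatuses rather than at status.reason.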

The eviction simulator visualises the whole tree against a real workload mix — see /simulators/pod-eviction/ for a live decision-tree walker, including the per-QoS victim ranking and the soft-vs-hard threshold dynamics. The simulator runs the same algorithm as the kubelet's eviction_manager.go, simplified for visibility.

One last subtlety: Pods evicted for node-pressure reasons are kept in the api-server with status.phase: Failed and reason Evicted until something cleans them up. The TTL controller does not touch them; you need either --terminated-pod-gc-threshold on the controller-manager (default 12,500 cluster-wide) or a periodic kubectl delete pod --field-selector=status.phase=Failed sweep. Forgetting this is how clusters end up with thousands of dead Evicted pods cluttering kubectl output.
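
If the sweep should target only evicted Pods rather than every Failed one, the filter is small. This is a sketch over the JSON that `kubectl get pods -A -o json` produces, loaded into a dict; evicted_pods is a hypothetical helper and it only selects candidates, leaving the actual deletion to whatever tooling you already use.

```python
from typing import List, Tuple


def evicted_pods(pod_list: dict) -> List[Tuple[str, str]]:
    """Return (namespace, name) for every Pod that is Failed with reason Evicted."""
    found = []
    for pod in pod_list.get("items") or []:
        status = pod.get("status") or {}
        if status.get("phase") == "Failed" and status.get("reason") == "Evicted":
            meta = pod.get("metadata") or {}
            found.append((meta.get("namespace", "default"), meta.get("name", "")))
    return found
```

Filtering on the reason as well as the phase keeps Pods that failed for other causes, such as a crashed Job attempt you still want to inspect, out of the sweep.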

Keep going.

The controller pattern

Informers, listers, work queues, reconciliation. The pattern every built-in and custom controller follows.

Architecture

Eight processes, one storage primitive — the control plane and data plane on one canvas.

The lifecycle of kubectl apply

Twelve hops from the keystroke to the running pod, named, timed, explained.

Back to the internals index

All twelve sub-pages — four live, eight planned — and the system on one canvas.
