
What happens between kubectl apply and a running container

Eleven control-plane events sit between typing the command and the container actually running. Each one is a small, precisely scoped piece of work, and not a single one is a step in a procedure.

Parts: 01–08 · Interactive: 11-step trace · Prereq: Containers / etcd

The Kubernetes control plane and what runs where

The API server, scheduler, controllers, and etcd, and their jobs.

Kubernetes pod creation is an eleven-step dance between the API server, etcd, the scheduler, the kubelet, and the container runtime. Getting from kubectl apply to a running container typically takes 5-30 seconds. Each step is a separate watch-driven event; understanding the sequence is the foundation of debugging Kubernetes.

Kubernetes' design is deceptively flat. There are only four kinds of components you need to know to understand pod creation: the API server, etcd, a fleet of controllers (the scheduler being one), and per-node kubelets. None of them call each other directly. Every interaction goes through the API server.

apiserver

Validates and admits every request

The only thing that reads or writes etcd. Authenticates, authorizes, mutates, validates, persists, and then broadcasts. Every controller is a watch client.

etcd

Stores the cluster’s desired state

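Strongly consistent key-value store, Raft-replicated across 3 or 5 members. Holds every cluster object. If etcd loses quorum, writes stop: the API server can no longer persist changes, so the cluster is effectively frozen at its last committed state while already-running containers keep running.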

controllers

Drive actual state toward desired

Independent loops watching specific resources and nudging reality toward the declared spec. Scheduler, deployment, replica-set, endpoint, kubelet — all the same shape: watch + reconcile.
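The "3 or 5 members" in the etcd card follows from Raft's majority rule. A minimal sketch of that arithmetic (the function names are illustrative, not part of etcd):

```python
def quorum(members: int) -> int:
    """Smallest majority of a Raft cluster with `members` voting members."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority is still reachable."""
    return members - quorum(members)


for n in (3, 4, 5):
    print(f"{n} members: quorum {quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

A fourth member raises the quorum without raising the number of failures tolerated, which is why odd sizes are the usual choice.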


How the watch protocol keeps components in sync

Components watch the API server for changes instead of polling.

Every controller in Kubernetes opens a streaming HTTP connection to the API server with ?watch=1 and keeps it open, reconnecting whenever the server times the watch out. Each event (ADDED, MODIFIED, or DELETED) streams down as it happens. Restart a controller? It does a full LIST first to bootstrap, then starts the watch from the resourceVersion that LIST returned; after a dropped connection it resumes from the last resourceVersion it saw.

This is what makes Kubernetes feel reactive instead of polled. When the scheduler assigns a pod, the kubelet on that node knows within milliseconds — not because anyone told it, but because it was already watching, and the API server pushed the change down its open connection.
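The LIST-then-WATCH handshake can be sketched with a toy in-memory server. ToyApiServer, list_pods, and watch_pods are hypothetical stand-ins invented for this illustration; they are not the real client library, and a real watch is a long-lived stream rather than a function call.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    type: str              # "ADDED", "MODIFIED", or "DELETED"
    name: str
    resource_version: int


@dataclass
class ToyApiServer:
    """In-memory stand-in for the API server (illustrative only)."""
    version: int = 0
    objects: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def write(self, event_type: str, name: str) -> None:
        # Every write bumps the version and is recorded as a watch event.
        self.version += 1
        if event_type == "DELETED":
            self.objects.pop(name, None)
        else:
            self.objects[name] = self.version
        self.log.append(Event(event_type, name, self.version))

    def list_pods(self):
        """Return a snapshot plus the resourceVersion it is consistent with."""
        return dict(self.objects), self.version

    def watch_pods(self, since: int):
        """Return every event newer than `since`."""
        return [e for e in self.log if e.resource_version > since]


api = ToyApiServer()
api.write("ADDED", "web-1")
api.write("ADDED", "web-2")

# Bootstrap: one LIST, remembering the version it was taken at.
cache, last_seen = api.list_pods()

# Changes made after the LIST reach the controller only through the watch.
api.write("MODIFIED", "web-1")
api.write("DELETED", "web-2")

for event in api.watch_pods(since=last_seen):
    if event.type == "DELETED":
        cache.pop(event.name, None)
    else:
        cache[event.name] = event.resource_version
    last_seen = event.resource_version

print(cache, last_seen)
```

Because the watch starts from the version the LIST returned, no change is missed or applied twice between the snapshot and the stream.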

Field selectors keep it scalable

A 5,000-node cluster has 5,000 kubelets. None of them want every pod event in the cluster. Each kubelet asks for fieldSelector=spec.nodeName=node-X — the API server filters server-side, and only pods bound to that node stream to that kubelet. Bandwidth scales linearly, not quadratically.
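A rough sketch of why the selector matters, assuming every pod changes once and each node runs 30 pods (the 30 is an assumption chosen for the arithmetic, not a figure from this guide):

```python
def pods_for_node(pods, node_name):
    """What fieldSelector=spec.nodeName=<node_name> asks the server to return."""
    return [p for p in pods if p["spec"].get("nodeName") == node_name]


def deliveries(nodes: int, pods_per_node: int, filtered: bool) -> int:
    """Count pod events delivered to kubelets when every pod changes once."""
    total_pods = nodes * pods_per_node
    if filtered:
        # Each event goes only to the kubelet of the pod's own node.
        return total_pods
    # Without a selector, every kubelet receives every event.
    return total_pods * nodes


pods = [
    {"metadata": {"name": "web-a"}, "spec": {"nodeName": "node-1"}},
    {"metadata": {"name": "web-b"}, "spec": {"nodeName": "node-3"}},
    {"metadata": {"name": "web-c"}, "spec": {}},  # not scheduled yet
]
print([p["metadata"]["name"] for p in pods_for_node(pods, "node-3")])
print(deliveries(5000, 30, filtered=True))
print(deliveries(5000, 30, filtered=False))
```

Under these assumptions the filtered case delivers 150,000 events and the unfiltered case 750,000,000, a factor equal to the node count.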


The eleven events that create a pod, in order

From the API write to the container actually running.

The trace below follows a single kubectl apply -f pod.yaml from the command landing on the API server through the container running on node-3. Five lanes, eleven hops.

Legend: request / response · watch event · internal

Lanes
CLIENT · kubectl · ~/.kube/config
API SERVER · kube-apiserver · control-plane
STORE · etcd · consensus, durable
SCHEDULER · kube-scheduler · predicates + priorities
NODE AGENT · kubelet · node-3 · CRI + CNI

Steps
01 REQUEST · POST /api/v1/.../pods
02 INTERNAL · AuthN · AuthZ · Admission
03 REQUEST · put / pods / default / web-7d4c
04 RESPONSE · revision: 482137
05 RESPONSE · 201 Created · phase: Pending
06 WATCH · pod added (no nodeName)
07 INTERNAL · Predicates + priorities → node-3
08 REQUEST · PATCH spec.nodeName = "node-3"
09 WATCH · pod assigned to node-3
10 INTERNAL · CRI: pull image, create sandbox, start
11 REQUEST · PATCH status: phase=Running, podIP=10.244.3.42

ONE COMMAND · ELEVEN HOPS · ONE RUNNING POD
Step 01 of 11

Everything starts with one HTTP POST. kubectl is a thin client that turns YAML into JSON and pushes it to the API server. No scheduling, admission, or placement logic runs on the kubectl side; every such decision lives in the cluster.

kubectl  →  kube-apiserver
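As a rough illustration of step 01, here is the kind of JSON body a client could send for the pod in the trace. The container name and image are hypothetical; only the pod name and namespace come from the trace.

```python
import json

# A minimal Pod manifest as a client would hold it after parsing the YAML.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "web-7d4c", "namespace": "default"},
    "spec": {
        "containers": [
            # Hypothetical container name and image, for illustration only.
            {"name": "web", "image": "registry.example.com/web:1.0"}
        ]
    },
}

# Serialize to the JSON bytes that go on the wire.
body = json.dumps(pod).encode("utf-8")

decoded = json.loads(body)
print(decoded["kind"], decoded["metadata"]["name"])
print("nodeName" in decoded["spec"])
```

Note what the client does not send: there is no spec.nodeName and no status. Both are filled in later by the scheduler and the kubelet.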

How the scheduler picks a node: filter, then score

First rule out nodes that can't fit, then rank the ones that can.

The default scheduler is a small, well-defined program. Every unscheduled pod goes through two passes: filter (predicates that say no), then score (priorities that say better-or-worse). The highest-scoring node wins.

Predicates · hard filters

Is this node even possible?

Resource fit (does it have CPU/memory free?), volume attachment, taints, node selectors, affinity rules. Any predicate failure removes the node entirely. Nodes that fail here are invisible to the scoring pass.

Priorities · soft scoring

Among the eligible, which is best?

Least-requested (prefer less-loaded), image-locality (already-cached image saves a pull), topology-spread (keep replicas across zones), inter-pod affinity. Each scores 0–100; the weighted sum picks the winner.

This shape, filter then score, is what makes the scheduler pluggable. You can write a custom scheduler plugin and add it to either pass. The scheduling framework calls every plugin in order: a single filter plugin's rejection removes the node from consideration, while every score plugin contributes to the final ranking.
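The two passes can be sketched in a few lines. This is a toy model, not the scheduler's code: the Node and Pod fields, the two filters, the two scorers, the equal weights, and the assumed 4000m CPU capacity per node are all simplifications made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_cpu_m: int                      # free CPU, in millicores
    free_mem_mi: int                     # free memory, in MiB
    taints: set = field(default_factory=set)
    cached_images: set = field(default_factory=set)


@dataclass
class Pod:
    image: str
    cpu_m: int
    mem_mi: int
    tolerations: set = field(default_factory=set)


# Filters: any False removes the node from consideration.
def fits_resources(pod, node):
    return node.free_cpu_m >= pod.cpu_m and node.free_mem_mi >= pod.mem_mi


def tolerates_taints(pod, node):
    return node.taints <= pod.tolerations


# Scorers: each returns 0..100 (assumes every node has 4000m CPU capacity).
def least_requested(pod, node, capacity_cpu_m=4000):
    return 100 * (node.free_cpu_m - pod.cpu_m) // capacity_cpu_m


def image_locality(pod, node):
    return 100 if pod.image in node.cached_images else 0


FILTERS = [fits_resources, tolerates_taints]
SCORERS = [(least_requested, 1), (image_locality, 1)]  # (plugin, weight)


def schedule(pod, nodes):
    feasible = [n for n in nodes if all(f(pod, n) for f in FILTERS)]
    if not feasible:
        return None  # nothing fits: the pod stays Pending
    best = max(feasible, key=lambda n: sum(w * s(pod, n) for s, w in SCORERS))
    return best.name


nodes = [
    Node("node-1", free_cpu_m=200, free_mem_mi=4096),
    Node("node-2", free_cpu_m=3000, free_mem_mi=8192, taints={"dedicated=gpu"}),
    Node("node-3", free_cpu_m=2000, free_mem_mi=4096, cached_images={"web:1.0"}),
    Node("node-4", free_cpu_m=3500, free_mem_mi=8192),
]
pod = Pod(image="web:1.0", cpu_m=500, mem_mi=256)
print(schedule(pod, nodes))
```

Here node-1 fails resource fit and node-2 fails the taint check, so neither is scored. node-3 wins over the emptier node-4 because the cached image outweighs the extra free CPU under these weights.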


The kubelet: the agent that runs pods on each node

It takes the assigned pod spec and makes the container real.

The kubelet is just another controller: it watches pods and reconciles toward the desired state. What's special is what reconcile means here, because it has to make Linux processes happen. The kubelet itself never starts a container; it tells the container runtime to.

The runtime (containerd, CRI-O, historically Docker) speaks a gRPC interface called the Container Runtime Interface. Behind that interface, runc (or Kata Containers, gVisor, youki) actually clone()s the process into a fresh set of Linux namespaces, applies cgroups, mounts the rootfs, and execs the container command.
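The order in which the kubelet drives the runtime can be sketched with a fake that only records calls. The method names mirror the CRI RPCs (RunPodSandbox, PullImage, CreateContainer, StartContainer); FakeRuntime and sync_pod are hypothetical helpers for this illustration, not a gRPC client or kubelet code.

```python
class FakeRuntime:
    """Records calls in order; a stand-in for containerd or CRI-O."""

    def __init__(self):
        self.calls = []

    def run_pod_sandbox(self, pod_name):
        self.calls.append("RunPodSandbox")
        return f"sandbox-{pod_name}"

    def pull_image(self, image):
        self.calls.append("PullImage")
        return image

    def create_container(self, sandbox_id, name, image):
        self.calls.append("CreateContainer")
        return f"{sandbox_id}/{name}"

    def start_container(self, container_id):
        self.calls.append("StartContainer")


def sync_pod(runtime, pod):
    """Bring one pod up: sandbox first, then each container in turn."""
    sandbox_id = runtime.run_pod_sandbox(pod["name"])
    started = []
    for container in pod["containers"]:
        runtime.pull_image(container["image"])
        cid = runtime.create_container(sandbox_id, container["name"], container["image"])
        runtime.start_container(cid)
        started.append(cid)
    return started


runtime = FakeRuntime()
ids = sync_pod(runtime, {"name": "web-7d4c",
                         "containers": [{"name": "web", "image": "web:1.0"}]})
print(runtime.calls)
print(ids)
```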

  1. CRI

    RunPodSandbox

    Creates the pause container — a tiny, almost-empty process that holds the pod's network namespace. Every container in the pod joins this sandbox. The CNI plugin attaches the network here, and the IP belongs to the sandbox, not the workload.

  2. CRI

    PullImage · CreateContainer · StartContainer

    Three separate gRPC calls. The image is fetched and unpacked; a writable overlay is stacked on top of the read-only layers; the runtime starts the entrypoint inside the sandbox's namespaces. Standard output and standard error are captured by the runtime into log files on the node, and the kubelet serves them when a client reads the pod's log endpoint through the API server.

  3. CNI

    Network setup

    The CNI plugin (Calico, Cilium, AWS VPC CNI, Flannel) is invoked once per sandbox, as part of RunPodSandbox and before any workload container starts. It allocates an IP from the pod CIDR, creates a veth pair, and attaches one end to the sandbox's network namespace and the other to a host bridge / tunnel / eBPF program. The IP is returned to the kubelet for the status update.

  4. probes

    Liveness + readiness

    Once started, kubelet begins firing the configured probes. Readiness failure removes the pod from Service endpoints (no traffic) but leaves it running. Liveness failure kills and restarts the container. Wrong probe config is a common cause of "my pod keeps restarting."
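The different consequences of the two probes in the last item can be sketched as a small state update. This is a simplified model: it assumes both probes share one period and one failure threshold and that a single success is enough, and the ContainerState fields are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class ContainerState:
    ready: bool = False          # whether the pod receives Service traffic
    restarts: int = 0
    readiness_failures: int = 0
    liveness_failures: int = 0


def apply_probe_results(state, readiness_ok, liveness_ok, failure_threshold=3):
    """Apply one probe period's results to the container state."""
    if liveness_ok:
        state.liveness_failures = 0
    else:
        state.liveness_failures += 1
        if state.liveness_failures >= failure_threshold:
            # Liveness: the container is killed and restarted.
            state.restarts += 1
            state.liveness_failures = 0
            state.readiness_failures = 0
            state.ready = False
            return state

    if readiness_ok:
        state.readiness_failures = 0
        state.ready = True
    else:
        state.readiness_failures += 1
        if state.readiness_failures >= failure_threshold:
            # Readiness: traffic is withdrawn, the container keeps running.
            state.ready = False
    return state


state = ContainerState()
apply_probe_results(state, readiness_ok=True, liveness_ok=True)
for _ in range(3):
    apply_probe_results(state, readiness_ok=False, liveness_ok=True)
print(state.ready, state.restarts)
```

After three failed readiness checks the container is out of rotation but has not been restarted; only repeated liveness failures increment the restart count.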


Reconciliation loops never stop running

Controllers keep nudging actual state toward the state you declared.

The eleven steps you just walked through are not a procedure. None of them is "called" by anything. Each component watches some piece of state and does its small job whenever the watched state changes. Pod creation is just one well-defined trajectory through that watch graph.

When you delete a pod: kubelet sees the deletion event, gracefully terminates the container, removes the network attachment, deletes the sandbox. When a node dies: the node-lifecycle controller sees the heartbeat stop, marks the node NotReady, and eventually evicts the pods. The ReplicaSet controller notices the missing replicas and creates replacements, which the scheduler picks up: same loop, different starting state.

Why this design wins

Procedural systems (do A, then B, then C) fail when any step fails — there's no clean resumption point. Reconciliation has no procedure to fail. Drift from desired state, in any direction, is just the next reconcile loop's input.
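A minimal sketch of the idea, assuming a replica count as the desired state and a set of pod names as the actual state. The reconcile and converge functions and the web-N naming are illustrative, not controller code.

```python
import itertools


def _fresh_name(pods):
    """First web-N name not already in use."""
    return next(f"web-{i}" for i in itertools.count(1) if f"web-{i}" not in pods)


def reconcile(desired: int, pods: set) -> set:
    """One pass: compare actual with desired and take one corrective step."""
    if len(pods) < desired:
        return pods | {_fresh_name(pods)}
    if len(pods) > desired:
        return pods - {max(pods)}
    return pods


def converge(desired: int, pods: set, max_passes: int = 20) -> set:
    """Repeat reconcile until a pass changes nothing."""
    for _ in range(max_passes):
        next_pods = reconcile(desired, pods)
        if next_pods == pods:
            break
        pods = next_pods
    return pods


pods = converge(3, set())
print(sorted(pods))

# A node dies and takes one pod with it; the same loop repairs the drift.
pods = converge(3, pods - {"web-2"})
print(sorted(pods))
```

The loop never asks how the state came to be wrong. Too few pods, too many pods, or a crash halfway through an earlier pass all look the same to the next pass.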


Where pods get stuck, and how to debug it

The common failure states and what each one is telling you.

The eleven-hop trace is also a debugging map. When a pod is "not running," it has stalled at one of those steps — and each has a recognizable signature.

Pending · no node

Scheduler can't place it.

Stuck at step 7. Either no node has the requested resources, or every eligible node has a taint the pod doesn't tolerate, or a volume can't be attached. kubectl describe pod shows the scheduler events.

ImagePullBackOff

The node can’t pull the image

Stuck at step 10·CRI·PullImage. Wrong image name, wrong tag, missing imagePullSecret for a private registry, or registry is down. The kubelet retries with exponential backoff. The pod was scheduled fine.

CrashLoopBackOff

Container starts, then dies.

Past step 10. The container is starting and exiting — bad config, missing env var, wrong entrypoint, dependency unreachable. Logs are everything here. kubectl logs --previous reads the dead instance.
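The three signatures above can be turned into a small triage function over a pod's JSON (for example the output of kubectl get pod -o json). This is a sketch covering only the cases discussed here; the diagnose function and its messages are invented for the example.

```python
def diagnose(pod: dict) -> str:
    """Map a pod object to the step of the trace where it is likely stuck."""
    spec = pod.get("spec", {})
    status = pod.get("status", {})
    # Collect the waiting reasons reported for each container, if any.
    reasons = {
        cs.get("state", {}).get("waiting", {}).get("reason")
        for cs in status.get("containerStatuses", [])
    }

    if status.get("phase") == "Pending" and not spec.get("nodeName"):
        return "step 7: not scheduled; check kubectl describe pod"
    if reasons & {"ImagePullBackOff", "ErrImagePull"}:
        return "step 10: image pull failing; check image name, tag, pull secret"
    if "CrashLoopBackOff" in reasons:
        return "past step 10: container exits; check kubectl logs --previous"
    if status.get("phase") == "Running":
        return "running"
    return "unclassified: inspect events and container statuses"


stuck = {
    "spec": {"nodeName": "node-3"},
    "status": {
        "phase": "Pending",
        "containerStatuses": [
            {"state": {"waiting": {"reason": "ImagePullBackOff"}}}
        ],
    },
}
print(diagnose(stuck))
```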


Writing your own Kubernetes controller

The same watch-and-reconcile pattern is yours to use.

The control plane is not magic. It is a watch loop, a scheduler, and a per-node reconciler, and the watch loop is open to you. Define a Custom Resource Definition, write a controller that watches it and reconciles toward the spec, and you have just extended Kubernetes.

This is the operator pattern. cert-manager watches Certificate CRs and provisions TLS via ACME. prometheus-operator watches ServiceMonitor CRs and updates Prometheus configs. The Postgres operator watches Postgres CRs and runs an HA cluster. Every one of them is the same shape as the pod flow you just walked through.
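The shape of such a controller can be sketched without any Kubernetes library: event handlers update a local cache and enqueue a key, and a worker reconciles each key against the cached spec. ToyController, its fields, and the custom resource carrying a replica count are all hypothetical; a real operator would use a client library's informers and work queue.

```python
from collections import deque


class ToyController:
    """Watches a hypothetical custom resource and reconciles child pods."""

    def __init__(self):
        self.specs = {}       # cache of custom resources: name -> replicas
        self.children = {}    # actual state: name -> number of pods running
        self.queue = deque()
        self.reconciles = 0

    def on_event(self, event_type, name, replicas=0):
        # Handlers only update the cache and enqueue the key.
        if event_type == "DELETED":
            self.specs.pop(name, None)
        else:
            self.specs[name] = replicas
        if name not in self.queue:
            self.queue.append(name)

    def reconcile(self, name):
        # Reconcile reads the latest cached spec, not the event that woke it.
        self.reconciles += 1
        desired = self.specs.get(name)
        if desired is None:
            self.children.pop(name, None)    # resource gone: clean up
        else:
            self.children[name] = desired    # stand-in for creating pods

    def run_once(self):
        while self.queue:
            self.reconcile(self.queue.popleft())


controller = ToyController()
controller.on_event("ADDED", "blog", replicas=2)
controller.on_event("MODIFIED", "blog", replicas=3)
controller.run_once()
print(controller.children, controller.reconciles)
```

Two events for the same object collapse into one reconcile because the queue holds keys, not events. That is what lets a controller skip intermediate states and act only on the latest spec.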

To go deeper, see the K8s networking guide for what happens after the pod is running.

Pod startup latency: where the seconds go

The cold-start budget, broken down.

From kubectl apply to the container's first byte served, the typical pod takes 5-30 seconds. Where the time goes:

API server admission + etcd write
~50-200ms. Validates the spec, runs admission webhooks, writes to etcd, broadcasts the watch event.
Scheduler decision
~100-500ms. Filter and score nodes. Usually one of the faster hops; slowness here usually means an expensive scheduler plugin. Admission webhooks add their latency in the API server hop above, not here.
Image pull
~1-30s. Dominant when the image cache is cold. Can be cut with image streaming (containerd's lazy image pull, AWS SOCI) or by warming the local cache via a DaemonSet.
Container runtime sandbox creation
~100-500ms. The pause container, the network namespace, the cgroup tree, the storage mount. Fast unless you have many init containers.
CNI plugin
~50-300ms. Allocate the pod IP, plumb the routes. AWS VPC CNI was historically slow (~1-2s) on cold ENIs; prefix-delegation (2021) cut this to ~100ms.
Container start
~50ms-N seconds. Depends entirely on application startup. JVM cold start ~5-10s; Go binary ~50ms; Python with imports ~200-500ms.
Readiness probe success
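~initialDelaySeconds plus up to periodSeconds × successThreshold. With the default periodSeconds of 10 and successThreshold of 1, the first passing probe can land up to ~10s after the container starts, even if the container was up in 50ms. periodSeconds × failureThreshold (30s with defaults) is how long a failing container takes to be marked unready, not the time to become Ready. Generous delays and periods are a common 'why is my deploy slow' answer.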
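Adding up the ranges above gives a rough envelope. This sketch leaves out the readiness probe and assumes 10s as the upper bound for container start, taken from the JVM figure; both choices are assumptions of the example.

```python
# (hop, low_ms, high_ms), taken from the ranges listed above.
BUDGET = [
    ("API server admission + etcd write", 50, 200),
    ("Scheduler decision", 100, 500),
    ("Image pull", 1_000, 30_000),
    ("Container runtime sandbox creation", 100, 500),
    ("CNI plugin", 50, 300),
    ("Container start", 50, 10_000),  # upper bound assumed from the JVM figure
]

low_ms = sum(low for _, low, _ in BUDGET)
high_ms = sum(high for _, _, high in BUDGET)
dominant = max(BUDGET, key=lambda row: row[2])[0]

print(f"{low_ms / 1000:.2f}s to {high_ms / 1000:.1f}s, dominated by {dominant}")
```

The upper sum exceeds the typical 5-30 seconds because the worst cases of every hop rarely coincide. The useful takeaway is that image pull accounts for most of the spread.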

Scale issues. In rolling deploys of thousands of pods, etcd's update rate and the kubelet's image-pull throughput become bottlenecks. EKS/GKE clusters >5,000 nodes routinely measure pod-start times in minutes during big rollouts. Mitigations: pre-pull images via DaemonSet, use container-runtime image streaming, cap image size aggressively (Distroless and Alpine base images help).



A closing note

Kubernetes feels enormous from the outside, but the engine is small. One read-write path through the API server. A consistent store behind it. Watches, all the way down. Every concept that looks like new vocabulary is one of those four pieces in a different role. Internalize the trace and the rest is just labels for things you already understand. When you want to take each piece apart on its own, the Kubernetes internals notes walk the API server, scheduler, kubelet, and etcd one at a time.
