What happens between kubectl apply and a running container
Eleven control-plane events between you typing the command and the container actually running. Each one a small, precisely scoped piece of work — and not a single one is procedural.
The Kubernetes control plane and what runs where
The API server, scheduler, controllers, and etcd, and their jobs.
Kubernetes pod creation is an eleven-step dance between the API server, etcd, the scheduler, the kubelet, and the container runtime. Going from kubectl apply to a running container typically takes 5-30 seconds. Each step is a separate watch-driven event; understanding the sequence is the foundation of debugging Kubernetes.
Kubernetes' design is deceptively flat. There are only four kinds of components you need to know to understand pod creation: the API server, etcd, a fleet of controllers (the scheduler being one), and per-node kubelets. None of them call each other directly. Every interaction goes through the API server.
Validates and admits every request
The API server is the only component that reads or writes etcd. It authenticates, authorizes, mutates, validates, persists, and then broadcasts. Every controller is a watch client.
Stores the cluster’s desired state
etcd is a strongly consistent key-value store, Raft-replicated across 3 or 5 members. It holds every cluster object. If etcd loses quorum, the cluster stops accepting writes: existing workloads keep running, but nothing can be changed.
Drive actual state toward desired
Independent loops watching specific resources and nudging reality toward the declared spec. Scheduler, deployment, replica-set, endpoint, kubelet — all the same shape: watch + reconcile.
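The API server's write path described in the first card above (authenticate, authorize, mutate, validate, persist, broadcast) can be sketched as a toy in-memory pipeline. This is an illustration under loud assumptions: `ToyAPIServer`, its token table, the "only admin may write" rule, and the default-namespace mutation are all hypothetical stand-ins, not the real kube-apiserver.

```python
import itertools


class ToyAPIServer:
    """Hypothetical in-memory stand-in for the API server's write path."""

    def __init__(self, tokens):
        self.tokens = tokens            # token -> user (assumed auth scheme)
        self.store = {}                 # stands in for etcd
        self.watchers = []              # callbacks standing in for open watches
        self._rv = itertools.count(1)   # increasing resourceVersion

    def create(self, token, obj):
        user = self.tokens.get(token)
        if user is None:                                   # 1. authenticate
            raise PermissionError("401 Unauthorized")
        if user != "admin":                                # 2. authorize (toy rule)
            raise PermissionError("403 Forbidden")
        metadata = {"namespace": "default", **obj.get("metadata", {})}
        obj = {**obj, "metadata": metadata}                # 3. mutate (apply defaults)
        if "name" not in metadata:                         # 4. validate
            raise ValueError("422 metadata.name is required")
        metadata["resourceVersion"] = str(next(self._rv))
        self.store[(metadata["namespace"], metadata["name"])] = obj  # 5. persist
        for notify in self.watchers:                       # 6. broadcast to watchers
            notify({"type": "ADDED", "object": obj})
        return obj
```

The point of the sketch is the ordering: nothing reaches the store, and no watcher hears about it, until every earlier stage has passed.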
How the watch protocol keeps components in sync
Components watch the API server for changes instead of polling.
Every controller in Kubernetes opens a streaming HTTP connection to the API server with ?watch=1 and keeps it open, reconnecting whenever it drops. Each event (ADDED, MODIFIED, DELETED) streams down as it happens. When a controller restarts, it does a full LIST first to bootstrap, then resumes the watch from the last resourceVersion it saw.
This is what makes Kubernetes feel reactive instead of polled. When the scheduler assigns a pod, the kubelet on that node knows within milliseconds — not because anyone told it, but because it was already watching, and the API server pushed the change down its open connection.
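The LIST-then-watch bootstrap described above can be sketched with a toy server that keeps an ordered event log. Everything here (`ToyServer`, `ToyController`, the integer resourceVersion) is a hypothetical simplification; real resourceVersions are opaque strings and real watches are HTTP streams.

```python
class ToyServer:
    """Keeps current objects plus an ordered log of change events."""

    def __init__(self):
        self.rv = 0
        self.objects = {}   # name -> object
        self.log = []       # (resourceVersion, event_type, name)

    def apply(self, event_type, name):
        self.rv += 1
        if event_type == "DELETED":
            self.objects.pop(name, None)
        else:
            self.objects[name] = {"name": name, "resourceVersion": self.rv}
        self.log.append((self.rv, event_type, name))

    def list(self):
        # A snapshot plus the version that snapshot is consistent with.
        return sorted(self.objects), self.rv

    def watch(self, since_rv):
        # Every event strictly newer than since_rv, in order.
        return [event for event in self.log if event[0] > since_rv]


class ToyController:
    def __init__(self, server):
        self.server = server
        self.cache = set()
        self.last_rv = 0

    def bootstrap(self):
        names, rv = self.server.list()
        self.cache, self.last_rv = set(names), rv

    def drain_watch(self):
        for rv, event_type, name in self.server.watch(self.last_rv):
            if event_type == "DELETED":
                self.cache.discard(name)
            else:
                self.cache.add(name)
            self.last_rv = rv   # remember where to resume from
```

Because the controller records the last resourceVersion it processed, a reconnect picks up exactly the events it missed instead of replaying the whole history.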
A 5,000-node cluster has 5,000 kubelets. None of them want every pod event in the cluster. Each kubelet asks for fieldSelector=spec.nodeName=node-X — the API server filters server-side, and only pods bound to that node stream to that kubelet. Bandwidth scales linearly, not quadratically.
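As a rough sketch of the server-side filtering described above, the function below evaluates a single `path=value` selector against a pod represented as a nested dict. It handles equality on one field only, and the helper names are hypothetical; it is meant to show the idea, not the API server's actual selector implementation.

```python
def matches_field_selector(pod, selector):
    """Evaluate one 'dotted.path=value' selector against a nested dict."""
    path, _, wanted = selector.partition("=")
    value = pod
    for part in path.split("."):
        if not isinstance(value, dict) or part not in value:
            return False
        value = value[part]
    return value == wanted


def events_for(events, selector):
    """Return only the watch events whose object matches the selector."""
    return [e for e in events if matches_field_selector(e["object"], selector)]
```

With this filter applied before anything is sent, a kubelet asking for `spec.nodeName=node-3` never sees events for pods bound to other nodes.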
The eleven events that create a pod, in order
From the API write to the container actually running.
This section follows a single kubectl apply -f pod.yaml from the command landing on the API server through to the container running on node-3, across five components and eleven hops.
For a new object, everything starts with one HTTP POST. kubectl is a thin client that turns YAML into JSON and pushes it to the API server. There is very little kubectl-side intelligence: every scheduling and runtime decision lives in the cluster.
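A minimal sketch of that first request, assuming a core/v1 Pod that does not exist yet. `build_create_request` is a hypothetical helper that only constructs the method, path, and JSON body; it sends nothing. The path shape `/api/v1/namespaces/<namespace>/pods` is the standard REST path for pods.

```python
import json


def build_create_request(manifest, default_namespace="default"):
    """Build (but do not send) the create request for a core/v1 Pod."""
    if manifest.get("apiVersion") != "v1" or manifest.get("kind") != "Pod":
        raise ValueError("this sketch only handles core/v1 Pods")
    namespace = manifest.get("metadata", {}).get("namespace", default_namespace)
    return {
        "method": "POST",
        "path": f"/api/v1/namespaces/{namespace}/pods",
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(manifest, sort_keys=True),
    }
```

In practice kubectl apply also checks whether the object already exists and sends a PATCH for updates; the sketch covers only the create case.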
How the scheduler picks a node: filter, then score
First rule out nodes that can't fit, then rank the ones that can.
The default scheduler is a small, well-defined program. Every unscheduled pod goes through two passes: filter (predicates that say no), then score (priorities that say better-or-worse). The highest-scoring node wins.
Is this node even possible?
Resource fit (does it have CPU/memory free?), volume attachment, taints, node selectors, affinity rules. Any predicate failure removes the node entirely. Nodes that fail here are invisible.
Among the eligible, which is best?
Least-requested (prefer less-loaded), image-locality (already-cached image saves a pull), topology-spread (keep replicas across zones), inter-pod affinity. Each scores 0–100; the weighted sum picks the winner.
This shape — filter then score — is what makes the scheduler pluggable. You can write a custom scheduler plugin and add it to either pass. The scheduling framework calls every plugin in order; one plugin's no halts filtering, but every plugin contributes to scoring.
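The two passes can be sketched in a few lines. This is a simplified model, not the scheduling framework's API: the node and pod dictionaries, the field names, and the four plugin functions are hypothetical, chosen to mirror the predicates and priorities named above.

```python
def schedule(pod, nodes, filters, scorers):
    """Return the best node's name, or None if no node passes filtering.

    filters: list of fn(pod, node) -> bool
    scorers: list of (weight, fn(pod, node) -> score in 0..100)
    """
    # Filter pass: all() stops at the first plugin that says no.
    feasible = [n for n in nodes if all(f(pod, n) for f in filters)]
    if not feasible:
        return None

    # Score pass: every plugin contributes to a weighted sum.
    def total(node):
        return sum(weight * score(pod, node) for weight, score in scorers)

    return max(feasible, key=total)["name"]


def fits_resources(pod, node):
    return pod["cpu"] <= node["free_cpu"] and pod["mem"] <= node["free_mem"]


def tolerates_taints(pod, node):
    return set(node["taints"]) <= set(pod["tolerations"])


def least_requested(pod, node):
    # More CPU left free after placement means a higher score.
    return 100 * (node["free_cpu"] - pod["cpu"]) / node["cpu"]


def image_locality(pod, node):
    return 100 if pod["image"] in node["images"] else 0
```

A node rejected in the filter pass never reaches scoring, which is why an unschedulable pod reports filter failures rather than low scores.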
The kubelet: the agent that runs pods on each node
It takes the assigned pod spec and makes the container real.
The kubelet is just another controller: it watches pods and reconciles toward the desired state. What's special is what reconcile means here: it has to make Linux processes happen. The kubelet itself never starts a container; it tells the container runtime to.
The runtime (containerd, CRI-O, historically Docker) speaks a gRPC interface called the Container Runtime Interface. Behind that interface, runc (or Kata Containers, gVisor, youki) actually clone()s the process into a fresh set of Linux namespaces, applies cgroups, mounts the rootfs, and execs the container command.
- CRI
RunPodSandbox
Creates the pause container — a tiny, almost-empty process that holds the pod's network namespace. Every container in the pod joins this sandbox. The CNI plugin attaches the network here, and the IP belongs to the sandbox, not the workload.
- CRI
PullImage · CreateContainer · StartContainer
Three separate gRPC calls. The image is fetched and unpacked; a writable overlay is stacked on top of the read-only layers; the runtime forks the entrypoint into the sandbox's namespaces. Standard output and standard error are written to log files on the node; when you ask for logs, the API server's log endpoint proxies the request to the kubelet, which reads them back.
- CNI
Network setup
The CNI plugin (Calico, Cilium, AWS VPC CNI, Flannel) is invoked once per sandbox, as part of sandbox creation and before any workload container starts. It allocates an IP from the pod CIDR, creates a veth pair, and attaches one end to the sandbox's network namespace and the other to a host bridge, tunnel, or eBPF program. The IP is then returned to the kubelet for the status update.
- probes
Liveness + readiness
Once the container has started, the kubelet begins firing the configured probes. A readiness failure removes the pod from Service endpoints (no traffic) but leaves it running. A liveness failure kills and restarts the container. Wrong probe config is a leading cause of "my pod keeps restarting."
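The difference between the two probe types can be sketched as two counters with different consequences. This is a toy model, not kubelet code: `ProbeTracker` and `simulate` are hypothetical names, and the defaults used (failureThreshold 3, successThreshold 1) match the documented probe defaults.

```python
class ProbeTracker:
    """Flips state only after enough consecutive contrary results."""

    def __init__(self, failure_threshold=3, success_threshold=1, initially_ok=False):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.ok = initially_ok
        self._streak = 0   # consecutive results that disagree with self.ok

    def record(self, success):
        """Record one probe result; return True if the state flipped."""
        if success == self.ok:
            self._streak = 0
            return False
        self._streak += 1
        needed = self.success_threshold if success else self.failure_threshold
        if self._streak >= needed:
            self.ok = success
            self._streak = 0
            return True
        return False


def simulate(ticks):
    """ticks: list of (ready_ok, live_ok) probe results, one pair per period."""
    readiness = ProbeTracker(initially_ok=False)   # not Ready until a probe passes
    liveness = ProbeTracker(initially_ok=True)     # assumed alive until probes fail
    restarts = 0
    for ready_ok, live_ok in ticks:
        readiness.record(ready_ok)
        if liveness.record(live_ok):               # flipped to failing
            restarts += 1                          # container is killed and restarted
            readiness = ProbeTracker(initially_ok=False)
            liveness = ProbeTracker(initially_ok=True)
    return {"in_endpoints": readiness.ok, "restarts": restarts}
```

Three failed readiness probes take the pod out of rotation with zero restarts; three failed liveness probes cost a restart, and the fresh container has to earn readiness again.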
Reconciliation loops never stop running
Controllers keep nudging actual state toward the state you declared.
The eleven steps you just walked through are not a procedure. None of them is "called" by anything. Each component watches some piece of state and does its small job whenever the watched state changes. Pod creation is just one well-defined trajectory through that watch graph.
When you delete a pod, the kubelet sees the deletion event, gracefully terminates the container, removes the network attachment, and deletes the sandbox. When a node dies, the node-lifecycle controller sees the heartbeat stop, marks the node NotReady, and eventually evicts the pods. The ReplicaSet controller notices the missing pods and creates replacements, which the scheduler picks up. Same loop, different starting state.
Procedural systems (do A, then B, then C) fail when any step fails — there's no clean resumption point. Reconciliation has no procedure to fail. Drift from desired state, in any direction, is just the next reconcile loop's input.
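A small sketch of why a half-finished reconcile is harmless. The functions below are hypothetical and model only a replica count: `reconcile` computes actions from the current difference, and `apply` can be cut short with `limit` to simulate a crash partway through.

```python
import itertools

_pod_ids = itertools.count(1)   # source of unique toy pod names


def reconcile(desired_replicas, running):
    """One pass: compare desired with actual and return the actions needed."""
    diff = desired_replicas - len(running)
    if diff > 0:
        return [("create", None)] * diff
    if diff < 0:
        return [("delete", name) for name in sorted(running)[:-diff]]
    return []


def apply(actions, running, limit=None):
    """Apply up to `limit` actions; a small limit simulates a mid-pass crash."""
    for kind, name in actions[:limit]:
        if kind == "create":
            running.add(f"pod-{next(_pod_ids)}")
        else:
            running.discard(name)
```

After a crash that applied one of three creates, the next pass simply sees two pods missing and asks for two more. Nothing has to remember where the previous pass stopped.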
Where pods get stuck, and how to debug it
The common failure states and what each one is telling you.
The eleven-hop trace is also a debugging map. When a pod is "not running," it has stalled at one of those steps — and each has a recognizable signature.
Scheduler can't place it
Stuck at step 7, with the pod in Pending. Either no node has the requested resources, every eligible node has a taint the pod doesn't tolerate, or a volume can't attach. kubectl describe pod shows the scheduler's events.
The node can’t pull the image
Stuck at step 10·CRI·PullImage, with the container waiting in ErrImagePull or ImagePullBackOff. Wrong image name, wrong tag, missing imagePullSecret for a private registry, or the registry is down. The kubelet retries with exponential backoff. The pod itself was scheduled fine.
Container starts, then dies
Past step 10, usually showing CrashLoopBackOff. The container keeps starting and exiting: bad config, missing env var, wrong entrypoint, dependency unreachable. Logs are everything here. kubectl logs --previous reads the output of the dead instance.
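The three signatures above can be read straight off the pod's status. The sketch below is a hypothetical triage helper that takes a pod as a dict (for example, parsed from kubectl get pod -o json) and uses the standard status fields: the PodScheduled condition, and each container status's state.waiting.reason.

```python
def diagnose(pod):
    """Map a pod's status to the stage where it appears to have stalled."""
    status = pod.get("status", {})

    # Scheduler could not place the pod.
    for cond in status.get("conditions", []):
        if cond.get("type") == "PodScheduled" and cond.get("status") == "False":
            return "scheduling: " + cond.get("reason", "Unschedulable")

    # Scheduled, but a container is waiting on something.
    for cs in status.get("containerStatuses", []):
        reason = cs.get("state", {}).get("waiting", {}).get("reason", "")
        if reason in ("ErrImagePull", "ImagePullBackOff"):
            return "image pull: " + reason
        if reason == "CrashLoopBackOff":
            return "container start: CrashLoopBackOff"

    if status.get("phase") == "Running":
        return "running"
    return "unknown"
```

It covers only the three cases discussed here; anything else falls through to "unknown" and still needs kubectl describe pod.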
Writing your own Kubernetes controller
The same watch-and-reconcile pattern is yours to use.
The control plane is not magic. It is a watch loop, a scheduler, and a per-node reconciler, and the watch loop is open to you. Define a Custom Resource Definition, write a controller that watches it and reconciles toward the spec, and you have just extended Kubernetes.
This is the operator pattern. cert-manager watches Certificate CRs and provisions TLS via ACME. prometheus-operator watches ServiceMonitor CRs and updates Prometheus configs. The Postgres operator watches Postgres CRs and runs an HA cluster. Every one of them is the same shape as the pod flow you just walked through.
To go deeper, see the K8s networking guide for what happens after the pod is running.
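One detail most controllers share is a work queue keyed by object name, so that a burst of events for the same object collapses into a single reconcile. The sketch below is a simplified, hypothetical model of that idea; `WorkQueue`, `run_controller`, and the event shape are invented for illustration and are not any library's API.

```python
from collections import deque


class WorkQueue:
    """Deduplicating FIFO of object keys."""

    def __init__(self):
        self._queue = deque()
        self._pending = set()

    def add(self, key):
        if key not in self._pending:   # already queued: nothing to do
            self._pending.add(key)
            self._queue.append(key)

    def get(self):
        key = self._queue.popleft()
        self._pending.discard(key)
        return key

    def __len__(self):
        return len(self._queue)


def run_controller(events, desired_store, reconcile):
    """Queue keys from watch events, then reconcile each key once."""
    queue = WorkQueue()
    for event in events:
        queue.add(event["key"])        # only the key; state is re-read later
    results = []
    while len(queue):
        key = queue.get()
        results.append(reconcile(key, desired_store.get(key)))
    return results
```

Because the reconcile function reads the current desired state by key rather than acting on the event payload, it does not matter how many events were coalesced or in what order they arrived.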
Pod startup latency: where the seconds go
The cold-start budget, broken down.
From kubectl apply to the container's first byte served, the typical pod takes 5-30 seconds. Where the time goes:
- API server admission + etcd write: ~50-200ms. Validates the spec, runs admission webhooks, writes to etcd, broadcasts the watch event.
- Scheduler decision: ~100-500ms. Filters and scores nodes. Usually one of the faster hops; slowness here typically points to an expensive scheduler plugin, while a slow admission webhook shows up in the previous step.
- Image pull: ~1-30s. Dominant when the image cache is cold. Can be cut with image streaming (containerd's lazy image pull, AWS SOCI) or by warming the local cache via a DaemonSet.
- Container runtime sandbox creation: ~100-500ms. The pause container, the network namespace, the cgroup tree, the storage mount. Fast on its own; init containers, which run one after another once the sandbox exists, add their own time before the main containers start.
- CNI plugin: ~50-300ms. Allocates the pod IP and plumbs the routes. AWS VPC CNI was historically slow (~1-2s) on cold ENIs; prefix delegation (2021) cut this to ~100ms.
- Container start: ~50ms to N seconds. Depends entirely on application startup. JVM cold start ~5-10s; Go binary ~50ms; Python with imports ~200-500ms.
- Readiness probe success: roughly initialDelaySeconds plus the wait for the first passing probe. With default settings (initialDelaySeconds 0, periodSeconds 10, successThreshold 1) a single success marks the pod Ready, but a container that started in 50ms still waits for its next probe, and a generous initialDelaySeconds adds directly to every rollout. periodSeconds × failureThreshold (30s with defaults) is how long a failing container takes to be marked unready, not the time to become Ready. A common 'why is my deploy slow' answer.
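Adding up the per-stage ranges above gives a feel for the overall budget. The sketch below sums the low and high ends of each stage before the readiness probe. The 10-second upper bound for container start is an assumption taken from the JVM figure, since the list leaves that bound open, and the readiness stage is left out because it depends on probe settings.

```python
# (stage, low_ms, high_ms), taken from the ranges listed above.
STAGES_MS = [
    ("API server admission + etcd write", 50, 200),
    ("Scheduler decision", 100, 500),
    ("Image pull", 1_000, 30_000),
    ("Sandbox creation", 100, 500),
    ("CNI plugin", 50, 300),
    ("Container start", 50, 10_000),   # upper bound assumed (JVM-style cold start)
]


def budget_seconds(stages, skip=()):
    """Return (best_case, worst_case) in seconds, ignoring stages in `skip`."""
    low = sum(lo for name, lo, hi in stages if name not in skip)
    high = sum(hi for name, lo, hi in stages if name not in skip)
    return low / 1000, high / 1000


cold = budget_seconds(STAGES_MS)                        # (1.35, 41.5)
warm = budget_seconds(STAGES_MS, skip=("Image pull",))  # (0.35, 11.5)
```

The worst case is every stage hitting its upper bound at once, which is rarer than the typical figure; the more useful reading is the gap between the cold and warm totals, which is entirely the image pull.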
Scale issues. During rolling deploys of thousands of pods, etcd's update rate and the kubelet's image-pull throughput become bottlenecks. EKS/GKE clusters with more than 5,000 nodes routinely measure pod-start times in minutes during big rollouts. Mitigations: pre-pull images via a DaemonSet, use container-runtime image streaming, and cap image size aggressively (Distroless and Alpine help).
Kubernetes feels enormous from the outside, but the engine is small. One read-write path through the API server. A consistent store behind it. Watches, all the way down. Every concept that looks like new vocabulary is one of the four components (API server, etcd, controllers, kubelets) in a different role. Internalize the trace and the rest is just labels for things you already understand. When you want to take each piece apart on its own, the Kubernetes internals notes walk the API server, scheduler, kubelet, and etcd one at a time.
Read further
- kubernetes.io · Cluster Components: The official architecture overview. Short, accurate, and the canonical name for every piece referenced in Part 01.
- kubernetes.io · Scheduling Framework: How filter / score plugins compose. Read this before writing a custom scheduler, and before complaining about why a pod is unscheduled.
- kubernetes/community · API Conventions: Why every object has metadata, spec, and status, and how the watch protocol's resourceVersion works. Required reading for anyone writing controllers.
- Semicolony guide · K8s networking: What happens after step 11: pod-to-pod, pod-to-Service, Ingress. The CNI is where this guide ended; that one starts there.
- Semicolony guide · Containers: Namespaces, cgroups, OCI: what runc actually does at step 10. Foundation for everything in Part 05.
- Semicolony tool · Probe Generator: Generate the liveness / readiness / startup YAML referenced in Part 05. Sane defaults, full annotations.