Sub-page 02 · for infra + operator authors

Kubernetes internals · Apply lifecycle

Twelve hops from the keystroke to the running pod.

You press Enter on kubectl apply -f deploy.yaml. Roughly two seconds later, a Pod is Running on a Node and serving traffic behind a Service. In between, your request walks through eight processes, four trust boundaries, two webhook chains, one consensus protocol, and somewhere between three and five controllers reconciling in series. This page is the full trace, in order.

Twelve hops, each named, timed with realistic ranges, and accompanied by a sequence diagram. Roughly 4,400 words. Pair it with the architecture sub-page for the static map, and the pod lifecycle sub-page for the kubelet-side machinery once the bind has happened.

The twelve hops, on one canvas.

Most of the time, when you read a sentence like “kubectl apply creates a Pod”, you mentally collapse the entire pipeline into one arrow. That arrow is, in fact, twelve arrows in series, with three different transports, four trust boundaries, and a dozen places where the request can be denied, mutated, queued, or split into many. The diagram below is the whole trace: the master view, of which everything else on this page is a zoomed slice.

Read it left to right, top to bottom. The four lanes are client, api-server, etcd, and node-side. The numbers are the twelve hops you will spend the rest of this page deep-diving into. The vertical axis is rough wall-clock time — not to scale, but ordered correctly. A green checkmark on the right edge means the request has succeeded all the way to a Pod that is Ready and receiving traffic.

Wall-clock budget for the entire path is roughly 1.5 to 4 seconds in a healthy cluster. The fastest hop is the watch fan-out (hop 8) at sub-millisecond — the api-server flushes the encoded event into already-open HTTP/2 streams, so notifying every controller is a memory copy and a flush. The slowest hop is hop 11, the kubelet's CRI work, which dominates whenever it has to pull a container image from a remote registry; for an image already cached on the node, the same hop is sub-second.

Hop	What happens	Where	Realistic time
01	kubectl resolves config + builds REST	client	~5 ms
02	TLS + authentication	api-server	~2 ms
03	Authorisation chain (RBAC)	api-server	~1 ms
04	Mutating admission webhooks	api-server	~5–80 ms
05	Schema + OpenAPI validation	api-server	~1 ms
06	Validating admission webhooks	api-server	~5–60 ms
07	etcd Txn — MVCC revision bump	etcd	~3–10 ms
08	Watch fan-out to controllers	api-server	<1 ms
09	Deployment → ReplicaSet → Pod	controller-mgr	~50–200 ms
10	Scheduler binds Pod to a Node	scheduler	~10–80 ms
11	kubelet observes, calls CRI	kubelet + CRI	~1–30 s
12	Pod Ready, EndpointSlice updated	kubelet + ep-controller	~50–500 ms

The single most important architectural property to internalise about this trace: every component except kubectl talks only to the api-server. The scheduler does not call the kubelet. The Deployment controller does not call the ReplicaSet controller; it writes a ReplicaSet object to the api-server, and a different process picks it up via watch. The cluster's coordination is entirely state-driven, and the state lives behind one front door.

Everything below is one annotated zoom into one of the bands above. If you read nothing else, read part 04 (admission) and part 06 (watch fan-out) — those are the two places where the most interesting Kubernetes behaviour lives, and where the most production incidents originate.

Hop 01 — kubectl resolves config and builds the REST request.

The first thing that happens when you press Enter is that kubectl does a surprising amount of work locally before any byte goes on the wire. It parses your command line, loads ~/.kube/config (and any files merged into it via KUBECONFIG), resolves the current context to a cluster URL and a credential, runs an exec-credential plugin if one is configured (the aws eks get-token dance, the gke-gcloud-auth-plugin, an OIDC provider's CLI), and builds a REST client pointing at https://api.k8s.example.com:6443. None of this touches the cluster.

Then it parses your YAML. The file is decoded into one or more unstructured.Unstructured objects in Go. For each object kubectl computes the last-applied-configuration annotation, a JSON serialisation of the object the user just provided, stored on the object itself so the next kubectl apply can do a three-way merge. (Server-side apply, the modern path, dispenses with this annotation and tracks ownership in a structured managedFields entry; we will treat both modes here.) Then kubectl looks up the GroupVersionKind through the api-server's discovery endpoints (/api and /apis, served as a single aggregated document on modern clusters) to find the right REST URL to PATCH or POST to. The OpenAPI schema at /openapi/v3 is a separate document, used for validation and kubectl explain rather than for URL resolution.

That discovery call is itself a small drama. kubectl caches the discovery documents under ~/.kube/cache/discovery/<cluster>/ and refreshes them when their TTL expires (ten minutes in older kubectl releases, six hours in recent ones). On a cold cache it makes one large request and walks every APIService the cluster exposes. This is why kubectl apply sometimes has a one-second “thinking” delay on the first run after you switch contexts. If you ever wondered why your CI image has hundreds of files under ~/.kube/cache/ after a few minutes, that is what they are.

With the URL resolved, kubectl assembles the REST request. For a Deployment in the default namespace, it is PATCH /apis/apps/v1/namespaces/default/deployments/web with a content-type of application/apply-patch+yaml in the server-side-apply mode, or application/strategic-merge-patch+json in client-side. The body is your YAML, possibly transformed by a kustomize overlay or a kubectl plugin. The headers carry your bearer token (or a TLS client cert is set up at the transport layer) and a User-Agent like kubectl/v1.30.4 (linux/amd64) kubernetes/… which the api-server will record in its audit log.
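The URL and content-type rules in the paragraph above can be sketched in a few lines of Go. This is an illustration only, not kubectl's code: the function names resourcePath and patchContentType are hypothetical, and the real client derives the path from discovery data rather than from string arguments.

```go
package main

import (
	"fmt"
	"strings"
)

// resourcePath builds the REST path for an object (hypothetical helper).
// The core ("") group lives under /api; every other group under /apis/<group>.
func resourcePath(group, version, namespace, resource, name string) string {
	parts := []string{""} // leading empty element yields the leading slash
	if group == "" {
		parts = append(parts, "api", version)
	} else {
		parts = append(parts, "apis", group, version)
	}
	if namespace != "" {
		parts = append(parts, "namespaces", namespace)
	}
	parts = append(parts, resource)
	if name != "" {
		parts = append(parts, name)
	}
	return strings.Join(parts, "/")
}

// patchContentType returns the Content-Type for the two apply modes
// named in the text above.
func patchContentType(serverSide bool) string {
	if serverSide {
		return "application/apply-patch+yaml"
	}
	return "application/strategic-merge-patch+json"
}

func main() {
	fmt.Println("PATCH", resourcePath("apps", "v1", "default", "deployments", "web"))
	fmt.Println("Content-Type:", patchContentType(true))
}
```

The only structural difference between the legacy core group and every other group is the /api versus /apis prefix, which is why Pods and Deployments end up under different roots.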

The clearest window into all of this is kubectl get pods -v=8, which dumps the wire calls and headers as kubectl makes them. A real trace looks like:

# kubectl get pods -v=8 (abbreviated)
I0503 14:22:05.110  loading config file "/home/u/.kube/config"
I0503 14:22:05.118  GET https://api.k8s.example.com:6443/api/v1/namespaces/default/pods
I0503 14:22:05.118  Request Headers:
                       Accept: application/json;as=Table;...,application/json
                       Authorization: Bearer <redacted>
                       User-Agent: kubectl/v1.30.4 (linux/amd64) kubernetes/abc1234
I0503 14:22:05.144  Response Status: 200 OK in 26 ms
I0503 14:22:05.144  Response Headers:
                       Audit-Id: 8d3f...c91
                       Cache-Control: no-cache, private
                       Content-Type: application/json
I0503 14:22:05.146  Response Body: {"kind":"Table","apiVersion":"meta.k8s.io/v1","columnDefinitions":[...]}

Two more client-side things worth knowing. First, --dry-run=client does the YAML parse and the discovery lookup, then prints what would have been sent and stops; --dry-run=server sends the request with a ?dryRun=All query parameter, which makes the api-server run the full pipeline through admission and validation but skip the etcd write. Server dry-run is the right CI primitive — it tells you whether your manifest will actually be accepted by this cluster, with this set of admission webhooks installed, including any webhook that synthesises defaults. Second, kubectl is also where field-manager strings come from: every server-side-apply request includes a ?fieldManager=kubectl query parameter (you can override it), which the api-server records in .metadata.managedFields so future applies know which fields to preserve and which to overwrite.

Operational note — kubectl's HTTP transport reuses TCP connections via http.Transport, but only within a single process: back-to-back requests inside one kubectl invocation amortise the TLS handshake, while every new invocation starts cold. The first request of each process pays a 30–80 ms handshake cost; later requests on the same connection skip it. CI runners that fork a new kubectl per resource are leaving this on the table; passing many manifests to one kubectl apply avoids it.

Hops 02 + 03 — TLS, authentication, RBAC.

The api-server's request handler is a chain of small middlewares. Authentication is the first; authorisation is the second. Together they take roughly two to three milliseconds in the happy path, and they are non-negotiable: every single request (yours, every controller's, every kubelet's heartbeat) passes through both. There is no internal endpoint that bypasses them. If you are reading kube-apiserver's source for the first time, start with the filter packages under staging/src/k8s.io/apiserver/pkg/: endpoints/filters/ holds the authentication and authorisation stages, server/filters/ holds the timeout and in-flight limits, and each stage is a wrapped http.Handler.

Authentication is itself a chain. The api-server is configured with a list of authenticators, tried in order, and the first one that returns a non-anonymous identity wins. The standard list, from kubeadm-style installs: TLS client cert (the certificate's CN becomes the username, the O values become groups), bearer token (looked up against the static-token file or the ServiceAccount token signer), bootstrap token (used during node join), webhook authn (an external OIDC or LDAP service), and finally the anonymous authenticator, which assigns the user system:anonymous in group system:unauthenticated. ServiceAccount tokens are JWTs signed by the cluster's signing key, validated locally by the api-server without any network call.
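A minimal sketch of the "first authenticator that succeeds wins, anonymous is the fallback" rule described above. Everything here is a simplification for illustration: the request and identity types and the staticTokens helper are hypothetical, and real authenticators verify certificates and signatures rather than comparing strings.

```go
package main

import (
	"fmt"
	"strings"
)

// identity is the result of a successful authentication.
type identity struct {
	User   string
	Groups []string
}

// request is a hypothetical, stripped-down view of an incoming call.
type request struct {
	CertCN      string // empty when no client certificate was presented
	CertOrgs    []string
	BearerToken string
}

// authenticator returns (identity, true) when it recognises the credential.
type authenticator func(r request) (identity, bool)

// clientCert maps the certificate CN to the username and O values to groups.
func clientCert(r request) (identity, bool) {
	if r.CertCN == "" {
		return identity{}, false
	}
	return identity{User: r.CertCN, Groups: r.CertOrgs}, true
}

// staticTokens builds a bearer-token authenticator from a token-to-user table.
func staticTokens(table map[string]string) authenticator {
	return func(r request) (identity, bool) {
		user, ok := table[r.BearerToken]
		if !ok {
			return identity{}, false
		}
		return identity{User: user, Groups: []string{"system:authenticated"}}, true
	}
}

// authenticate tries each authenticator in order; the first success wins,
// and the anonymous identity is the fallback when none recognises the request.
func authenticate(chain []authenticator, r request) identity {
	for _, a := range chain {
		if id, ok := a(r); ok {
			return id
		}
	}
	return identity{User: "system:anonymous", Groups: []string{"system:unauthenticated"}}
}

func main() {
	chain := []authenticator{
		clientCert,
		staticTokens(map[string]string{"s3cr3t": "alice@example.com"}),
	}
	id := authenticate(chain, request{BearerToken: "s3cr3t"})
	fmt.Println(id.User, strings.Join(id.Groups, ","))
}
```

Because the chain is ordered, a request that carries both a client certificate and a bearer token is identified by whichever authenticator is listed first.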

Authorisation is also a chain, of authorisers. The default order on managed clusters is Node, RBAC, Webhook. Each authoriser returns one of three answers (Allow, Deny, or NoOpinion) and they are consulted in order: the first Allow or Deny is final and short-circuits the rest of the chain, and if every authoriser says NoOpinion the request is denied. The Node authoriser is a special-case authoriser that, together with the NodeRestriction admission plugin, confines a kubelet to its own Node object and to the Pods bound to that Node; it is the reason a compromised kubelet cannot pretend to be a different kubelet. RBAC is the workhorse: the api-server reads ClusterRoleBindings and RoleBindings from its watch cache and matches the request's verb / resource / namespace against the rules bound to the user and the user's groups.
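The chain evaluation can be sketched as below. This is a simplified model with hypothetical type names, and it assumes the ordering semantics of the upstream union authoriser as I understand them: authorisers are consulted in order, the first Allow or Deny is final, and an all-NoOpinion chain ends in denial.

```go
package main

import "fmt"

type decision int

const (
	noOpinion decision = iota
	allow
	deny
)

// attributes is a hypothetical, reduced form of the request attributes.
type attributes struct {
	User, Verb, Resource, Namespace string
}

type authorizer func(a attributes) decision

// fixed returns an authorizer that always answers d (handy for experiments).
func fixed(d decision) authorizer {
	return func(attributes) decision { return d }
}

// authorize walks the chain in order. The first authorizer that returns
// allow or deny decides; if every authorizer abstains, the request is denied.
func authorize(chain []authorizer, a attributes) bool {
	for _, authz := range chain {
		switch authz(a) {
		case allow:
			return true
		case deny:
			return false
		}
	}
	return false
}

func main() {
	// Stand-in for RBAC: allows alice to patch deployments, abstains otherwise.
	rbac := func(a attributes) decision {
		if a.User == "alice@example.com" && a.Verb == "patch" && a.Resource == "deployments" {
			return allow
		}
		return noOpinion
	}
	chain := []authorizer{fixed(noOpinion), rbac}
	req := attributes{User: "alice@example.com", Verb: "patch", Resource: "deployments", Namespace: "default"}
	fmt.Println("allowed:", authorize(chain, req))
}
```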

It is worth being concrete. When you run kubectl apply -f deploy.yaml as the user alice@example.com, the request becomes the verb patch on deployments in the apps group, in the default namespace. RBAC walks every binding whose subjects include alice@example.com or one of her groups (system:authenticated, plus any custom group your IDP attached) and checks whether any of them grants patch on apps/deployments in default. If a binding does, you get an Allow. If none does, RBAC returns NoOpinion (never Deny: RBAC is purely additive, and explicit denial comes only from other authorisers such as a webhook), and unless a later authoriser allows it, the request is rejected with 403 Forbidden and a human-readable message naming the user, verb, resource, and namespace that were refused.
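The matching step can be approximated with a small rule evaluator. This is a sketch under stated assumptions: policyRule and roleBinding are hypothetical, flattened versions of the real RBAC objects (a binding here carries its rules inline instead of referencing a Role), and only the "*" wildcard is modelled.

```go
package main

import "fmt"

type policyRule struct {
	Verbs     []string
	APIGroups []string
	Resources []string
}

// roleBinding is a flattened, hypothetical binding: subjects plus inline rules.
type roleBinding struct {
	Namespace string // "" models a ClusterRoleBinding
	Subjects  []string
	Rules     []policyRule
}

// matches reports whether list contains want or the "*" wildcard.
func matches(list []string, want string) bool {
	for _, item := range list {
		if item == "*" || item == want {
			return true
		}
	}
	return false
}

// overlaps reports whether the two lists share at least one element.
func overlaps(a, b []string) bool {
	for _, x := range a {
		for _, y := range b {
			if x == y {
				return true
			}
		}
	}
	return false
}

// rbacAllows returns true if any binding that applies to the user (directly
// or through a group) grants the verb on the resource. A false result is the
// sketch's equivalent of NoOpinion, not of Deny.
func rbacAllows(bindings []roleBinding, user string, groups []string,
	verb, apiGroup, resource, namespace string) bool {
	identities := append([]string{user}, groups...)
	for _, b := range bindings {
		if b.Namespace != "" && b.Namespace != namespace {
			continue
		}
		if !overlaps(b.Subjects, identities) {
			continue
		}
		for _, r := range b.Rules {
			if matches(r.Verbs, verb) && matches(r.APIGroups, apiGroup) && matches(r.Resources, resource) {
				return true
			}
		}
	}
	return false
}

func main() {
	bindings := []roleBinding{{
		Namespace: "default",
		Subjects:  []string{"dev"},
		Rules:     []policyRule{{Verbs: []string{"get", "patch"}, APIGroups: []string{"apps"}, Resources: []string{"deployments"}}},
	}}
	ok := rbacAllows(bindings, "alice@example.com", []string{"system:authenticated", "dev"},
		"patch", "apps", "deployments", "default")
	fmt.Println("allowed:", ok)
}
```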

A subtle and frequently missed detail: RBAC matches are evaluated against the requesting user, not the resource's owner or any future owner. A controller running as a ServiceAccount has whatever permissions you granted that ServiceAccount, no more. The most common cluster misconfiguration is a controller given cluster-admin because nobody wanted to enumerate the actual verbs it needs. The fix is always the same: turn on the api-server's audit log at RequestResponse level for one day, watch which calls the controller actually makes, and write a tight Role from that.

Operational note — RBAC is decided in microseconds because it is fully in-memory; the api-server keeps an authoriser cache that is updated reactively from the watch cache for RoleBinding and ClusterRoleBinding. A kubectl create rolebinding takes effect cluster-wide within one watch round-trip, typically under 100 ms.

Hops 04, 05, 06 — admission and schema validation.

Now things get interesting. The request has been authenticated and authorised; the api-server believes you are who you say you are, and that you are allowed to do what you are asking. But it is not yet willing to write your object. Three more stages stand between you and etcd: mutating admission webhooks (which can rewrite your object), schema validation (which checks the shape against the OpenAPI schema), and validating admission webhooks (which can reject the rewritten object). This is where almost every interesting policy in modern Kubernetes lives — Pod Security Admission, OPA Gatekeeper, Kyverno, sidecar injectors, image-policy webhooks, namespace defaulters, the whole lot.

Mutating admission goes first. The api-server walks every MutatingWebhookConfiguration in the cluster, picks out the ones whose rules match the GVK and operation, and calls each webhook's URL with an AdmissionReview request body. The webhook returns either an Allow with no patch (the object is unchanged), an Allow with a base64-encoded JSON Patch (the only patch type admission.k8s.io/v1 accepts; the object is rewritten in place), or a Deny (the request is rejected). The api-server calls mutating webhooks serially and applies each patch before calling the next, so a label defaulter that runs after a sidecar injector sees the injected sidecar; the order is not something to depend on, and reinvocationPolicy exists for webhooks that must see later changes. Each webhook gets a configurable timeout (default 10 s, capped at 30). A webhook with failurePolicy: Ignore fails open, which keeps the cluster running but silently skips the policy; a webhook with failurePolicy: Fail fails closed, and is a tripwire that can take the cluster down.

Then the api-server runs schema validation. This is the OpenAPI schema check: every field in your object must match the type declared in the published schema for its GVK, with the special x-kubernetes-* extensions (preserving unknown fields, int-or-string, list keys for SSA, etc.) honoured. With strict field validation (kubectl's --validate=strict, the default once server-side field validation was enabled in 1.25), unknown and duplicate fields are rejected with a clear error; with warn they are dropped and reported as a warning, and with ignore they are silently dropped. CEL validation rules, declared on CRDs via x-kubernetes-validations, also run here, in the api-server's process, at sub-millisecond latency. CEL is the modern way to express cross-field constraints without a webhook.

Finally validating admission. Same machinery as mutating: the api-server walks every ValidatingWebhookConfiguration, calls every matching webhook's URL, and accepts the request only if every webhook returns Allow. Validating webhooks cannot mutate. Their job is to inspect the post-mutation, post-schema-validation object and render a final yes-or-no verdict. This is the phase where Pod Security Admission runs (as a built-in plugin rather than a webhook), where Image Policy lives, and where most OPA / Gatekeeper rules live. If any validating webhook denies, the entire request is rejected with the webhook's status message bubbled up to kubectl.

The webhook protocol is small but exact. Here is what the api-server sends to a mutating webhook, abbreviated:

# POST https://webhook.example.com/mutate
# Content-Type: application/json
{
  "apiVersion": "admission.k8s.io/v1",
  "kind": "AdmissionReview",
  "request": {
    "uid": "8d3f...c91",
    "resource": { "group": "apps", "version": "v1", "resource": "deployments" },
    "operation": "UPDATE",
    "userInfo": { "username": "alice@example.com", "groups": ["dev"] },
    "object": { ...the post-decoding Deployment YAML... },
    "oldObject": { ...the previous version, for UPDATE ops... }
  }
}

# The webhook's response, with a JSON-Patch that adds a sidecar:
{
  "apiVersion": "admission.k8s.io/v1",
  "kind": "AdmissionReview",
  "response": {
    "uid": "8d3f...c91",
    "allowed": true,
    "patchType": "JSONPatch",
    "patch": "W3sib3AiOiJhZGQiLCJwYXRoIjoiL3NwZWMvY29udGFp..."  // base64'd JSON-Patch
  }
}
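To make the base64 step in the response above concrete, here is a small round-trip sketch using only the Go standard library. The struct names are hypothetical and cover just the fields shown in the sample; a real webhook would use the admission/v1 types from k8s.io/api instead. The sidecar name and image in main are made up for the example.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// patchOp is one RFC 6902 JSON Patch operation.
type patchOp struct {
	Op    string          `json:"op"`
	Path  string          `json:"path"`
	Value json.RawMessage `json:"value,omitempty"`
}

// admissionResponse models only the response fields used in the sample above.
type admissionResponse struct {
	UID       string `json:"uid"`
	Allowed   bool   `json:"allowed"`
	PatchType string `json:"patchType,omitempty"`
	Patch     []byte `json:"patch,omitempty"` // encoding/json emits []byte as base64
}

// buildResponse serialises the patch and wraps it in an allowing response.
func buildResponse(uid string, ops []patchOp) ([]byte, error) {
	raw, err := json.Marshal(ops)
	if err != nil {
		return nil, err
	}
	return json.Marshal(admissionResponse{UID: uid, Allowed: true, PatchType: "JSONPatch", Patch: raw})
}

// decodePatch does what the receiving side must do: base64-decode the
// "patch" string, then parse the JSON Patch operations inside it.
func decodePatch(body []byte) ([]patchOp, error) {
	var wire struct {
		Patch string `json:"patch"`
	}
	if err := json.Unmarshal(body, &wire); err != nil {
		return nil, err
	}
	raw, err := base64.StdEncoding.DecodeString(wire.Patch)
	if err != nil {
		return nil, err
	}
	var ops []patchOp
	if err := json.Unmarshal(raw, &ops); err != nil {
		return nil, err
	}
	return ops, nil
}

func main() {
	ops := []patchOp{{
		Op:    "add",
		Path:  "/spec/template/spec/containers/-",
		Value: json.RawMessage(`{"name":"sidecar","image":"example.com/sidecar:1.0"}`),
	}}
	body, err := buildResponse("example-uid", ops)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
	decoded, err := decodePatch(body)
	if err != nil {
		panic(err)
	}
	fmt.Println(decoded[0].Op, decoded[0].Path)
}
```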

A correctly-built webhook server has three properties: it is fast (responds in under 100 ms even on cold start), it is highly available (run at least two replicas behind a Service), and it scopes its rules and namespaceSelector tightly so it does not get called for unrelated objects. The third one is the most commonly forgotten. A webhook that matches every Pod cluster-wide adds a hop to every Pod creation in the cluster, including the cluster's own kube-system pods, including the webhook itself, and you can produce a bootstrap deadlock by accidentally requiring the webhook to validate its own admission.

The newest path that replaces many simple validating webhooks is ValidatingAdmissionPolicy (GA in 1.30): you write CEL expressions inside a ValidatingAdmissionPolicy object (a built-in API type, not a CRD), bind it with a ValidatingAdmissionPolicyBinding, the api-server evaluates the expressions in-process, and you avoid the webhook hop entirely. For most policy-as-code use cases, this is now the right primitive. See kubernetes.io · Validating admission policy.

Hop 07 — etcd: how an apply becomes a Txn.

The object has survived authentication, authorisation, mutation, schema validation, and validating admission. It is now a fully-baked, well-typed Kubernetes resource sitting in the api-server's memory. Hop 7 is the moment that resource is persisted: the api-server's storage layer translates the in-memory object into a protobuf-encoded byte string, computes the etcd key path, opens an etcd transaction, and commits.

The key path is mechanical and worth memorising: it tells you exactly how the api-server thinks about resources. For a namespaced resource the path is /registry/<resource>/<namespace>/<name>; for a cluster-scoped resource it is /registry/<resource>/<name>; for a custom resource it is /registry/<group>/<resource>/<namespace>/<name>. A Pod called web-7d8 in prod lives at /registry/pods/prod/web-7d8; a ClusterRoleBinding called view lives at /registry/clusterrolebindings/view; an Argo CD Application (argoproj.io/v1alpha1) called guestbook in argocd lives at /registry/argoproj.io/applications/argocd/guestbook.
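The three path rules reduce to one small function. The sketch below follows the rules exactly as stated in the paragraph above and nothing more; etcdKey and encodedKey are hypothetical helper names, and example.com/widgets is an invented custom resource used only for illustration.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"strings"
)

// etcdKey builds the storage key following the rules in the text:
// group is set only for custom resources, namespace only for namespaced ones.
func etcdKey(group, resource, namespace, name string) string {
	parts := []string{"/registry"}
	if group != "" {
		parts = append(parts, group)
	}
	parts = append(parts, resource)
	if namespace != "" {
		parts = append(parts, namespace)
	}
	parts = append(parts, name)
	return strings.Join(parts, "/")
}

// encodedKey returns the key as etcdctl prints it with -w json (base64).
func encodedKey(key string) string {
	return base64.StdEncoding.EncodeToString([]byte(key))
}

func main() {
	pod := etcdKey("", "pods", "prod", "web-7d8")
	fmt.Println(pod)
	fmt.Println(encodedKey(pod))
	fmt.Println(etcdKey("", "clusterrolebindings", "", "view"))
	fmt.Println(etcdKey("example.com", "widgets", "demo", "w1")) // hypothetical CRD
}
```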

# Key metadata etcd holds for a Pod. The value itself is protobuf; decode it with a tool such as auger.
$ ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/apiserver-etcd-client.crt \
    --key=/etc/kubernetes/pki/apiserver-etcd-client.key \
    get /registry/pods/prod/web-7d8 -w json | jq '.kvs[0] | {key, mod_revision, version}'
{
  "key": "L3JlZ2lzdHJ5L3BvZHMvcHJvZC93ZWItN2Q4",  // base64 of /registry/pods/prod/web-7d8
  "mod_revision": 487293,                          // the resourceVersion you see in kubectl
  "version": 3                                       // per-key write count
}

The transaction is the more interesting part. The api-server does not just call etcd's Put; it issues a Txn with a compare-and-swap precondition. For an Update operation on an object the user thinks is at resourceVersion=487290, the transaction reads roughly if the current mod_revision of this key equals 487290 then put the new value, else fail. etcd's Raft handles this atomically: exactly one of any concurrent updates wins; the others get back “condition not met”, which the api-server translates into HTTP 409 Conflict with a body explaining that the resourceVersion the client sent is stale. Every controller in the cluster handles 409 by re-reading and retrying.
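The compare-and-swap can be modelled with an in-memory store. This is a toy, not the etcd client: the store type and its put method are hypothetical, a mutex stands in for Raft's serialisation, and an absent key is treated as having mod_revision 0 so that "create" is the same precondition with an expected value of zero.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errConflict = errors.New("conflict: stored mod_revision differs from the one supplied")

type entry struct {
	value       string
	modRevision int64
}

// store is a toy MVCC key-value store with one global revision counter.
type store struct {
	mu       sync.Mutex
	revision int64
	data     map[string]entry
}

func newStore() *store { return &store{data: map[string]entry{}} }

// put writes value only if the key's current mod_revision equals expected.
// It returns the new revision, or errConflict (the sketch's HTTP 409).
func (s *store) put(key, value string, expected int64) (int64, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.data[key].modRevision != expected {
		return 0, errConflict
	}
	s.revision++
	s.data[key] = entry{value: value, modRevision: s.revision}
	return s.revision, nil
}

func main() {
	s := newStore()
	key := "/registry/pods/prod/web-7d8"
	rev, _ := s.put(key, "v1", 0) // create: key must not exist yet

	// Two writers both read revision rev, then both try to update.
	_, errA := s.put(key, "from-controller-a", rev)
	_, errB := s.put(key, "from-controller-b", rev)
	fmt.Println("writer A:", errA)
	fmt.Println("writer B:", errB)
}
```

The losing writer has to re-read the object, reapply its change on top of the new revision, and try again, which is the retry loop the text describes for controllers.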

This compare-and-swap is the mechanism that keeps the cluster coherent in the face of concurrent writers. Two controllers updating the same Pod's status field in parallel cannot both win; one of them sees a 409 and reconciles fresh. Server-side apply turns this into a field-level merge: each writer claims a set of fields in .metadata.managedFields, the api-server merges non-overlapping changes from different field managers automatically, and only flags actual conflicts. The 409 disappears for non-overlapping writers; for overlapping ones, the user sees a precise error naming the field and the manager that owns it.

After the Txn commits, etcd's Raft has replicated the write to a quorum (in a 3-node etcd, to at least two members) before acknowledging. The commit itself advances etcd's monotonically increasing revision; the api-server keeps no counter of its own. It stamps the new revision on the object as its resourceVersion and returns it to the kubectl client in the response body. That number is now the watermark every other component will see when the watch event fans out.

A subtlety: not all reads are equal. A GET or LIST that sets no resourceVersion is served as a quorum read from etcd, so it sees your write no matter which api-server replica it lands on. A read that sets resourceVersion=0 (which informers use for their initial List) is served from that replica's watch cache, and can briefly return the pre-write value if the cache has not yet caught up. KEP-2340 (consistent reads from cache, beta in 1.31) narrows the gap by letting the api-server answer consistent reads from the cache once it has confirmed the cache is as fresh as etcd.

Production gotcha — etcd disk fsync latency is the floor on every write hop. If your etcd's p99 fsync climbs past ~30 ms, every kubectl apply, every controller reconcile, every kubelet heartbeat slows down by that amount. The metric is etcd_disk_wal_fsync_duration_seconds; alert on the bucket above 50 ms.

Hop 08 — watch fan-out: from one write to many readers.

The instant the etcd transaction commits, every component in the cluster that cares about this resource needs to find out. There are dozens of such components: the Deployment controller is watching Deployments; the scheduler is watching unscheduled Pods; every kubelet is watching Pods scheduled to its own node; kube-proxy is watching Services and EndpointSlices; the HPA controller is watching HorizontalPodAutoscalers; an operator is watching its CRD. Hop 8 is the mechanism that broadcasts the change to all of them, in roughly one millisecond, over a single api-server.

The api-server maintains an in-memory watch cache per resource type. When a write commits, etcd's watch stream pushes the new revision to the api-server, the api-server appends it to the in-memory ring buffer for that resource, and notifies every watcher whose filter matches. Every watcher is a long-lived streaming HTTP/2 response; the api-server holds open thousands of these streams concurrently, one per controller-replica per watched resource. Notifying a watcher is a memory copy plus a write to the stream's buffer. There is no fan-out broadcast inside etcd; etcd has exactly one consumer of its watch, which is the api-server, and the api-server is the multiplexer.

The wire protocol is a chunked HTTP response with one JSON object per chunk. Each chunk is one of ADDED, MODIFIED, DELETED, BOOKMARK, or ERROR. The bookmark is the api-server's way of telling clients “nothing has changed for you, but the global revision has advanced to here” without resending a full event; this lets a controller checkpoint its resourceVersion cursor without burning bandwidth on noise. On the wire it looks like:

# GET /apis/apps/v1/deployments?watch=true&resourceVersion=487290 HTTP/2
# Content-Type: application/json; chunked

{
  "type": "ADDED",
  "object": { "kind": "Deployment", "metadata": { "name": "web", "resourceVersion": "487293" }, ... }
}
{
  "type": "MODIFIED",
  "object": { "kind": "Deployment", "metadata": { "resourceVersion": "487298" }, ... }
}
# …idle for 90s, server emits a bookmark…
{
  "type": "BOOKMARK",
  "object": { "kind": "Deployment", "metadata": { "resourceVersion": "487410" } }
}
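A consumer of that stream only needs a JSON decoder and a cursor. The sketch below parses a stream shaped like the sample above and tracks the last resourceVersion seen, bookmarks included. The watchEvent struct is a hypothetical, minimal projection of the real event, and client-go's informers do considerably more than this.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"strings"
)

// watchEvent keeps only the fields this sketch needs.
type watchEvent struct {
	Type   string `json:"type"`
	Object struct {
		Metadata struct {
			Name            string `json:"name"`
			ResourceVersion string `json:"resourceVersion"`
		} `json:"metadata"`
	} `json:"object"`
}

const sampleStream = `{"type":"ADDED","object":{"kind":"Deployment","metadata":{"name":"web","resourceVersion":"487293"}}}
{"type":"MODIFIED","object":{"kind":"Deployment","metadata":{"name":"web","resourceVersion":"487298"}}}
{"type":"BOOKMARK","object":{"kind":"Deployment","metadata":{"resourceVersion":"487410"}}}
`

// consume reads events until EOF. It returns how many real changes it saw
// and the resourceVersion to resume from after a disconnect.
func consume(r io.Reader) (int, string, error) {
	dec := json.NewDecoder(r)
	changes, cursor := 0, ""
	for {
		var ev watchEvent
		err := dec.Decode(&ev)
		if err == io.EOF {
			return changes, cursor, nil
		}
		if err != nil {
			return changes, cursor, err
		}
		switch ev.Type {
		case "ERROR":
			return changes, cursor, errors.New("server sent an ERROR event")
		case "BOOKMARK":
			// No object change: only the cursor advances.
		default:
			changes++
		}
		cursor = ev.Object.Metadata.ResourceVersion
	}
}

// consumeString is a convenience wrapper for in-memory streams.
func consumeString(s string) (int, string, error) {
	return consume(strings.NewReader(s))
}

func main() {
	changes, cursor, err := consumeString(sampleStream)
	fmt.Println(changes, cursor, err)
}
```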

On the consumer side, every controller built on client-go uses an informer: a wrapper that does an initial List at the current revision to populate a local cache, then opens a Watch starting from that revision to keep the cache fresh. The cache is queryable via a lister, which is just an indexed in-memory map. The controller's reconcile loop reads from the lister, never directly from the api-server, so a busy controller does not hammer the api-server with Gets — it queries its own local copy of the world. This pattern is the backbone of every well-built controller in Kubernetes; the controllers sub-page covers it in depth.

When the watch breaks — network partition, api-server rolling update, idle TCP timeout — the informer reconnects with the last resourceVersion it saw. The api-server replays from its watch cache if the revision is still in scope, or returns 410 Gone if the cache has rotated past it, in which case the informer does a fresh List and reconciles its cache from scratch. This re-list-on-410 pattern is why every well-written controller is idempotent: it has to be ready to be told the current state from scratch at any moment.
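The replay-or-410 decision can be modelled with a bounded buffer that remembers the revision of the last event it evicted. This is a hypothetical sketch of the idea, not the api-server's cacher: the names watchCache, since, and resume are invented here.

```go
package main

import (
	"errors"
	"fmt"
)

var errGone = errors.New("410 Gone: resourceVersion too old")

type event struct {
	RV   int64
	Type string
}

// watchCache keeps the most recent `capacity` events, oldest first.
type watchCache struct {
	capacity  int
	events    []event
	evictedRV int64 // revision of the newest event that has been dropped
}

func (c *watchCache) add(e event) {
	c.events = append(c.events, e)
	if len(c.events) > c.capacity {
		c.evictedRV = c.events[0].RV
		c.events = c.events[1:]
	}
}

// since replays every event newer than rv. If an event newer than rv has
// already been evicted, the replay would have a hole, so it returns errGone.
func (c *watchCache) since(rv int64) ([]event, error) {
	if rv < c.evictedRV {
		return nil, errGone
	}
	var out []event
	for _, e := range c.events {
		if e.RV > rv {
			out = append(out, e)
		}
	}
	return out, nil
}

// resume is the informer's side: replay if possible, otherwise re-list.
func resume(c *watchCache, lastSeen int64) string {
	evs, err := c.since(lastSeen)
	if errors.Is(err, errGone) {
		return "relist"
	}
	return fmt.Sprintf("replay %d events", len(evs))
}

func main() {
	c := &watchCache{capacity: 2}
	for _, rv := range []int64{10, 20, 30} {
		c.add(event{RV: rv, Type: "MODIFIED"})
	}
	fmt.Println(resume(c, 20)) // cursor still inside the window
	fmt.Println(resume(c, 5))  // cursor older than the evicted event
}
```

A consumer that can always fall back to "relist" has to treat the listed state as authoritative, which is where the idempotency requirement comes from.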

Operational note — the watch cache window is sized in objects, not in time. On clusters with very high write rates (heavy operator use, large fleets) the cache can rotate past a disconnected client's cursor in seconds, producing a storm of 410s and re-lists that itself causes an api-server CPU spike. Tuning is per-resource via --watch-cache-sizes; long-term remedy is the streaming-list KEP.

Hop 09 — the reconciliation cascade.

What you typed was a Deployment. A Deployment is, fundamentally, a piece of declarative intent: I want N replicas of this Pod template. It is not a Pod. The translation from Deployment to Pod is performed by two controllers in series, neither of which talks to the other directly; both talk to the api-server, and the watch is the connecting glue.

The Deployment controller, running inside the kube-controller-manager, sees the ADDED watch event for your new Deployment. Its reconcile loop wakes, fetches the Deployment from its lister, and asks: does a ReplicaSet exist for this Deployment, with the right pod-template hash? If not (and on first apply, no), it creates one. The ReplicaSet name is deterministic — <deployment-name>-<hash-of-template> — so the controller can find the existing one on subsequent reconciles. The Deployment controller updates the Deployment's .status with rollout progress, but does not touch Pods directly.

The ReplicaSet controller, also in the controller-manager, also sees the watch event, this time for the new ReplicaSet. Its reconcile loop wakes, fetches the ReplicaSet, and asks: how many Pods exist for this ReplicaSet (matched by .spec.selector), and how many should there be? If the actual count is below the desired, it creates Pods in exponentially growing batches (a slow start that limits the damage when creations are failing), each with .metadata.ownerReferences pointing at the ReplicaSet. If the actual count is above (a scale-down), it deletes Pods, preferring those that are unscheduled, pending, or not ready, then those with a lower controller.kubernetes.io/pod-deletion-cost annotation, then the most recently created.

Each created Pod is itself an ADDED event going back through the watch fan-out. The Pod has no spec.nodeName set yet, which is the signal that wakes the scheduler. We will get to the scheduler in a moment. From the ReplicaSet controller's perspective, its job is done as soon as the Pod object exists; what happens to the Pod afterwards is somebody else's reconcile loop.

The cascade pattern repeats up and down the resource hierarchy. A StatefulSet creates Pods and PVCs; a CronJob creates Jobs which create Pods; a Service does not create Pods but the EndpointSlice controller watches Pods and Services and creates EndpointSlices that kube-proxy and CoreDNS consume. Every parent resource has a controller; every controller watches both its own resource and the children it owns; reconciliation is local — each controller does exactly one transformation. Composition emerges from the watch.

The pattern is so disciplined that you can read every built-in controller in pkg/controller/ as variations on the same template:

// pkg/controller/replicaset/replica_set.go: abridged shape, every controller in K8s
// follows this skeleton. Helper arguments (victims, observed) are elided for brevity.

func (rsc *ReplicaSetController) syncReplicaSet(ctx context.Context, key string) error {
    namespace, name, err := cache.SplitMetaNamespaceKey(key)
    if err != nil { return err }
    rs, err := rsc.rsLister.ReplicaSets(namespace).Get(name)         // from local cache
    if apierrors.IsNotFound(err) { return nil }                      // already deleted
    if err != nil { return err }

    selector, err := metav1.LabelSelectorAsSelector(rs.Spec.Selector)
    if err != nil { return err }
    pods, err := rsc.podLister.Pods(namespace).List(selector)        // also from cache
    if err != nil { return err }
    diff := len(filterActivePods(pods)) - int(*(rs.Spec.Replicas))   // actual minus desired

    switch {
    case diff < 0:                                                   // too few Pods: scale up
        rsc.podControl.CreatePods(ctx, namespace, &rs.Spec.Template, rs)
    case diff > 0:                                                   // too many Pods: scale down
        rsc.podControl.DeletePods(ctx, namespace, victims, rs)
    }

    return rsc.updateStatus(ctx, rs, observed)                       // write .status
}

Worth highlighting: read from cache, write to api-server. The reconcile loop never calls apiserver.Get directly; it always reads through a lister backed by an informer cache. This single discipline is what lets the api-server scale to clusters with thousands of controllers — they all read locally, and only writes go through the api-server's request handler.

Architectural rule — if you write a controller and find yourself wanting to call another controller's HTTP endpoint, you have made a mistake. The right move is always: post a resource to the api-server, let the other controller's watch wake it. Direct RPC between controllers is forbidden by convention. State is the API.

Hop 10 — the scheduler binds the Pod to a Node.

The Pod object now exists in etcd, but it has no spec.nodeName. Until something writes that field, no kubelet will touch it — kubelets only act on Pods whose nodeName equals their own. The scheduler is the only component (by convention) that writes nodeName. It does so by calling the Bind subresource: POST /pods/<name>/binding.

Internally, the scheduler runs the scheduling framework, which is a chain of plugins with extension points. For each unscheduled Pod the framework runs, in order: PreFilter (compute Pod-wide info), Filter (exclude unsuitable Nodes), PostFilter (preemption if no Node passed), PreScore (compute scoring info), Score (rank survivors 0-100), NormalizeScore, Reserve (provisionally bind resources), Permit (delay the bind for gang scheduling), PreBind, Bind, and PostBind. Built-in plugins implement the obvious things: NodeAffinity, NodePorts, Resources, VolumeBinding, Taints, PodTopologySpread, InterPodAffinity, ImageLocality, and so on.

The Filter stage walks the Nodes in the cluster and asks each plugin: can this Pod run here? On large clusters this is the expensive step, so the scheduler implements an optimisation: it stops once it has found enough feasible Nodes, a threshold set by percentageOfNodesToScore in the scheduler configuration. Left unset, the value adapts to cluster size: about 50% at 100 nodes, tapering to about 10% at 5,000 nodes, with a floor of 5%, and clusters below roughly 100 nodes are always checked in full. The trade-off is that the picked Node may not be the optimal one, but on a 5,000-node cluster, scoring all 5,000 every time would dominate scheduler latency, and the marginal quality gain is small.
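The early-stop threshold is simple arithmetic. The sketch below is modelled on my reading of the adaptive default in recent kube-scheduler releases; the constants (start at 50%, subtract one point per 125 nodes, floor at 5%, never stop before 100 feasible nodes) are assumptions to check against the version you run, and nodesToScore is a hypothetical name.

```go
package main

import "fmt"

const (
	minFeasibleNodes = 100 // assumed: never stop searching before this many
	minPercentage    = 5   // assumed floor for the adaptive percentage
	startPercentage  = 50  // assumed starting point for small clusters
)

// nodesToScore returns how many feasible nodes the Filter stage looks for
// before it stops. configured == 0 means "use the adaptive default".
func nodesToScore(configured, totalNodes int) int {
	if totalNodes < minFeasibleNodes {
		return totalNodes // small cluster: always check everything
	}
	pct := configured
	if pct == 0 {
		pct = startPercentage - totalNodes/125
		if pct < minPercentage {
			pct = minPercentage
		}
	}
	if pct >= 100 {
		return totalNodes
	}
	n := totalNodes * pct / 100
	if n < minFeasibleNodes {
		return minFeasibleNodes
	}
	return n
}

func main() {
	for _, total := range []int{50, 100, 1000, 5000, 10000} {
		fmt.Printf("%5d nodes -> stop after %d feasible\n", total, nodesToScore(0, total))
	}
}
```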

The Score stage ranks the survivors. Each Score plugin returns 0-100 per Node; the framework sums weighted scores; the highest-scoring Node wins. Ties are broken randomly. The winning Node's name is written to the Pod via the Bind subresource:

# What the scheduler sends to bind a Pod to node-3
POST /api/v1/namespaces/prod/pods/web-7d8/binding HTTP/2

{
  "apiVersion": "v1",
  "kind": "Binding",
  "metadata": { "name": "web-7d8" },
  "target": {
    "apiVersion": "v1",
    "kind": "Node",
    "name": "node-3"
  }
}

# 201 Created. The Pod's spec.nodeName is now "node-3".
# A watch MODIFIED event fires. The kubelet on node-3 sees it.
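The Filter and Score stages that precede the bind can be sketched as two loops. All types and plugins below are hypothetical stand-ins for the framework's real interfaces, and one behaviour differs on purpose: the real scheduler breaks ties randomly, whereas this sketch keeps the first highest-scoring Node so the result is deterministic.

```go
package main

import "fmt"

type pod struct {
	CPUMilli        int
	Image           string
	ToleratesTaints bool
}

type node struct {
	Name         string
	FreeCPUMilli int
	Tainted      bool
	Images       map[string]bool
}

type filterPlugin func(p pod, n node) bool

type scorePlugin struct {
	Weight int
	Score  func(p pod, n node) int // expected range 0..100
}

func fitsCPU(p pod, n node) bool        { return n.FreeCPUMilli >= p.CPUMilli }
func toleratesTaint(p pod, n node) bool { return !n.Tainted || p.ToleratesTaints }

// freeCPUScore favours nodes with more CPU left after placing the pod.
func freeCPUScore(p pod, n node) int {
	s := (n.FreeCPUMilli - p.CPUMilli) / 100
	if s > 100 {
		s = 100
	}
	if s < 0 {
		s = 0
	}
	return s
}

// imageLocalityScore favours nodes that already hold the image.
func imageLocalityScore(p pod, n node) int {
	if n.Images[p.Image] {
		return 100
	}
	return 0
}

var filters = []filterPlugin{fitsCPU, toleratesTaint}
var scorers = []scorePlugin{{Weight: 1, Score: freeCPUScore}, {Weight: 1, Score: imageLocalityScore}}

// schedule filters, then scores; it returns false when no node is feasible.
func schedule(p pod, nodes []node) (string, bool) {
	best, bestScore, found := "", -1, false
nextNode:
	for _, n := range nodes {
		for _, f := range filters {
			if !f(p, n) {
				continue nextNode
			}
		}
		total := 0
		for _, s := range scorers {
			total += s.Weight * s.Score(p, n)
		}
		if total > bestScore {
			best, bestScore, found = n.Name, total, true
		}
	}
	return best, found
}

func sampleNodes() []node {
	return []node{
		{Name: "node-1", FreeCPUMilli: 500},
		{Name: "node-2", FreeCPUMilli: 4000},
		{Name: "node-3", FreeCPUMilli: 2000, Images: map[string]bool{"example.com/web:1.0": true}},
	}
}

func main() {
	name, ok := schedule(pod{CPUMilli: 1000, Image: "example.com/web:1.0"}, sampleNodes())
	fmt.Println(name, ok)
}
```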

Subresources like Bind are a useful pattern. They expose a narrow operation (“set nodeName”) without giving the caller permission to update the entire Pod. The scheduler's RBAC grants create on pods/binding, and that grant allows nothing else: it cannot mutate spec.containers or any other field. Status updates use a similar pattern via the status subresource: the kubelet has update on pods/status, and that is how it reports Pod readiness without being able to modify spec.

If no Node passes the Filter stage, the scheduler runs PostFilter, which may attempt preemption: identify lower-priority Pods on a candidate Node, mark them for eviction, and retry. If preemption also fails, the Pod stays Pending; a watcher fires when any Node's capacity changes (a Pod is deleted, a Node is added) and the scheduler retries automatically. This is why a cluster can absorb a surge of Pending Pods and drain them quickly when capacity arrives — the queue is event-driven, not polled.

Performance pointer — scheduler throughput on a healthy cluster is ~200 binds/second for the active scheduler instance (replicas are leader-elected, so only one schedules at a time). If you are scheduling a flood (a CronJob that fires 10,000 Pods, an HPA scale-up during an incident), watch the scheduler_scheduling_attempt_duration_seconds histogram. The dominant tail latency is almost always VolumeBinding for PVCs that need dynamic provisioning.

Hops 11 + 12 — kubelet sync loop, CRI calls, Ready.

The Pod is bound to node-3. The kubelet on node-3 has been holding open a streaming watch on /api/v1/pods?watch=true&fieldSelector=spec.nodeName=node-3 since boot, and within milliseconds it sees the event carrying the new nodeName. On this filtered watch the event arrives as ADDED rather than MODIFIED, because the Pod has only just begun to match the field selector. The Pod is now in the kubelet's working set. What happens next is the kubelet's SyncLoop, the heartbeat that owns the data plane.
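
As a rough illustration of what travels over that watch, the Python sketch below builds the query string and parses a newline-delimited JSON event stream. It assumes the JSON encoding (the kubelet itself typically negotiates protobuf), and the helper names and the sample event are made up for the example.

```python
import json
from typing import Iterable, Iterator, Tuple
from urllib.parse import urlencode


def watch_url(node_name: str) -> str:
    """Build the path and query for a Pod watch filtered to one Node."""
    query = urlencode({"watch": "true", "fieldSelector": f"spec.nodeName={node_name}"})
    return f"/api/v1/pods?{query}"


def parse_watch_stream(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event type, pod name) for each JSON watch event, one per line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        yield event["type"], event["object"]["metadata"]["name"]


sample = [
    '{"type":"ADDED","object":{"kind":"Pod","metadata":{"name":"web-7d8"},'
    '"spec":{"nodeName":"node-3"}}}',
]
events = list(parse_watch_stream(sample))
```

Note that urlencode percent-encodes the inner "=" of the field selector, which the api-server decodes back before evaluating it.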

The SyncLoop is event-driven rather than a fixed-rate poll, with a sync ticker of roughly one second as a backstop. Each iteration consumes events from four sources (the api-server watch, a file-watch on /etc/kubernetes/manifests/ for static pods, an HTTP-pull source for legacy installations, and a periodic re-sync timer) and computes the desired set of Pods the node should be running. It then walks the actual set of Pods (queried from the container runtime via CRI's ListPodSandbox), diffs the two, and emits SyncPod operations for each Pod that differs. SyncPod is itself a small state machine.
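
The desired-versus-actual diff at the heart of each iteration can be sketched as follows. This is a toy Python model: Pods are reduced to a UID mapped to a spec hash, and the operation names ("create", "update", "kill") are illustrative rather than the kubelet's own.

```python
from typing import Dict, List, Tuple


def plan_sync(desired: Dict[str, str], actual: Dict[str, str]) -> List[Tuple[str, str]]:
    """Diff desired and actual Pods (UID -> spec hash) into per-Pod operations."""
    ops: List[Tuple[str, str]] = []
    for uid in sorted(desired):
        if uid not in actual:
            ops.append(("create", uid))   # wanted but not running
        elif desired[uid] != actual[uid]:
            ops.append(("update", uid))   # running with a stale spec
    for uid in sorted(actual):
        if uid not in desired:
            ops.append(("kill", uid))     # running but no longer wanted
    return ops


ops = plan_sync(
    desired={"a": "h1", "b": "h2", "c": "h3"},
    actual={"b": "h2", "c": "old", "d": "h4"},
)
```

Pods whose desired and actual state already agree produce no operation, which is what keeps a steady-state node cheap.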

For a brand-new Pod, SyncPod runs roughly this sequence. First, ensure the Pod's network namespace exists by calling CRI RunPodSandbox; this triggers the CRI runtime (containerd, for example) to create a network namespace, exec the CNI plugin (Cilium / Calico / Flannel) to attach a veth and assign an IP, and start the pause container that holds the namespace open. Then, for each init container in order: pull the image (CRI PullImage) if it is not in the local cache, create the container (CRI CreateContainer), start it (CRI StartContainer), and wait for it to exit successfully. Then, for each regular container in spec order, without waiting for the previous one to become ready: pull, create, start, configure probes. Finally, update the Pod's status via PATCH /pods/.../status.
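
The ordering of those calls is easier to see with a recording fake. The Python sketch below does not talk to a real runtime; FakeRuntime and sync_new_pod are hypothetical names, the call labels simply mirror the CRI method names, and regular containers are started one after another as a simplification.

```python
from typing import Iterable, List, Set, Tuple


class FakeRuntime:
    """Records CRI-style calls instead of talking to a container runtime."""

    def __init__(self, cached_images: Iterable[str]) -> None:
        self.cached: Set[str] = set(cached_images)
        self.calls: List[str] = []

    def run_pod_sandbox(self, pod: str) -> None:
        self.calls.append(f"RunPodSandbox({pod})")

    def start(self, name: str, image: str) -> None:
        # Pull only on a cache miss, then create and start the container.
        if image not in self.cached:
            self.calls.append(f"PullImage({image})")
            self.cached.add(image)
        self.calls.append(f"CreateContainer({name})")
        self.calls.append(f"StartContainer({name})")


def sync_new_pod(
    rt: FakeRuntime,
    pod: str,
    init_containers: List[Tuple[str, str]],  # (container name, image)
    containers: List[Tuple[str, str]],
) -> List[str]:
    rt.run_pod_sandbox(pod)
    for name, image in init_containers:
        rt.start(name, image)
        # A real kubelet waits here until the init container exits successfully.
    for name, image in containers:
        rt.start(name, image)
    return rt.calls


rt = FakeRuntime(cached_images={"app:1"})
calls = sync_new_pod(rt, "web-7d8", [("migrate", "tools:1")], [("app", "app:1")])
```

Only the uncached image triggers a PullImage call, which is the whole difference between a sub-second start and a slow one.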

CRI is a gRPC service over a Unix domain socket — typically /run/containerd/containerd.sock. The full RPC surface is about thirty methods spanning sandbox lifecycle (RunPodSandbox, StopPodSandbox, RemovePodSandbox, ListPodSandbox), container lifecycle (CreateContainer, StartContainer, StopContainer, RemoveContainer, ListContainers), image management (PullImage, RemoveImage, ListImages, ImageStatus), exec/attach/port-forward, and stats. The protocol is defined in cri-api/pkg/apis/runtime/v1/api.proto and is the only thing the kubelet calls when it wants a container to exist.

CNI is even simpler: a binary in /opt/cni/bin/ that the runtime execs with a JSON config on stdin and parameters in environment variables, and which prints a JSON result containing the assigned IP on stdout. The kubelet itself does not call CNI; the runtime does, on the kubelet's behalf, during RunPodSandbox. The plug-in design is why you can swap your CNI in production by changing the CNI config file and rolling the nodes; nothing in Kubernetes itself has to change.
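
A sketch of that contract from the runtime's side, in Python. It builds the environment a runtime would pass for an ADD operation and extracts addresses from a result document; the helper names, the container ID, the netns path, and the sample result are all made up for illustration, and no plugin binary is actually executed.

```python
import json
from typing import Dict, List


def cni_add_env(container_id: str, netns: str, ifname: str = "eth0",
                cni_path: str = "/opt/cni/bin") -> Dict[str, str]:
    """Environment variables a runtime sets when invoking a plugin for ADD."""
    return {
        "CNI_COMMAND": "ADD",
        "CNI_CONTAINERID": container_id,
        "CNI_NETNS": netns,
        "CNI_IFNAME": ifname,
        "CNI_PATH": cni_path,
    }


def assigned_ips(result_json: str) -> List[str]:
    """Extract the assigned addresses from a plugin's stdout result."""
    result = json.loads(result_json)
    return [ip["address"] for ip in result.get("ips", [])]


# Hypothetical result a plugin might print on stdout.
sample_result = json.dumps({
    "cniVersion": "1.0.0",
    "interfaces": [{"name": "eth0", "sandbox": "/var/run/netns/demo"}],
    "ips": [{"address": "10.244.1.5/24", "gateway": "10.244.1.1", "interface": 0}],
})
env = cni_add_env("abc123", "/var/run/netns/demo")
ips = assigned_ips(sample_result)
```

Because the whole interface is environment variables plus JSON on stdin and stdout, any executable that honours it can serve as the plugin.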

Once the container is running and the readiness probe passes, the kubelet patches the Pod's status with conditions: [{ type: Ready, status: "True" }]. That patch fans out via the watch. The EndpointSlice controller, yet another reconciler in the controller-manager, sees the new Ready Pod, looks up which Service selectors match its labels, and adds the Pod's IP to the appropriate EndpointSlice. kube-proxy on every node sees the EndpointSlice update via its own watch and reprograms its iptables, IPVS, or nftables rules to include the new backend; an eBPF-based replacement such as Cilium updates its maps instead. Within another few hundred milliseconds, traffic destined for the Service ClusterIP starts being load-balanced to the new Pod. That is hop 12. The chain is complete.
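
The selector match that decides which Pods back a Service can be sketched like this. The Python model is a simplification under stated assumptions: the Pod class and function names are hypothetical, only equality-based selectors are handled, and it returns just the routable (Ready) addresses rather than a full EndpointSlice object.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Pod:
    name: str
    ip: str
    labels: Dict[str, str]
    ready: bool


def matches(selector: Dict[str, str], labels: Dict[str, str]) -> bool:
    """Equality-based match; an empty selector selects nothing in this model."""
    return bool(selector) and all(labels.get(k) == v for k, v in selector.items())


def ready_endpoints(selector: Dict[str, str], pods: Iterable[Pod]) -> List[str]:
    """IPs of Ready Pods whose labels satisfy the Service selector."""
    return sorted(p.ip for p in pods if p.ready and p.ip and matches(selector, p.labels))


pods = [
    Pod("web-7d8", "10.244.1.5", {"app": "web", "tier": "frontend"}, ready=True),
    Pod("web-9k2", "10.244.2.9", {"app": "web"}, ready=False),
    Pod("db-0", "10.244.3.4", {"app": "db"}, ready=True),
]
backends = ready_endpoints({"app": "web"}, pods)
```

Here only web-7d8 becomes a backend: web-9k2 matches the selector but is not Ready, and db-0 is Ready but does not match.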

Operational note: image pull is by far the most variable hop. A pre-pulled image starts in under a second; a cold pull of a multi-gigabyte image from a slow registry can take minutes. The remediations are lazy image pulling (eStargz images with the Stargz Snapshotter), node-local image caches (Spegel, Kraken), and cluster-local registry mirrors. Watch the kubelet's kubelet_image_pull_duration_seconds histogram; it is the single best signal for “why are my Pods slow to start”.

Keep going.

Pod scheduling, end to end

From pending Pod to running container, through the scheduler framework and the kubelet SyncLoop.

The controller pattern

Informers, listers, work queues, reconciliation. Pseudocode you can ship.

Architecture

The static map: control plane and data plane, ports, TLS, the watch model, leader election.

Back to the internals index

All twelve sub-pages — four live, eight planned — and the system on one canvas.
