
Kubernetes scheduler.

A new pod arrives. The scheduler has to pick which of N nodes will run it. It filters out the ones that can't (not enough RAM, wrong taints, anti-affinity violated), scores the survivors against several criteria, picks the winner, and writes the binding. Watch the whole dance with five candidate nodes.


Figure (step 1 of 5, phase: queue). Pipeline phases: QUEUE · FILTER · SCORE · BIND · DONE

Incoming pod: api-server-x9z · requests: 4 CPU · 16 GB RAM · priorityClass: default · no taints/tolerations

Candidate nodes:
node-1 · CPU 6/8 · RAM 28/32 GB · zone=us-east-1a · tier=compute
node-2 · CPU 4/16 · RAM 12/64 GB · zone=us-east-1b · tier=compute
node-3 · CPU 2/8 · RAM 8/32 GB · zone=us-east-1a · tier=gpu
node-4 · CPU 1/16 · RAM 6/64 GB · zone=us-east-1c · tier=compute
node-5 · CPU 8/32 · RAM 32/128 GB · zone=us-east-1b · tier=memory
New pod enters the scheduler queue

A user runs `kubectl apply -f deployment.yaml`. The API server stores the resulting pod with no node assignment. The scheduler, which watches for unassigned pods, picks this one up. Five nodes are candidates. The scheduler will now filter and score them.
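The watch-and-bind loop described above can be sketched in a few lines. This is a toy model, not the kube-scheduler's code: `Pod`, `schedule_pending`, and the `pick_node` callback are hypothetical names, and an in-memory list stands in for the API server.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Pod:
    name: str
    node_name: Optional[str] = None  # None means "not yet scheduled"


def schedule_pending(
    pods: list[Pod], pick_node: Callable[[Pod], Optional[str]]
) -> list[Pod]:
    """Bind every unassigned pod that pick_node can place; return the rest."""
    still_pending = []
    for pod in pods:
        if pod.node_name is not None:
            continue  # already bound; the kubelet on that node runs it
        node = pick_node(pod)
        if node is None:
            still_pending.append(pod)  # no feasible node: stays Pending
        else:
            pod.node_name = node  # the "binding": record the decision
    return still_pending


pods = [Pod("api-server-x9z"), Pod("already-running", node_name="node-5")]
# Hypothetical stand-in for filter + score: always answers "node-2".
pending = schedule_pending(pods, pick_node=lambda pod: "node-2")
print(pods[0].node_name, pending)
```

The point of the sketch is the division of labour: the loop only records a decision on the pod, and nothing in it starts a container.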

kube-scheduler
A control-plane component. Watches the API server for unscheduled pods, picks a node, writes the binding back. Doesn't actually run anything — kubelets on each node do that after seeing the binding.
Filter then Score
Two-phase decision. Filter: which nodes COULD run this pod? Score: of those, which is BEST? Both phases are extensible via "scheduler plugins."

Why two phases and not just one big score

Filter is cheap and binary: if a node literally cannot run this pod, eliminate it. Score is more expensive (call every scoring plugin, compute a weighted sum). Doing the cheap pass first prunes the candidate set so the expensive pass runs on fewer nodes. On a 5000-node cluster with a pod requesting GPU + specific zone, filter might cut the field down to 30 candidates; score then runs on those 30 in ~10 ms.
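A minimal sketch of the two phases, using the five nodes and the pod from the figure. Several things here are assumptions rather than anything the page states: the figure's "CPU 6/8" is read as 6 used out of 8, and the two scoring functions and their equal weights are hypothetical, loosely inspired by least-allocated and balanced-allocation scoring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    cpu_used: float
    cpu_total: float
    ram_used: float   # GB
    ram_total: float  # GB


# Assumption: "CPU 6/8" in the figure means 6 used out of 8 total.
NODES = [
    Node("node-1", 6, 8, 28, 32),
    Node("node-2", 4, 16, 12, 64),
    Node("node-3", 2, 8, 8, 32),
    Node("node-4", 1, 16, 6, 64),
    Node("node-5", 8, 32, 32, 128),
]
POD_CPU, POD_RAM = 4, 16  # requests of api-server-x9z


def fits(node, cpu, ram):
    """Filter: binary check that the node has room for the requests."""
    return (node.cpu_total - node.cpu_used >= cpu
            and node.ram_total - node.ram_used >= ram)


def least_allocated(node, cpu, ram):
    """Score 0..100: higher when more capacity stays free after placement."""
    cpu_left = (node.cpu_total - node.cpu_used - cpu) / node.cpu_total
    ram_left = (node.ram_total - node.ram_used - ram) / node.ram_total
    return 100 * (cpu_left + ram_left) / 2


def balanced(node, cpu, ram):
    """Score 0..100: higher when CPU and RAM utilisation end up similar."""
    cpu_util = (node.cpu_used + cpu) / node.cpu_total
    ram_util = (node.ram_used + ram) / node.ram_total
    return 100 * (1 - abs(cpu_util - ram_util))


# Hypothetical plugin list: (scoring function, weight).
PLUGINS = [(least_allocated, 1), (balanced, 1)]


def schedule(nodes, cpu, ram, plugins=PLUGINS):
    """Return (winner name or None, names of nodes that passed the filter)."""
    feasible = [n for n in nodes if fits(n, cpu, ram)]  # cheap pass first
    if not feasible:
        return None, []

    def total(node):
        return sum(weight * plugin(node, cpu, ram) for plugin, weight in plugins)

    winner = max(feasible, key=total)  # expensive pass on survivors only
    return winner.name, [n.name for n in feasible]


winner, feasible = schedule(NODES, POD_CPU, POD_RAM)
print(feasible, winner)
```

Under these assumptions node-1 is filtered out (2 CPU free against a 4 CPU request) and node-4 gets the highest weighted sum. Different plugins or weights would pick a different winner, which is exactly what the Score phase is for.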

Constraints you can actually use

nodeSelector: the simplest; pin to nodes with matching labels.
nodeAffinity: same, but with required/preferred rules and richer expressions.
podAffinity / podAntiAffinity: "schedule me near (or far from) pods matching this label selector."
Taints/Tolerations: nodes can repel pods unless they tolerate the taint.
topologySpreadConstraints: keep replicas spread evenly across zones.
resources.requests: minimum guaranteed CPU/RAM.
resources.limits: maximum allowed.

Together, you can express almost any placement policy without writing a custom scheduler.
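As an illustration, here is how several of these constraints might look together on one pod, written as a Python dict and dumped to JSON (which `kubectl apply -f` accepts alongside YAML). The field names are the standard Pod spec fields; the label values, the `dedicated=api` taint, and the image name are made up for the example, and "16Gi" stands in for the figure's "16 GB".

```python
import json

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "api-server-x9z", "labels": {"app": "api-server"}},
    "spec": {
        # nodeSelector: only nodes labelled tier=compute are eligible.
        "nodeSelector": {"tier": "compute"},
        # Toleration: allows (does not force) nodes tainted dedicated=api.
        "tolerations": [
            {"key": "dedicated", "operator": "Equal",
             "value": "api", "effect": "NoSchedule"}
        ],
        # Spread replicas of app=api-server evenly across zones.
        "topologySpreadConstraints": [
            {
                "maxSkew": 1,
                "topologyKey": "topology.kubernetes.io/zone",
                "whenUnsatisfiable": "DoNotSchedule",
                "labelSelector": {"matchLabels": {"app": "api-server"}},
            }
        ],
        # Anti-affinity: never two app=api-server pods on the same node.
        "affinity": {
            "podAntiAffinity": {
                "requiredDuringSchedulingIgnoredDuringExecution": [
                    {
                        "labelSelector": {"matchLabels": {"app": "api-server"}},
                        "topologyKey": "kubernetes.io/hostname",
                    }
                ]
            }
        },
        "containers": [
            {
                "name": "api",
                "image": "example.com/api-server:1.0",  # hypothetical image
                "resources": {
                    "requests": {"cpu": "4", "memory": "16Gi"},
                    "limits": {"cpu": "4", "memory": "16Gi"},
                },
            }
        ],
    },
}

manifest_json = json.dumps(pod, indent=2)
print(manifest_json)
```

Saving the printed JSON to a file and applying it would ask the scheduler to honour all four constraints at once; loosening any of them widens the set of nodes that survive the filter phase.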

When the scheduler can't place a pod

The pod stays in Pending. The scheduler records why in the pod's events (visible via `kubectl describe pod`, e.g. "0/5 nodes available, 3 had taint, 2 didn't match nodeSelector"). A cluster autoscaler, if one is running, can react by spinning up new nodes. Preemption is also available via PriorityClass: the scheduler evicts lower-priority pods to make room, so critical workloads stay running while batch jobs are the ones evicted.
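The preemption idea can be sketched as a victim-selection routine for a single node. This is a deliberately simplified model, considering CPU only and evicting lowest priority first; `RunningPod` and `pick_victims` are hypothetical names, and the real scheduler weighs more factors than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunningPod:
    name: str
    priority: int
    cpu: float


def pick_victims(running, cpu_free, incoming_cpu, incoming_priority):
    """Return pods to evict so the incoming pod fits, or None if it cannot."""
    # Only strictly lower-priority pods are eligible; lowest priority goes first.
    eligible = sorted(
        (p for p in running if p.priority < incoming_priority),
        key=lambda p: p.priority,
    )
    victims = []
    for pod in eligible:
        if cpu_free >= incoming_cpu:
            break  # enough room already; stop evicting
        victims.append(pod)
        cpu_free += pod.cpu
    return victims if cpu_free >= incoming_cpu else None


running = [
    RunningPod("batch-a", priority=0, cpu=2),
    RunningPod("batch-b", priority=10, cpu=2),
    RunningPod("critical-db", priority=1000, cpu=3),
]
victims = pick_victims(running, cpu_free=1, incoming_cpu=4, incoming_priority=100)
print([v.name for v in victims])
```

Here the two batch pods are chosen and `critical-db` is never a candidate, because its priority is higher than the incoming pod's. If evicting every eligible pod still would not free enough CPU, the function returns `None` and the pod would stay Pending.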

Go deeper

Scheduling internals →

The scheduling framework, custom schedulers, multi-scheduler setups, descheduler patterns, why scheduling is harder than it looks at 5000+ nodes.
