Eleven extension points,
one bind subresource.
The Kubernetes scheduler is not a single algorithm; it is a generic framework that runs eleven ordered hooks against every pending Pod, with the default scheduler being one particular set of plugins wired into those hooks. Understanding the framework — what each extension point may do, what it costs, and where preemption fits — is the difference between treating the scheduler as a black box and being able to bend it to a new constraint.
Roughly 4,400 words. Pair it with the pod lifecycle sub-page for what happens after Bind, and the rollout simulator for the visceral version of the placement decisions described here.
Why scheduling is a separate binary at all.
The kube-scheduler is its own process — not a goroutine inside the api-server, not a library inside the controller-manager, but an independent binary running under its own leader-elected lease at kube-system/kube-scheduler. That choice looks frivolous on day one (one more thing to deploy, one more thing to upgrade, one more lease to lose) and inevitable on day five hundred (the only realistic way to swap, replace, or augment the placement algorithm without forking the api-server). Every architectural property of the scheduler we care about flows from this decoupling, so it is worth being explicit about why the boundary exists.
The first reason is correctness. Scheduling decisions are derived from a snapshot of cluster state — every Node's allocatable capacity, every Pod's resource request, every taint and topology label — and producing a coherent decision requires that snapshot to stay coherent for the duration of the decision. If scheduling lived inside the api-server, every other request handler (admission, conversion, watch fan-out) would compete with it for memory, CPU, and lock contention on shared caches. By splitting it out, the scheduler maintains its own snapshot, refreshed from a watch, and runs in a single goroutine that is not perturbed by the api-server's request volume.
The second reason is replaceability. The api-server is non-negotiable: every component in the cluster talks to it, and you cannot run two of them at once with different schemas without breaking everything. The scheduler, in contrast, is reached only by setting one field on a Pod — spec.schedulerName — which means you can run the default scheduler alongside Volcano (gang scheduling for ML training), alongside Yunikorn (capacity-aware fair scheduling), alongside a tiny bespoke scheduler your team wrote for one workload class. Each binary owns the Pods that name it. They do not need to agree, communicate, or know about each other.
The third reason is observability. Because the scheduler's only outputs are Bind requests and Pod-status updates, every decision it makes is recorded in the api-server's audit log and as a Kubernetes Event on the Pod. You can grep for FailedScheduling across a cluster and reconstruct, with no access to the scheduler process at all, why every unscheduled Pod is unscheduled. If the scheduler had been a library inside the api-server, you would be looking at api-server logs instead, multiplexed with every other request, and a single misbehaving plugin would be invisible against the noise floor.
The fourth reason is upgrade safety. The scheduler binary changes more often than the api-server; new plugins are added, scoring formulas are adjusted, the framework grows new extension points. By decoupling, you can roll the scheduler independently — a canary deploy of a new scheduler version sees only its own Pods, fails alone, and rolls back alone. In a monolithic design every scheduler change would be coupled to an api-server change, and the blast radius of a regression would be the whole control plane.
One last property is non-obvious: the scheduler does not need write permission on most of the cluster's state. Its RBAC is unusually narrow — it watches Pods, Nodes, PVCs, PVs, StorageClasses, CSINodes, and a few others; it writes only the /binding subresource on Pods and the status.nominatedNodeName field for preemption. Compromising the scheduler does not give you cluster-admin. That is a deliberate consequence of pushing it out into its own process and giving it the smallest possible ServiceAccount.
Operational consequence: when you upgrade kube-scheduler from 1.29 to 1.30, only Pods named by the upgraded binary are affected. A regression in NodeResourcesBalancedAllocation scoring shows up as bad placement on those Pods alone, never as failed admission or 5xx on the api-server. That is the value of the process boundary.
The framework — eleven hooks in a fixed order.
The scheduling framework, introduced as KEP-624 and fully on by default since 1.19, replaced the older predicates-and-priorities model with a single ordered chain of extension points. Each plugin in the chain implements one or more of those points; the scheduler's main loop is a generic walker over the chain, asking every plugin at every step whether it has work to do. The plugins know nothing about each other; they communicate, when they must, through a per-cycle CycleState bag that survives from PreFilter through PostBind.
The cycle splits naturally into two phases, a scheduling cycle and a binding cycle, because the scheduling cycle must be serialised (it mutates the node-resource accounting cache that subsequent cycles read), while the binding cycle can run concurrently in a separate goroutine (it just talks to the api-server and waits for the bind to commit). Strictly, only PreBind, Bind, and PostBind run in the binding cycle: Reserve and Permit are invoked at the tail of the scheduling cycle, and it is the wait on a Pod parked by Permit that moves into the binding goroutine. Everything from PreFilter through NormalizeScore runs in the scheduling cycle as well. Because Reserve and Permit exist only to commit an already-chosen Node, this page discusses them together with the binding hooks. Drawing the boundary in the wrong place is one of the most common mistakes when writing a custom plugin: holding state in PreFilter that you only use in PreBind means you are paying for synchronous bookkeeping during the cycle's bottleneck.
| Extension point | Verdict shape | What it is for |
|---|---|---|
| PreFilter | cycle-fatal | Pre-process the Pod, build per-pod state, short-circuit if hopeless |
| Filter | per-node yes/no | Predicate. Run once per candidate node. Survivors form the feasible set |
| PostFilter | fires only on filter failure | Last-chance. Default plugin here implements preemption |
| PreScore | cycle-shared state | Build state every Score call needs (e.g. topology counts) |
| Score | per-node 0–100 | Rank survivors. Higher is better. Plugins are weight-summed |
| NormalizeScore | rescale | Same plugin can rewrite its own scores into a comparable scale |
| Reserve | tentative claim | Update in-memory cache as if Pod were bound. Rolled back by Unreserve on failure |
| Permit | gate | Approve, deny, or wait. Gang scheduling and batch coordination live here |
| PreBind | side-effect | Provision a volume, attach a device, pre-warm a sandbox |
| Bind | commit | POST to /pods/{name}/binding. Default impl is the only one most clusters need |
| PostBind | fire-and-forget | Cleanup, telemetry, async housekeeping. Cannot fail the bind |
The order is fixed and meaningful. PreFilter runs once per cycle and may cancel the cycle entirely if its preflight check fails. For example, the VolumeBinding plugin rejects a Pod at PreFilter when the Pod references a PersistentVolumeClaim that does not exist, because no Filter result can change that verdict. Filter then runs once per candidate node, in parallel across worker goroutines. Survivors go to PreScore (one call) and Score (one call per survivor). PostFilter is the only point that runs only when Filter produced an empty set; its job is to do something rather than fail, and "something" is almost always preemption.
Reserve, Permit, PreBind, Bind, and PostBind are gates and side-effects on a single already-chosen node. Reserve mutates the scheduler's in-memory cache to claim the node's resources before the api-server has confirmed the bind, so the next cycle's NodeResourcesFit does not double-spend the same CPU. Permit can wait — gang scheduling uses this hook to delay binding the first Pod of a job until enough siblings have also been chosen. PreBind does work the bind is contingent on (a volume provisioned, a device attached) and may fail, in which case Reserve runs in reverse via Unreserve.
Throughout the cycle, plugins read from and write to a per-Pod CycleState map. NodeAffinity's PreFilter parses the Pod's required and preferred terms once and stashes them as a typed value; Filter and Score then retrieve the parsed terms by key without re-parsing per node. This is the framework's answer to the central performance question of scheduling: never repeat per-node what you can compute per-Pod.
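The per-Pod versus per-node split can be sketched without the real framework types. The block below is a minimal stand-in, not the framework's API: `cycleState`, `parsedAffinity`, `affinityKey`, and the comma-separated zone format are all hypothetical names chosen for illustration. It shows the expensive parse running once while the per-node check is only a map lookup.

```go
package main

import (
	"fmt"
	"strings"
)

// cycleState is a hypothetical stand-in for the framework's per-cycle bag.
type cycleState map[string]any

// parsedAffinity is what a PreFilter-style step computes once per Pod.
type parsedAffinity struct{ requiredZones map[string]bool }

const affinityKey = "example/affinity"

// parseCalls counts how often the expensive parse runs.
var parseCalls int

// preFilter parses the Pod's raw zone list once and stores the typed result.
func preFilter(state cycleState, rawZones string) {
	parseCalls++
	zones := map[string]bool{}
	for _, z := range strings.Split(rawZones, ",") {
		zones[strings.TrimSpace(z)] = true
	}
	state[affinityKey] = &parsedAffinity{requiredZones: zones}
}

// filter runs per node and only looks up the value stored by preFilter.
func filter(state cycleState, nodeZone string) bool {
	a, ok := state[affinityKey].(*parsedAffinity)
	if !ok {
		return false // state missing: treat the node as infeasible
	}
	return a.requiredZones[nodeZone]
}

// feasibleZones runs one cycle: one parse, then one cheap check per node.
func feasibleZones(rawZones string, nodeZones []string) []string {
	state := cycleState{}
	preFilter(state, rawZones)
	var out []string
	for _, z := range nodeZones {
		if filter(state, z) {
			out = append(out, z)
		}
	}
	return out
}

func main() {
	fmt.Println(feasibleZones("eu-1a, eu-1b", []string{"eu-1a", "eu-1c", "eu-1b"}))
	fmt.Println("parse calls:", parseCalls)
}
```

The real CycleState is typed and concurrency-safe; the point here is only the shape: one write keyed by a constant, many reads.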
The order of plugins within a single extension point is also configurable, and matters. If your custom Filter is cheap and rejects 90% of nodes, put it first; if it is expensive and rarely rejects, put it last. The KubeSchedulerConfiguration's enabled array preserves declaration order.
PreFilter and Filter — building the feasibility set.
The scheduler's first job for any Pod is to determine which Nodes are even possible targets — the feasibility set. Everything downstream (scoring, preemption, binding) operates on this set, and getting it right is both correctness-critical (a wrong inclusion means the Pod gets bound to a Node it cannot run on, and the kubelet rejects it) and latency-critical (a feasibility check that scales linearly with cluster size becomes the bottleneck on a fleet of five thousand nodes). The framework splits the work into two hooks for that reason: PreFilter, called once, computes anything per-Pod that the Filter calls will need; Filter, called per node, runs the cheap predicate.
The default plugins that contribute Filter checks read like a checklist of physical and logical reasons a Pod cannot land on a Node. NodeResourcesFit refuses any Node whose allocatable CPU, memory, or ephemeral-storage minus its already-reserved sum is below the Pod's requests. NodeAffinity refuses any Node that fails to match a required affinity term (preferred terms are scoring-only). NodePorts refuses any Node on which a hostPort the Pod requests is already taken by another Pod. NodeUnschedulable refuses any Node with spec.unschedulable=true set, which is what kubectl cordon does. TaintToleration refuses any Node whose NoSchedule or NoExecute taints are not tolerated by the Pod. PodTopologySpread refuses any Node where placing the Pod would violate a hard whenUnsatisfiable: DoNotSchedule constraint.
These checks are independent. The framework is not a chain in the sense of "plugin A's output feeds plugin B"; every plugin gets the same Pod and the same Node and the same CycleState, and every plugin must agree before a Node is feasible. The framework evaluates Nodes in parallel across worker goroutines (16 by default, set by the parallelism field of KubeSchedulerConfiguration) and aggregates the verdicts. Within one Node the plugins run in their configured order; the Node fails feasibility on the first NO from any plugin, and the framework short-circuits the rest of that Node's evaluation. Failure reasons are collected as per-plugin diagnostics, surfaced on the Pod's FailedScheduling Event in the form "0/5 nodes available: 3 Insufficient cpu, 2 node(s) didn't match Pod's node affinity". The aggregation by reason is what makes that message useful for triage.
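The aggregation step is easy to picture in code. The following is a sketch under stated assumptions rather than the scheduler's own implementation: `summarize` is a hypothetical helper, it takes one rejection reason per Node, and the ordering (most frequent reason first) is chosen here only to make the output deterministic.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// summarize groups per-node rejection reasons into one Event-style message.
// reasons maps node name to the first reason that rejected that node.
func summarize(totalNodes int, reasons map[string]string) string {
	counts := map[string]int{}
	for _, r := range reasons {
		counts[r]++
	}
	keys := make([]string, 0, len(counts))
	for r := range counts {
		keys = append(keys, r)
	}
	// Most frequent reason first, then alphabetical (an assumption of this sketch).
	sort.Slice(keys, func(i, j int) bool {
		if counts[keys[i]] != counts[keys[j]] {
			return counts[keys[i]] > counts[keys[j]]
		}
		return keys[i] < keys[j]
	})
	parts := make([]string, 0, len(keys))
	for _, r := range keys {
		parts = append(parts, fmt.Sprintf("%d %s", counts[r], r))
	}
	feasible := totalNodes - len(reasons)
	return fmt.Sprintf("%d/%d nodes available: %s", feasible, totalNodes, strings.Join(parts, ", "))
}

func main() {
	fmt.Println(summarize(5, map[string]string{
		"node-1": "Insufficient cpu",
		"node-2": "Insufficient cpu",
		"node-3": "Insufficient cpu",
		"node-4": "node(s) didn't match Pod's node affinity",
		"node-5": "node(s) didn't match Pod's node affinity",
	}))
}
```

Grouping by reason rather than by Node keeps the message short on large clusters, where listing five thousand individual rejections would be useless.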
Now consider scale. On a five-thousand-node cluster, running every Filter plugin against every Node is real work — NodeResourcesFit's PreFilter alone has to walk the Pod's containers and aggregate requests, and Filter then does an arithmetic compare against the Node's snapshot. Empirically, evaluating five thousand nodes through the default chain is on the order of fifty to one hundred milliseconds, which dominates per-cycle latency and caps cluster-wide scheduling throughput at maybe ten to twenty Pods per second. The framework's answer is percentageOfNodesToScore, covered in the next section, which short-circuits the feasibility loop once enough feasible Nodes have been found.
```go
// pkg/scheduler/framework/plugins/noderesources/fit.go (abridged)

// PreFilter runs once per cycle. It pre-computes the Pod's resource requirements.
func (f *Fit) PreFilter(ctx context.Context, state *framework.CycleState, pod *v1.Pod) (*framework.PreFilterResult, *framework.Status) {
	state.Write(preFilterStateKey, &preFilterState{
		skip:            podHasNoRequests(pod),
		Resource:        computePodRequests(pod),
		ScalarResources: computeScalars(pod),
	})
	return nil, framework.NewStatus(framework.Success)
}

// Filter runs per node. It does the cheap arithmetic compare.
func (f *Fit) Filter(ctx context.Context, state *framework.CycleState, pod *v1.Pod, n *framework.NodeInfo) *framework.Status {
	s, err := getPreFilterState(state)
	if err != nil {
		// PreFilter did not run or wrote an unexpected type.
		return framework.AsStatus(err)
	}
	if s.skip {
		return nil
	}
	insufficient := fitsRequest(s, n, f.ignoredResources)
	if len(insufficient) > 0 {
		return framework.NewStatus(framework.Unschedulable, insufficient...)
	}
	return nil
}
```
The split is the optimisation. Without PreFilter, every Filter call would need to walk the Pod's containers to aggregate requests; with it, that work is done once and the per-Node path is a few arithmetic compares. Multiply by five thousand Nodes and the saving is measurable. Every well-written plugin follows the same pattern: anything Pod-shaped goes in PreFilter, anything Node-shaped goes in Filter, the bridge is a typed value in CycleState.
PodTopologySpread is the most architecturally interesting of the default Filters because its predicate cannot be evaluated by looking at the candidate Node alone — it depends on how many sibling Pods are already on every Node sharing a topology key (zone, hostname, rack). Its PreFilter walks the entire current placement to count siblings per topology domain; its Filter then asks, for the candidate, "would this placement push the Pod count on this domain past the skew limit?" This is why PodTopologySpread is the default plugin most likely to get expensive on large clusters, and why operators sometimes pin its constraints to a coarser key (zone instead of hostname) to keep the count cheap.
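The skew question can be written down in a few lines. This is a simplified sketch, not the plugin's code: `domainCounts` and `fitsSkew` are hypothetical helpers, label selectors and multiple constraints are ignored, and the check assumed is "siblings in the candidate's domain, plus the incoming Pod, minus the emptiest domain's count, must not exceed maxSkew".

```go
package main

import "fmt"

// domainCounts is the once-per-Pod pass: it walks every node and totals
// the matching sibling Pods per topology domain (zone, rack, hostname).
func domainCounts(siblingsOnNode map[string]int, nodeDomain map[string]string) map[string]int {
	counts := map[string]int{}
	for node, domain := range nodeDomain {
		counts[domain] += siblingsOnNode[node] // empty domains still get an entry
	}
	return counts
}

// fitsSkew is the per-node check: would placing one more Pod in this
// domain push it more than maxSkew above the emptiest domain?
func fitsSkew(counts map[string]int, domain string, maxSkew int) bool {
	lowest, first := 0, true
	for _, c := range counts {
		if first || c < lowest {
			lowest, first = c, false
		}
	}
	return counts[domain]+1-lowest <= maxSkew
}

func main() {
	nodeDomain := map[string]string{"n1": "zone-a", "n2": "zone-a", "n3": "zone-b"}
	counts := domainCounts(map[string]int{"n1": 2, "n2": 1, "n3": 1}, nodeDomain)
	fmt.Println("zone-a ok:", fitsSkew(counts, "zone-a", 1))
	fmt.Println("zone-b ok:", fitsSkew(counts, "zone-b", 1))
}
```

With three siblings in zone-a and one in zone-b, only zone-b accepts another Pod at maxSkew 1. The cost asymmetry is visible in the shape: `domainCounts` touches every node, `fitsSkew` touches only the domain table, which is why a coarser topology key makes the table smaller and the check cheaper.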
A common gotcha — Filter plugins must be stateless across Pods. The framework reuses the same plugin instance for every Pod, in parallel cycles, and storing per-Pod data on the plugin struct rather than in CycleState is a race waiting to happen. The compiler will not catch this; integration tests and chaos under load will.
Score, NormalizeScore, and the percentageOfNodesToScore knob.
Once Filter has produced a feasibility set, the scheduler ranks the survivors. Scoring is where placement quality lives: the difference between binding to the most-loaded acceptable Node (a bin-packing scheduler) and binding to the least-loaded acceptable Node (a balanced scheduler) is one constant in one Score plugin, and that constant is the difference between an autoscaler that scales down predictably and one that does not. Scoring is purely advisory at the framework level: every plugin returns an integer 0–100 per Node, the framework multiplies by a per-plugin weight, sums across plugins, and binds to the Node with the highest total. Ties are broken by choosing at random among the top-scoring Nodes.
The default Score plugins each express one preference. NodeResourcesFit, in its default LeastAllocated mode, prefers Nodes with more headroom — its score is 100 minus the average of CPU and memory utilisation. (It also supports MostAllocated for bin-packing and RequestedToCapacityRatio for arbitrary curves.) ImageLocality prefers Nodes that already have the Pod's container images on disk — non-trivial because pulling a 5GB container image takes tens of seconds and dominates start-up latency for big workloads. PodTopologySpread, in its scoring mode, prefers Nodes whose topology domain has fewer existing siblings, which evens out distribution within hard constraints. NodeAffinity's preferred terms add weighted bonuses. TaintToleration's PreferNoSchedule taint subtracts weighted penalties.
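The two resource strategies differ by one subtraction. The sketch below assumes equal weights for CPU and memory and integer arithmetic; `freeShare`, `leastAllocated`, and `mostAllocated` are hypothetical names, and the requested amounts are taken to include the incoming Pod.

```go
package main

import "fmt"

// freeShare returns the unrequested fraction of a resource as 0-100.
func freeShare(requested, allocatable int64) int64 {
	if allocatable == 0 || requested > allocatable {
		return 0
	}
	return (allocatable - requested) * 100 / allocatable
}

// leastAllocated favours nodes with more headroom (balanced placement).
func leastAllocated(cpuReq, cpuAlloc, memReq, memAlloc int64) int64 {
	return (freeShare(cpuReq, cpuAlloc) + freeShare(memReq, memAlloc)) / 2
}

// mostAllocated favours nodes that are already full (bin-packing).
func mostAllocated(cpuReq, cpuAlloc, memReq, memAlloc int64) int64 {
	used := func(req, alloc int64) int64 {
		if alloc == 0 {
			return 0
		}
		if req > alloc {
			return 100
		}
		return req * 100 / alloc
	}
	return (used(cpuReq, cpuAlloc) + used(memReq, memAlloc)) / 2
}

func main() {
	// 1000m of 4000m CPU and 2 of 8 GiB memory requested.
	fmt.Println(leastAllocated(1000, 4000, 2, 8), mostAllocated(1000, 4000, 2, 8))
}
```

A node that is one quarter full scores 75 under the balanced strategy and 25 under bin-packing, which is the whole difference between the two placement behaviours.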
NormalizeScore exists because the natural output of a plugin is rarely on the same scale as another plugin's. PodTopologySpread, for example, computes a raw skew that can be 0 to thousands; its NormalizeScore rewrites the per-Node values so the worst Node is 0 and the best is 100. Without that step, a Score plugin with a wider natural range would dominate the sum, and tuning weights would be impossible. The framework calls NormalizeScore once per plugin, after all per-Node Scores are in, and the plugin can rewrite the []NodeScore slice in place.
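A sketch of the rescale-then-sum step, under assumptions: `nodeScore`, `normalizeInverse`, and `weightedTotal` are illustrative names, the raw value is treated as "lower is better", and a missing weight defaults to 1. The real PodTopologySpread normalisation is more involved; this shows only the min-max idea.

```go
package main

import "fmt"

// nodeScore pairs a node with one plugin's score for it.
type nodeScore struct {
	Name  string
	Score int64
}

// normalizeInverse rewrites raw lower-is-better values in place so the
// best node gets 100 and the worst gets 0.
func normalizeInverse(scores []nodeScore) {
	if len(scores) == 0 {
		return
	}
	lo, hi := scores[0].Score, scores[0].Score
	for _, s := range scores {
		if s.Score < lo {
			lo = s.Score
		}
		if s.Score > hi {
			hi = s.Score
		}
	}
	for i := range scores {
		if hi == lo {
			scores[i].Score = 100 // all nodes equal: none is penalised
			continue
		}
		scores[i].Score = (hi - scores[i].Score) * 100 / (hi - lo)
	}
}

// weightedTotal multiplies each plugin's normalised score by its weight
// and sums the results for one node.
func weightedTotal(pluginScores, weights map[string]int64) int64 {
	var total int64
	for plugin, score := range pluginScores {
		w, ok := weights[plugin]
		if !ok {
			w = 1
		}
		total += score * w
	}
	return total
}

func main() {
	skew := []nodeScore{{"node-a", 0}, {"node-b", 500}, {"node-c", 2000}}
	normalizeInverse(skew)
	fmt.Println(skew)
	fmt.Println(weightedTotal(
		map[string]int64{"NodeResourcesFit": 88, "ImageLocality": 100},
		map[string]int64{"ImageLocality": 2},
	))
}
```

Without the rescale, the raw value of 2000 would swamp any plugin that reports on a 0 to 100 scale, no matter how the weights were set.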
```text
// kube-scheduler -v=4 --- one cycle, default plugins, 18 candidates
I0503 12:14:08.421 scheduling Pod default/web-7d8 schedulerName=default-scheduler
I0503 12:14:08.422 nodes feasible: 18 / 240 (NodeResourcesFit kept 200, NodeAffinity kept 60, PodTopologySpread kept 18)
I0503 12:14:08.422 scoring 18 nodes
I0503 12:14:08.423 scoring node-019 NodeResourcesFit=88 ImageLocality=100 PodTopologySpread=72 TaintToleration=100 :: total 90 (winner)
I0503 12:14:08.423 scoring node-217 NodeResourcesFit=82 ImageLocality=  0 PodTopologySpread=78 TaintToleration=100 :: total 65
I0503 12:14:08.423 reserve node-019
I0503 12:14:08.424 permit node-019 approved by all 6 plugins
I0503 12:14:08.424 prebind node-019 VolumeBinding ok
I0503 12:14:08.425 bind POST /api/v1/namespaces/default/pods/web-7d8/binding nodeName=node-019 201
I0503 12:14:08.426 postbind metrics emitted
```

That brings us to percentageOfNodesToScore, the most consequential single knob in scheduler configuration. By default it is set adaptively (50% on a hundred-node cluster, falling linearly to 10% on a five-thousand-node cluster, with a floor of 5%), and what it controls is how many feasible Nodes the scheduler collects before it stops filtering and starts scoring. If the value is 5, the scheduler walks Nodes in a deterministic-but-shifted order, runs Filter against each, and the moment it has accumulated feasible Nodes equal to 5% of cluster size (or one hundred, whichever is larger), it stops filtering and proceeds to scoring on just those. Score does not need to see every Node; it needs to see enough Nodes to find a good one.
The trade-off is sharper than it looks. Set it too high (or 100), and on a large cluster the scheduler spends 80% of its cycle in Filter and binds Pods at single-digit per-second rates. Set it too low (1), and on a heterogeneous cluster the scheduler may consistently miss the best Node — for example, the one Node that already has the image cached. The scheduler also rotates the start-of-walk index across cycles so that, over time, every Node gets considered, mitigating the worst-case unfairness. For typical workloads, the default (adaptive) is right; the time you change it is when you have measured scheduler latency dominating Pod start-up time.
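The stopping rule can be sketched as a single function. The block below models the upstream behaviour as documented, assuming a linear adaptive default of 50 minus one percentage point per 125 Nodes, a 5% floor, and a minimum of one hundred feasible Nodes; the function and constant names are illustrative, and a percentage of 0 is taken to mean "adaptive".

```go
package main

import "fmt"

const (
	minFeasibleNodes = 100 // never stop before this many feasible nodes
	minPercentage    = 5   // floor for the adaptive percentage
)

// numFeasibleNodesToFind returns how many feasible nodes Filter should
// collect before scoring starts. percentage == 0 selects the adaptive default.
func numFeasibleNodesToFind(percentage, numAllNodes int32) int32 {
	if numAllNodes < minFeasibleNodes || percentage >= 100 {
		return numAllNodes // small cluster or knob disabled: look at everything
	}
	if percentage <= 0 {
		percentage = 50 - numAllNodes/125
		if percentage < minPercentage {
			percentage = minPercentage
		}
	}
	n := numAllNodes * percentage / 100
	if n < minFeasibleNodes {
		return minFeasibleNodes
	}
	return n
}

func main() {
	for _, size := range []int32{50, 100, 1000, 5000, 10000} {
		fmt.Printf("%5d nodes -> stop after %d feasible\n", size, numFeasibleNodesToFind(0, size))
	}
}
```

Under these assumptions a five-thousand-node cluster stops after 500 feasible Nodes, and setting the knob to 100 restores the exhaustive walk.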
Diagnostic — scheduler_pending_pods, scheduler_scheduling_attempt_duration_seconds, and scheduler_framework_extension_point_duration_seconds{ extension_point="Filter" } in the scheduler's Prometheus output are the three series that tell you whether you have a percentageOfNodesToScore problem. If Filter dominates the latency histogram, lower the percentage.
Reserve, Permit, PreBind, Bind, PostBind — the binding cycle.
By the time scoring is done, the scheduler knows exactly which Node a Pod will land on. The remaining hooks are concerned with making that landing actually happen — synchronising the in-memory cache, gating on external conditions, performing pre-bind side-effects, calling the api-server's bind subresource, and emitting telemetry. Most clusters never need to customise any of these. They are the framework's quietest extension points precisely because the defaults are correct and the costs of getting them wrong are high.
Reserve runs first. Its job is to mutate the scheduler's in-memory snapshot as if the Pod were already bound — decrementing the chosen Node's allocatable CPU and memory, incrementing the topology domain's sibling count, marking any reserved volumes — so that the next scheduling cycle sees the correct accounting. This is necessary because the api-server's confirmation of the bind is asynchronous: the scheduler issues the Bind, and if it waited for the watch event to come back round before scheduling the next Pod, it would serialize all decisions on a network round trip. Reserve gives the cache the synchronous update; the watch event later just confirms it.
Reserve is paired with Unreserve, called only on failure. If a later Reserve plugin, Permit, PreBind, or Bind fails, the framework calls Unreserve on every Reserve plugin in the reverse of Reserve order, undoing the cache mutation. Because that can include a plugin whose own Reserve never ran, Unreserve must be idempotent. This is the only place in the framework that has explicit transaction semantics; the rest is best-effort and idempotent.
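The rollback can be sketched with a toy cache. This is an illustration under assumptions, not framework code: `reservePlugin`, `cpuPlugin`, `failingPlugin`, and `runReserve` are hypothetical, the cache is a plain map of free CPU per Node, and Unreserve is assumed to be invoked on all Reserve plugins in reverse order, as upstream documents.

```go
package main

import (
	"errors"
	"fmt"
)

// reservePlugin is a hypothetical stand-in for a Reserve plugin.
type reservePlugin struct {
	name      string
	reserve   func(free map[string]int) error
	unreserve func(free map[string]int)
}

// cpuPlugin claims cpu on a node and remembers whether it holds the claim,
// so unreserve is safe to call even if reserve never ran.
func cpuPlugin(node string, cpu int) reservePlugin {
	held := false
	return reservePlugin{
		name: "cpu",
		reserve: func(free map[string]int) error {
			if free[node] < cpu {
				return errors.New("insufficient cpu")
			}
			free[node] -= cpu
			held = true
			return nil
		},
		unreserve: func(free map[string]int) {
			if held {
				free[node] += cpu
				held = false
			}
		},
	}
}

// failingPlugin always refuses; it models a claim that cannot be made.
func failingPlugin(name string) reservePlugin {
	return reservePlugin{
		name:      name,
		reserve:   func(map[string]int) error { return errors.New(name + " unavailable") },
		unreserve: func(map[string]int) {},
	}
}

// runReserve calls reserve in order. On the first error it calls unreserve
// on every plugin in reverse order and returns the trace of calls made.
func runReserve(free map[string]int, plugins []reservePlugin) ([]string, error) {
	var trace []string
	for _, p := range plugins {
		if err := p.reserve(free); err != nil {
			for i := len(plugins) - 1; i >= 0; i-- {
				plugins[i].unreserve(free)
				trace = append(trace, "unreserve:"+plugins[i].name)
			}
			return trace, err
		}
		trace = append(trace, "reserve:"+p.name)
	}
	return trace, nil
}

func main() {
	free := map[string]int{"node-019": 4}
	trace, err := runReserve(free, []reservePlugin{cpuPlugin("node-019", 2), failingPlugin("volume")})
	fmt.Println(trace, err, "free cpu:", free["node-019"])
}
```

After the failed cycle the Node's free CPU is back to its starting value, which is the property the next scheduling cycle depends on.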
Permit is where gang scheduling lives. A Permit plugin can return one of three verdicts: Success (proceed to PreBind), Wait (park this Pod for up to a configurable timeout, releasing the cycle goroutine to schedule other Pods), or Reject (abort, run Unreserve, requeue the Pod). The Wait verdict is the interesting one. A gang-scheduling Permit plugin, such as Coscheduling from the kubernetes-sigs scheduler-plugins project, uses it to delay binding the first Pod of a job until enough siblings have also been chosen; only then does it call Allow on every parked Pod, and they all bind in a coordinated burst. This is the framework's solution to coordinated batch placement without having to invent a separate two-phase commit.
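The park-then-release logic fits in a small state machine. The sketch below is a single-threaded illustration with hypothetical names (`gang`, `permit`, `verdict`); it omits the timeout and the Reject path, and it is not the code of any real gang scheduler.

```go
package main

import "fmt"

type verdict int

const (
	success verdict = iota
	wait
)

// gang parks Pods per job until minMembers of them have reached Permit.
type gang struct {
	minMembers int
	waiting    map[string][]string // job name -> parked Pod names
}

// permit returns the verdict for pod, plus every Pod released by this call.
// Until quorum is reached the Pod is parked and nothing is released.
func (g *gang) permit(job, pod string) (verdict, []string) {
	parked := append(g.waiting[job], pod)
	if len(parked) < g.minMembers {
		g.waiting[job] = parked
		return wait, nil
	}
	// Quorum reached: allow every parked sibling together with this Pod.
	delete(g.waiting, job)
	return success, parked
}

func main() {
	g := &gang{minMembers: 3, waiting: map[string][]string{}}
	for _, p := range []string{"train-0", "train-1", "train-2"} {
		v, released := g.permit("train", p)
		fmt.Println(p, v == success, released)
	}
}
```

The first two Pods come back as Wait with nothing released; the third returns Success and releases all three, which is the coordinated burst described above.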
PreBind is the side-effect hook. Its canonical user is VolumeBinding: when a Pod has a PVC with volumeBindingMode: WaitForFirstConsumer, the scheduler delays the actual PV binding until it knows which Node it has chosen, because the PV must satisfy that Node's topology constraints. PreBind issues the API call to bind the PVC to a PV, waits for the binding to be acknowledged, and only then returns Success. If it fails, the binding cycle aborts; the Pod returns to the queue.
Bind itself is the simplest hook in the entire chain. The default binder, in 1.30, literally does this: apiClient.CoreV1().Pods(ns).Bind(ctx, &v1.Binding{…}, metav1.CreateOptions{}). It POSTs to the /api/v1/namespaces/{ns}/pods/{name}/binding subresource, which is the only api-server endpoint allowed to set spec.nodeName on an already-existing Pod, and waits for the 201. If the bind succeeds, the framework dispatches PostBind for telemetry; if it fails with a 409 (a conflict, meaning someone else bound the Pod meanwhile), the framework runs Unreserve and requeues. You can implement a custom Bind plugin, but you almost certainly should not; the only realistic reason is exotic placement (an external scheduler that wants to write the result via a CRD instead of /binding).
PostBind is fire-and-forget. By the time it runs, the Pod is officially bound; the kubelet on the chosen Node will see the watch event and start work imminently. PostBind exists for plugins that want to emit metrics, write a custom event, or kick an external system. It cannot fail the bind. If it errors, the framework logs and moves on.
Operational note: the binding cycle is the place where a slow external dependency (volume provisioning, GPU device attachment, custom webhook) silently caps cluster scheduling throughput. Even though it runs in parallel with the next Pod's scheduling cycle, the scheduler holds an in-flight-binds limiter (default 100), and if PreBind takes ten seconds, you bottleneck at 100 / 10 s = ten binds per second. Watch scheduler_framework_extension_point_duration_seconds{ extension_point="PreBind" }.
Preemption — the PostFilter hook and the nominator pattern.
Preemption is the cluster's pressure valve. When a high-priority Pod cannot be scheduled because no Node has capacity, the scheduler may decide to evict lower-priority Pods to make room. The mechanism for this in the framework is PostFilter — a hook that runs only when Filter has produced an empty feasibility set, i.e. the Pod would otherwise be marked Unschedulable. The default plugin at PostFilter is DefaultPreemption; it implements the algorithm described in KEP-270. That plugin is opt-out: a scheduling profile can disable it, and some clusters do, in favour of stricter capacity guarantees.
The Pod must be eligible. Preemption compares numeric spec.priority values, which a Pod normally gets by setting priorityClassName to the name of a cluster-scoped PriorityClass object. Two well-known PriorityClasses ship with every cluster: system-cluster-critical (2,000,000,000) and system-node-critical (2,000,001,000). Most workload Pods sit in the user-defined range, often below the globalDefault set by the cluster admin. The Pod can also opt out via spec.preemptionPolicy: Never, which means "I want priority for queue ordering but I will not evict anyone to schedule".
The algorithm is a search. For each Node in the feasibility-rejected set, the preemption plugin asks: which Pods on this Node could I evict to make my Pod fit, given that I can only evict Pods with priority strictly lower than mine? It computes a candidate victim set per Node. From the per-Node victim sets, it picks the Node whose victim set is cheapest by a tiebreaker chain: fewest violated PodDisruptionBudgets, then the Node whose highest-priority victim has the lowest priority, then the smallest sum of victim priorities, then the fewest victims, then the most recently started victims (so long-running Pods are spared), and finally the first candidate remaining.
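A stripped-down version of the search makes the two stages concrete. This sketch is deliberately simplified and is not the DefaultPreemption algorithm: it tracks CPU only, ignores PodDisruptionBudgets, evicts lowest-priority Pods first until the incoming Pod fits, and ranks Nodes by victim count with Node name as the final tie-break. All type and function names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

type pod struct {
	name     string
	priority int
	cpu      int
}

type node struct {
	name     string
	capacity int
	pods     []pod
}

// victimsOn returns the Pods to evict from n so that incoming fits,
// considering only Pods with strictly lower priority.
func victimsOn(n node, incoming pod) ([]pod, bool) {
	used := 0
	var lower []pod
	for _, p := range n.pods {
		used += p.cpu
		if p.priority < incoming.priority {
			lower = append(lower, p)
		}
	}
	// Evict the least important Pods first.
	sort.SliceStable(lower, func(i, j int) bool { return lower[i].priority < lower[j].priority })
	free := n.capacity - used
	var victims []pod
	for _, p := range lower {
		if free >= incoming.cpu {
			break
		}
		victims = append(victims, p)
		free += p.cpu
	}
	return victims, free >= incoming.cpu
}

// pickNode chooses the Node with the fewest victims, then the lowest name.
func pickNode(nodes []node, incoming pod) (string, []pod, bool) {
	best := -1
	var bestVictims []pod
	for i, n := range nodes {
		v, ok := victimsOn(n, incoming)
		if !ok {
			continue
		}
		if best == -1 || len(v) < len(bestVictims) ||
			(len(v) == len(bestVictims) && n.name < nodes[best].name) {
			best, bestVictims = i, v
		}
	}
	if best == -1 {
		return "", nil, false
	}
	return nodes[best].name, bestVictims, true
}

func main() {
	nodes := []node{
		{"node-a", 4, []pod{{"x", 10, 2}, {"y", 5, 2}}},
		{"node-b", 4, []pod{{"z", 1, 1}, {"w", 1, 1}, {"v", 100, 2}}},
	}
	name, victims, ok := pickNode(nodes, pod{"payments-api", 50, 2})
	fmt.Println(name, victims, ok)
}
```

Here node-a wins because one eviction frees enough CPU, while node-b would need two; a Pod with priority 0 finds no victims anywhere and the search fails.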
When the search succeeds, the plugin does not bind the Pod on the spot. That is the design choice that matters. The plugin sets status.nominatedNodeName on the Pod to the chosen Node and issues delete requests for each victim with the standard grace period. The high-priority Pod is then returned to the scheduler queue. On the next cycle, the scheduler sees a Pod with a nominatedNodeName, treats that Node as a strong hint, and waits for the victims to drain. Once they do, the next Filter pass finds the Node feasible, and the Pod binds normally.
This nominator pattern is subtle. It means preemption is asynchronous: the high-priority Pod does not jump the queue, and the cluster does not stall waiting for victims to die. A well-behaved victim respects its terminationGracePeriodSeconds and shuts down cleanly; a badly-behaved one ignores SIGTERM and holds its resources until the grace period expires and the kubelet kills it, which delays the preemptor but not the rest of the queue. The nominatedNodeName field is also visible to controllers, which means tools watching for "this Pod is about to land here" can prepare. The eviction simulator visualises the cascade.
```yaml
# A high-priority Pod with explicit preemption policy
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: latency-critical
value: 1000000
globalDefault: false
description: "Customer-facing services. Will preempt lower-priority Pods."
---
apiVersion: v1
kind: Pod
metadata:
  name: payments-api
spec:
  priorityClassName: latency-critical
  preemptionPolicy: PreemptLowerPriority  # default; the alternative is Never
  containers:
    - name: api
      image: payments-api:1.4.0
      resources:
        requests: { cpu: "2", memory: "4Gi" }
```
Preemption interacts in subtle ways with PodDisruptionBudgets. A PDB declares a minimum number of replicas that must remain available; preemption tries hard not to violate it, but the algorithm treats PDB violations as a soft preference (lower-cost candidate sets are preferred), not a hard constraint. If every candidate violates a PDB, preemption will still fire — the high-priority Pod's claim wins. Operators who want hard PDB protection use scheduling profiles to disable DefaultPreemption for low-priority workloads, or set preemptionPolicy: Never on the high-priority side.
A subtle interaction — preemption searches the feasibility-rejected set, but PodTopologySpread with hard constraints can keep that set empty in pathological topologies (one zone is full, one zone is empty). The plugin will not preempt across topology constraints; it has no way to know that placing the Pod elsewhere would be acceptable. The fix is to soften the constraint to whenUnsatisfiable: ScheduleAnyway, or accept that the Pod stays Pending.
Scheduling profiles and multi-scheduler — picking the right shape.
A single kube-scheduler binary can host multiple scheduling profiles, each with its own plugin set and weights, addressed by the Pod's spec.schedulerName. This is the framework's answer to "I have two workload classes and I want different placement for each", without forcing the operator to deploy a second scheduler binary. Profiles share the same in-memory snapshot, the same Node informers, the same leader-election lease — so they are cheap — but they can differ in which plugins they enable, in what order, and with what weights. KEP-1451 introduced them; every modern cluster has at least one.
```yaml
# /etc/kubernetes/kube-scheduler-config.yaml: KubeSchedulerConfiguration v1
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
percentageOfNodesToScore: 30
profiles:
  - schedulerName: default-scheduler
    plugins:
      score:
        enabled:
          - { name: NodeResourcesFit, weight: 1 }
          - { name: ImageLocality, weight: 2 }       # bias toward image-cached nodes
          - { name: PodTopologySpread, weight: 4 }
          - { name: TaintToleration, weight: 3 }
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: LeastAllocated
            resources:
              - { name: cpu, weight: 1 }
              - { name: memory, weight: 1 }
  - schedulerName: bin-packing-scheduler
    plugins:
      postFilter:
        disabled:
          - { name: DefaultPreemption }              # no preemption for batch
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated                      # pack tightly
  - schedulerName: latency-spread-scheduler
    plugins:
      score:
        enabled:
          - { name: PodTopologySpread, weight: 10 }  # dominate the score
          - { name: NodeResourcesFit, weight: 1 }
```
That single binary now serves three workload classes. A Pod with no schedulerName uses the default profile. A batch Job that sets spec.schedulerName: bin-packing-scheduler gets MostAllocated scoring without preemption. A latency-sensitive Service Pod with spec.schedulerName: latency-spread-scheduler gets aggressive topology spread. All three share the same informer caches and run inside the same goroutine pool. Costs are minimal; expressiveness is high.
When profiles are not enough, you run a separate scheduler binary. The pattern is the same either way: deploy the binary, give it a service account with the standard scheduler RBAC (read Pods/Nodes/PVCs/PVs/etc, write /binding and the nominatedNodeName status field), give it its own leader-election lease in a different name, and have it filter on its own spec.schedulerName. The default scheduler ignores Pods named for someone else; your scheduler ignores Pods not named for it. They never see each other's work.
Several real schedulers ship this way. Volcano is a gang-aware scheduler for ML and batch workloads. It runs its own scheduling loop and plugin model rather than the kube-scheduler framework, but the guarantee it provides is the one Permit expresses: refuse to bind any Pod of a job until enough siblings have been placed. Apache Yunikorn adds capacity-aware fair scheduling with hierarchical queues, useful when ten teams share one cluster. The descheduler is the inverse: not a scheduler, but a controller that periodically scans for sub-optimally placed Pods (high node imbalance, violated affinity rules after a Node cordon) and evicts them, letting the regular scheduler pick a better home. Karmada goes one step further: a federation layer that schedules at the cluster level, picking which member cluster a workload should land in before the per-cluster scheduler picks the Node within it.
Choosing between a profile and a separate binary is a question of how much you need to diverge. If your tweak is "different weights, plus or minus one or two plugins", a profile is right; you keep the framework's correctness guarantees and pay no operational cost. If your tweak is "I need a totally different placement algorithm, talking to an external solver, with a multi-second latency budget", a separate binary is right; profile-mode would block other Pods on your slow cycle. The midpoint, a custom plugin dropped into the default scheduler binary via the framework's plugin SDK, is suited to the most common case: you have one new constraint to express and the rest of the default behaviour is fine.
A custom plugin is, in implementation, a Go struct that satisfies one or more of the framework's per-extension-point interfaces and is compiled into the scheduler binary by passing app.WithPlugin to app.NewSchedulerCommand. A minimal Filter and Score plugin looks like this:
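```go
// my-plugin/plugin.go: registers as "MyAffinity"
package myaffinity

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	framework "k8s.io/kubernetes/pkg/scheduler/framework"
)

// Plugin implements the Filter and Score extension points.
type Plugin struct {
	handle framework.Handle
}

func (p *Plugin) Name() string { return "MyAffinity" }

// Filter rejects non-GPU nodes for Pods labelled workload=gpu.
func (p *Plugin) Filter(ctx context.Context, s *framework.CycleState, pod *v1.Pod, n *framework.NodeInfo) *framework.Status {
	if pod.Labels["workload"] == "gpu" && n.Node().Labels["hardware"] != "gpu" {
		return framework.NewStatus(framework.UnschedulableAndUnresolvable, "node has no GPU")
	}
	return nil
}

// Score gives full marks to nodes in the Pod's preferred zone.
func (p *Plugin) Score(ctx context.Context, s *framework.CycleState, pod *v1.Pod, nodeName string) (int64, *framework.Status) {
	n, err := p.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)
	if err != nil {
		return 0, framework.AsStatus(err)
	}
	if n.Node().Labels["zone"] == pod.Annotations["preferred-zone"] {
		return 100, nil
	}
	return 0, nil
}

// ScoreExtensions returns nil because the scores are already on the 0-100 scale.
func (p *Plugin) ScoreExtensions() framework.ScoreExtensions { return nil }

// New is the plugin factory. Older releases use the two-argument form
// New(_ runtime.Object, h framework.Handle) without the context parameter.
func New(_ context.Context, _ runtime.Object, h framework.Handle) (framework.Plugin, error) {
	return &Plugin{handle: h}, nil
}
```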
The UnschedulableAndUnresolvable status is important — it tells the framework that no preemption can fix this. Returning plain Unschedulable instead would make the preemption plugin search for victims to evict, which is wrong if your constraint is fundamental (the Node lacks hardware).
Further reading: KEPs, source pointers, the perf harness.
The scheduler is one of the better-documented subsystems in Kubernetes, partly because its API surface is small and partly because the framework's graduation to stable in 1.19 forced a clean rewrite of the docs. The list below is the canonical set, ordered by how often you will return to it. The default plugins table on this page maps directly to the files in pkg/scheduler/framework/plugins/; reading those is the fastest way to understand what each plugin does, and the plugin tests are some of the clearest test code in the kubernetes/kubernetes tree.
If you are tuning a cluster, the resource you will reach for most is scheduler-perf, the in-tree benchmark harness at test/integration/scheduler_perf/. It synthesises clusters of up to five thousand Nodes and ten thousand Pods, runs the scheduler against them, and produces latency histograms per extension point. It is what the upstream contributors use to measure the impact of every framework change; it is what you should use before you ship a custom plugin to a fleet larger than a couple of hundred Nodes.
And the rest of the Semicolony ladder: the architecture sub-page is the eight-process map this scheduler lives inside; the pod lifecycle sub-page picks up at the moment Bind commits and follows the Pod through the kubelet's SyncLoop; the controller pattern sub-page is the deeper version of "scheduler is just an informer-driven controller with one specific reconciliation goal". For the moving picture, the rollout simulator and the eviction simulator show what scoring choices and preemption look like across a sustained workload.
One last observation. The scheduling framework is, on examination, an unusually clean instance of the strategy pattern at scale — eleven hooks, a fixed order, a generic walker, and a registry of plugins that each implement only the hooks they care about. The default scheduler is one configuration of the framework; Volcano is another; your custom plugin dropped into the default binary is a third. The shape is reusable because it draws the right boundaries: per-Pod state in CycleState, per-cycle accounting in the snapshot, and every irreversible action gated behind Reserve / Unreserve. When you build a similar system — a custom controller with multiple decision phases, an admission policy with ordered checks — copy that shape. Eleven hooks is roughly the right number, and short-circuiting on the first NO is roughly the right control flow.
Default plugins, mapped to extension points.
| Plugin | Hooks | Notes |
|---|---|---|
| NodeAffinity | PreFilter, Filter, PreScore, Score | spec.affinity.nodeAffinity required + preferred terms |
| NodeResourcesFit | PreFilter, Filter, Score | CPU/memory/ephemeral-storage requests vs allocatable |
| NodeName | Filter | Honour spec.nodeName when explicitly set |
| NodePorts | PreFilter, Filter | hostPort conflicts on the node |
| NodeUnschedulable | Filter | Refuse cordoned nodes (spec.unschedulable=true) |
| TaintToleration | Filter, PreScore, Score | Taints on the node vs tolerations on the Pod |
| PodTopologySpread | PreFilter, Filter, PreScore, Score | topologySpreadConstraints — even spread across zones, hosts |
| InterPodAffinity | PreFilter, Filter, PreScore, Score | podAffinity / podAntiAffinity |
| VolumeBinding | PreFilter, Filter, Reserve, PreBind | PVC binding — runs the late-binding handshake |
| VolumeRestrictions | Filter | Single-writer volume types, RWO conflicts |
| VolumeZone | Filter | Topology — pod must land in the volume's zone |
| EBSLimits, GCEPD, etc. | Filter | Cloud-provider attach-limit ceilings |
| DefaultPreemption | PostFilter | Find victims when nothing fits. Disabled in some profiles |
| ImageLocality | Score | Prefer nodes that already have the image on disk |
| DefaultBinder | Bind | The bind subresource POST. Almost never replaced |
| SchedulingGates | PreEnqueue | Block a Pod from entering activeQ until gates clear (KEP-3521) |