Container Scheduler Simulator: one pod at a time.
A scheduler is a function: given a queue of pods that want resources and a set of nodes that have resources, place each pod on a node. The Kubernetes version runs two phases — filter (predicates) then score (priorities). The ECS version picks a strategy — binpack, spread, or random. The autoscaler watches what didn't fit and grows the cluster. Queue tasks, pick a strategy, step the scheduler, and watch the packing arithmetic resolve.
The grid of node cards is your cluster — each shows CPU and memory bars and the
pods placed on it. The pending queue lists pods waiting for a home, with their
resource asks and any constraints (↔ means co-locate, ✗
means keep apart, tol is a taint toleration). Step the scheduler and it
picks one pending pod, filters out nodes that can't hold it, scores the survivors,
and commits to the winner. Score badges and the log show that arithmetic as it runs.
Step all once with binpack, then reset, switch to spread, and step all again — same
pods, but binpack stuffs them onto the fewest nodes while spread fans them wide.
Watch gpu-train: it only ever lands on the tainted GPU node because it's
the one pod that tolerates the taint. The surprise is what happens at capacity. Queue
a pod bigger than any free node and, with the autoscaler on, a brand-new node appears
rather than the pod sitting unschedulable. Turn the autoscaler off and the same pod
gets stuck, with the log naming exactly which constraint each node failed.
What container scheduling actually is
You have a set of nodes with finite CPU and memory, and a queue of containers that each declare what they need. The scheduler's job is to decide which container runs on which node. Framed cleanly, it's the multi-dimensional bin-packing problem — NP-hard in the general case, and even harder in practice because items arrive online (the queue is fed forever) and they're not interchangeable (a GPU job demands a GPU node, an anti-affinity pod refuses a specific neighbour).
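To make the cost of online arrival concrete, here is a minimal sketch of first-fit packing over two dimensions (CPU, memory). The function name `first_fit`, the node capacity, and the pod sizes are illustrative assumptions, not values from the simulator. The same four pods need three nodes when placed in arrival order, but only two when the packer is allowed to sort them first.

```python
def first_fit(items, capacity):
    """Place each (cpu, mem) item in the first bin with room; return bin count."""
    bins = []  # each entry is the remaining (cpu, mem) of one node
    for cpu, mem in items:
        for i, (free_cpu, free_mem) in enumerate(bins):
            if cpu <= free_cpu and mem <= free_mem:
                bins[i] = (free_cpu - cpu, free_mem - mem)
                break
        else:
            # No existing node fits, so open a new one.
            bins.append((capacity[0] - cpu, capacity[1] - mem))
    return len(bins)


capacity = (1000, 1000)  # millicores, MiB (illustrative)
arrivals = [(400, 400), (400, 400), (600, 600), (600, 600)]

online = first_fit(arrivals, capacity)                         # arrival order
offline = first_fit(sorted(arrivals, reverse=True), capacity)  # largest first
print(online, offline)
```

A real scheduler never gets the offline view, which is part of why production systems settle for heuristics.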
Production schedulers don't try to find the optimum. They use online heuristics — fast filters, simple scoring functions, no backtracking. Kubernetes' kube-scheduler places roughly 100 pods per second on a typical cluster; ECS does similar with its task placement engine. The goal isn't perfect packing; it's "good enough, fast enough, deterministically enough that nothing falls over".
ECS task placement strategies
ECS exposes three strategies you can stack: binpack places tasks on the instance with the least available CPU or memory (pack tight, scale down cheaply), spread distributes tasks across an attribute — most commonly attribute:ecs.availability-zone for HA, or instanceId for even per-host distribution, and random picks any candidate. Stacks evaluate in order: [spread by AZ, binpack memory] means "first balance across AZs, then within an AZ pack tightly by memory".
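The stacked evaluation can be sketched as a compound sort key. This is a simplified model under stated assumptions, not ECS's actual placement engine: the `Instance` class, the `place` function, and the fleet below are hypothetical, and only memory is tracked.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    az: str
    free_mem: int      # MiB not yet reserved
    tasks: int = 0     # tasks already placed here


def place(instances, mem_needed):
    """Return the chosen instance name, or None if nothing fits."""
    candidates = [i for i in instances if i.free_mem >= mem_needed]
    if not candidates:
        return None
    per_az = {}
    for i in instances:
        per_az[i.az] = per_az.get(i.az, 0) + i.tasks
    # Strategy stack as a sort key: fewest tasks in the AZ first (spread),
    # then least free memory (binpack), then name as a stable tie-break.
    best = min(candidates, key=lambda i: (per_az[i.az], i.free_mem, i.name))
    best.free_mem -= mem_needed
    best.tasks += 1
    return best.name


fleet = [
    Instance("a1", "us-east-1a", 4096),
    Instance("a2", "us-east-1a", 2048),
    Instance("b1", "us-east-1b", 4096),
]
placements = [place(fleet, 1024) for _ in range(4)]
print(placements)
```

Tasks alternate between the two zones, and within the first zone the smaller instance fills up before the larger one is touched.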
Placement constraints are separate from strategies. distinctInstance guarantees no two tasks land on the same EC2 instance (think of a daemonset replica per host). memberOf uses cluster query language expressions like attribute:ecs.instance-type =~ c5.* to scope placement to a subset of the cluster. Constraints filter; strategies rank.
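"Constraints filter" can be illustrated with a few lines. This sketch uses Python's `fnmatchcase` as a stand-in for the cluster query language's pattern match, which is an assumption made for illustration; the instance IDs and the `eligible` helper are hypothetical.

```python
from fnmatch import fnmatchcase


def eligible(instances, type_pattern, occupied_ids):
    """Apply a memberOf-style type filter plus a distinctInstance-style check."""
    return [
        i["id"]
        for i in instances
        if fnmatchcase(i["type"], type_pattern)   # stand-in for memberOf
        and i["id"] not in occupied_ids           # stand-in for distinctInstance
    ]


instances = [
    {"id": "i-01", "type": "c5.large"},
    {"id": "i-02", "type": "c5.xlarge"},
    {"id": "i-03", "type": "m5.large"},
]
# One task of this service already runs on i-01.
print(eligible(instances, "c5.*", {"i-01"}))
```

Whatever survives this filter is what the strategy stack then ranks.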
Kubernetes — predicates then priorities
kube-scheduler is two phases. Filter (formerly predicates) eliminates every node that cannot run the pod for hard reasons: insufficient CPU, insufficient memory, taint not tolerated, node selector not matched, volume topology incompatible, node not Ready. Score (formerly priorities) ranks the survivors on soft preferences: bin-pack vs spread, image locality (already-pulled image scores higher), inter-pod affinity, topology spread evenness, node resource balance.
Since 1.19 the whole thing is a pluggable framework: each phase has extension points (PreFilter, Filter, PostFilter, PreScore, Score, Reserve, Permit, Bind). The default plugins (NodeResourcesFit, ImageLocality, InterPodAffinity, NodeAffinity, TaintToleration, PodTopologySpread, VolumeBinding) cover the large majority of cases. Custom plugins are written in Go and compiled into a scheduler binary, which can replace the default scheduler or run alongside it as a second scheduler.
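The filter-then-score shape fits in a few dozen lines. The sketch below is a toy model, not kube-scheduler's code: `Node`, `Pod`, `feasible`, `score`, and `schedule_one` are hypothetical names, taints are reduced to bare strings, and the score is a single averaged fill ratio rather than a set of weighted plugins.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    cpu: int                         # allocatable millicores
    mem: int                         # allocatable MiB
    used_cpu: int = 0                # sum of requests already placed
    used_mem: int = 0
    taints: frozenset = frozenset()


@dataclass
class Pod:
    name: str
    cpu: int
    mem: int
    tolerations: frozenset = frozenset()


def feasible(pod, node):
    """Filter phase: a hard yes/no per node."""
    fits = (node.used_cpu + pod.cpu <= node.cpu
            and node.used_mem + pod.mem <= node.mem)
    return fits and node.taints <= pod.tolerations


def score(pod, node, strategy):
    """Score phase: 0-100, higher wins."""
    fill = ((node.used_cpu + pod.cpu) / node.cpu
            + (node.used_mem + pod.mem) / node.mem) / 2
    return round(100 * (fill if strategy == "binpack" else 1 - fill))


def schedule_one(pod, nodes, strategy):
    survivors = [n for n in nodes if feasible(pod, n)]
    if not survivors:
        return None                  # pod stays pending
    winner = min(survivors, key=lambda n: (-score(pod, n, strategy), n.name))
    winner.used_cpu += pod.cpu       # commit the reservation
    winner.used_mem += pod.mem
    return winner.name


def demo_nodes():
    return [
        Node("n1", 4000, 8192, used_cpu=2000, used_mem=4096),
        Node("n2", 4000, 8192),
        Node("gpu", 8000, 16384, taints=frozenset({"gpu"})),
    ]


web = Pod("web", 1000, 1024)
print(schedule_one(web, demo_nodes(), "binpack"))
print(schedule_one(web, demo_nodes(), "spread"))
```

The same pod lands on the half-full node under binpack and on the empty node under spread; the tainted node never survives the filter for a pod without the toleration.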
Requests vs limits — what scheduling sees
This is the gotcha that bites the most. Requests are what the scheduler reserves on the node; the sum of requests on a node cannot exceed the node's allocatable capacity, by construction. Limits are what the runtime enforces: once the container is running, cgroups throttle CPU at the limit, and the kernel OOM-kills a container that exceeds its memory limit. The scheduler does not look at limits.
The consequence: a pod with requests.cpu: 100m and limits.cpu: 4 looks tiny to the scheduler and is allowed to land anywhere, but at runtime it can spike to four cores and starve its neighbours. Conversely, a pod with requests.cpu: 4 and limits.cpu: 4 reserves the four cores even if it sits idle. The honest sizing is: requests close to your steady-state usage, CPU limits where you'd accept throttling, and memory limits where you'd actually want the OOM killer to fire.
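The arithmetic behind that consequence, as a minimal sketch. The helper names `fits` and `limit_overcommit` and the pod dictionaries are hypothetical; only CPU is modelled.

```python
def fits(node_allocatable_m, pods, new_pod):
    """The scheduler's check: requests only, limits are never consulted."""
    requested = sum(p["request_m"] for p in pods)
    return requested + new_pod["request_m"] <= node_allocatable_m


def limit_overcommit(node_allocatable_m, pods):
    """How far the sum of limits exceeds what the node can actually deliver."""
    return sum(p["limit_m"] for p in pods) / node_allocatable_m


node_m = 4000                                    # a 4-core node, in millicores
burster = {"request_m": 100, "limit_m": 4000}    # requests.cpu: 100m, limits.cpu: 4
pods = [burster] * 10

print(fits(node_m, pods, burster))       # another one still fits by requests
print(limit_overcommit(node_m, pods))    # but limits promise 10x the node
```

Ten such pods reserve a single core between them, so the node looks three-quarters empty while being able to demand ten times its capacity.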
Affinity, anti-affinity, taints, tolerations
Four declarative placement primitives that look similar and do different things. NodeAffinity — "schedule this pod only on nodes that match this label expression" — required or preferred. PodAffinity / PodAntiAffinity — "schedule this pod on a node where some other matching pod is already running" (or isn't). Often used for "co-locate cache with API" or "never put two replicas of a database on the same host".
Taints are the inverse direction. A node says "I have taint gpu=true:NoSchedule" — no pod lands here unless it has a matching toleration. The combination is how you carve a cluster: taint your GPU nodes so only GPU pods land there; taint your spot nodes so only spot-tolerant workloads land there; taint a node NoExecute to drain it.
The mnemonic: affinity is the pod choosing the node, taints are the node rejecting the pod, and topology spread is the cluster enforcing even distribution after both other rules pass.
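The "node rejecting the pod" half can be sketched as a matching function. This follows the toleration rules as commonly documented (an Exists operator ignores the value; an unset effect matches any effect), but it is a simplified illustration with hypothetical class and function names, not the scheduler's implementation.

```python
from typing import NamedTuple, Optional


class Taint(NamedTuple):
    key: str
    value: str
    effect: str                    # NoSchedule, PreferNoSchedule, NoExecute


class Toleration(NamedTuple):
    key: Optional[str] = None      # None matches any key (needs operator Exists)
    operator: str = "Equal"        # Equal or Exists
    value: str = ""
    effect: Optional[str] = None   # None matches any effect


def tolerates(tol, taint):
    if tol.effect is not None and tol.effect != taint.effect:
        return False
    if tol.key is None:
        return tol.operator == "Exists"
    if tol.key != taint.key:
        return False
    return tol.operator == "Exists" or tol.value == taint.value


def blocking_taints(taints, tolerations):
    """Taints that keep the pod off the node; PreferNoSchedule is only a soft hint."""
    return [
        t for t in taints
        if t.effect in ("NoSchedule", "NoExecute")
        and not any(tolerates(tol, t) for tol in tolerations)
    ]


gpu_taint = Taint("gpu", "true", "NoSchedule")
print(blocking_taints([gpu_taint], []))
print(blocking_taints([gpu_taint], [Toleration("gpu", "Equal", "true", "NoSchedule")]))
```

An empty result means the node's taints do not stand in the pod's way; resource fit and affinity are checked separately.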
Pod topology spread constraints
topologySpreadConstraints is the modern, correct way to say "I want my replicas evenly distributed across zones / racks / hostnames". It supersedes the older preferredDuringSchedulingIgnoredDuringExecution anti-affinity tricks and is faster because it doesn't require an O(N×M) cross-pod evaluation.
A typical constraint: maxSkew: 1, topologyKey: topology.kubernetes.io/zone, whenUnsatisfiable: DoNotSchedule. Reads as: across zones, no zone can have more than one extra replica compared to the least-loaded zone. If that constraint can't be satisfied, the pod stays pending. The softer ScheduleAnyway version places the pod on the best-effort node and just degrades the spread evenness.
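The maxSkew check is a one-line inequality: a zone is allowed if its replica count after placement, minus the smallest current count across zones, stays within maxSkew. A minimal sketch, with `allowed_zones` as a hypothetical helper and the zone counts made up for illustration:

```python
def allowed_zones(counts, max_skew):
    """Zones where adding one more replica keeps the skew within max_skew."""
    floor = min(counts.values())          # replicas in the least-loaded zone
    return sorted(z for z, c in counts.items() if c + 1 - floor <= max_skew)


print(allowed_zones({"a": 2, "b": 1, "c": 1}, max_skew=1))
print(allowed_zones({"a": 3, "b": 3, "c": 0}, max_skew=1))
```

With DoNotSchedule, a pod whose only feasible nodes sit in disallowed zones stays pending; with ScheduleAnyway the same calculation would only lower a node's score.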
Pod priority and preemption
In a full cluster, kube-scheduler can evict lower-priority pods to make room for higher-priority ones. PriorityClass objects map a name to a 32-bit integer priority; pods reference one. The two reserved classes — system-cluster-critical and system-node-critical — are the highest, used for things like the CNI plugin and kube-proxy that must never get evicted.
Preemption is greedy: the scheduler finds the smallest set of lower-priority pods on some node that, if evicted, would let the pending pod fit. It graceful-terminates them (SIGTERM, then SIGKILL after terminationGracePeriodSeconds). For batch and ML clusters this is the right behaviour; for production web traffic it's a footgun if priorities aren't set carefully — a misconfigured batch job can knock out half your API tier.
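A rough sketch of the greedy victim search, assuming CPU is the only resource and ignoring PodDisruptionBudgets and graceful termination. The function names and the cluster below are hypothetical; the real scheduler weighs more criteria when choosing between nodes.

```python
def victims_on(node, pod):
    """Lowest-priority pods to evict from this node so `pod` fits, or None."""
    free = node["cpu"] - sum(p["cpu"] for p in node["pods"])
    lower = sorted(
        (p for p in node["pods"] if p["priority"] < pod["priority"]),
        key=lambda p: (p["priority"], -p["cpu"]),
    )
    chosen = []
    for p in lower:
        if free >= pod["cpu"]:
            break
        chosen.append(p)
        free += p["cpu"]
    return chosen if free >= pod["cpu"] else None


def pick_preemption(nodes, pod):
    """Prefer the node needing the fewest, lowest-priority victims."""
    best = None
    for node in nodes:
        victims = victims_on(node, pod)
        if victims is None:
            continue
        key = (len(victims),
               max((p["priority"] for p in victims), default=-1),
               node["name"])
        if best is None or key < best[0]:
            best = (key, node["name"], [p["name"] for p in victims])
    return None if best is None else best[1:]


nodes = [
    {"name": "n1", "cpu": 4000, "pods": [
        {"name": "api", "priority": 1000, "cpu": 2000},
        {"name": "batch-a", "priority": 10, "cpu": 1000},
        {"name": "batch-b", "priority": 10, "cpu": 1000},
    ]},
    {"name": "n2", "cpu": 4000, "pods": [
        {"name": "api2", "priority": 1000, "cpu": 3000},
        {"name": "batch-c", "priority": 10, "cpu": 1000},
    ]},
]
print(pick_preemption(nodes, {"name": "report", "priority": 500, "cpu": 2000}))
```

Flip the priorities so that the batch pods outrank the API pods and the same code evicts the API tier, which is the footgun described above.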
Cluster Autoscaler
Cluster Autoscaler watches for pending pods that the scheduler has marked unschedulable. It scans every 10 seconds by default (the interval is configurable), and when the only thing blocking a pod is "insufficient capacity", it picks a node group (usually an AWS Auto Scaling Group, GKE node pool, or Azure VMSS) whose machine shape would fit the pod, and asks it to scale up. The new node joins the cluster in 30 seconds to a few minutes, depending on cloud.
Scale-down runs on a different loop. A node is candidate for removal when its utilisation drops below ~50% (the threshold is configurable) for 10 consecutive minutes and every pod on it could be rescheduled elsewhere. Cluster Autoscaler then drains the node and removes it from the ASG. Misconfigured PodDisruptionBudgets are the #1 reason scale-down stalls — if a PDB forbids any pod from being evicted, the autoscaler can't drain the node.
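Both loops reduce to small predicates. The sketch below is an approximation under stated assumptions: `pick_node_group` simply prefers the smallest fitting shape (the real autoscaler's choice depends on its configured expander), utilisation is taken as the larger of the CPU and memory request ratios, and all names and numbers are hypothetical.

```python
def pick_node_group(groups, pod):
    """Scale-up: smallest machine shape that fits the pod and can still grow."""
    fitting = [
        g for g in groups
        if g["cpu"] >= pod["cpu"] and g["mem"] >= pod["mem"]
        and g["size"] < g["max_size"]
    ]
    if not fitting:
        return None
    return min(fitting, key=lambda g: (g["cpu"], g["mem"]))["name"]


def removable(node, low_for_s, pods_fit_elsewhere, threshold=0.5, unneeded_s=600):
    """Scale-down: under-utilised for long enough, and drainable."""
    utilisation = max(node["req_cpu"] / node["cpu"], node["req_mem"] / node["mem"])
    return utilisation < threshold and low_for_s >= unneeded_s and pods_fit_elsewhere


groups = [
    {"name": "small", "cpu": 2000, "mem": 4096, "size": 3, "max_size": 5},
    {"name": "large", "cpu": 8000, "mem": 32768, "size": 1, "max_size": 1},
    {"name": "xl", "cpu": 16000, "mem": 65536, "size": 0, "max_size": 3},
]
print(pick_node_group(groups, {"cpu": 4000, "mem": 8192}))

quiet = {"cpu": 4000, "mem": 8192, "req_cpu": 1000, "req_mem": 1024}
print(removable(quiet, low_for_s=600, pods_fit_elsewhere=True))
```

The `pods_fit_elsewhere` flag is where a restrictive PodDisruptionBudget bites: if it is false, the node stays no matter how idle it is.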
Karpenter — the modern alternative
Karpenter, open-sourced by AWS in 2021, replaces the "node group + ASG" model with a direct provisioner. Instead of pre-sizing a pool of m5.large machines and asking the ASG to add more, Karpenter looks at what the pending pods actually need and provisions an instance type that fits, chosen from a potentially large set of candidate types. A pod that wants 30 GiB RAM gets an r5.xlarge; a tiny sidecar might pack onto a t3.medium.
The result: faster scale-up (~30s end-to-end vs minutes for ASG-backed Cluster Autoscaler), better bin-packing (no wasted capacity from oversized standard nodes), and easier spot adoption (Karpenter handles spot interruption and replacement). The trade-off is operational opinionatedness — Karpenter wants to own provisioning decisions, which can conflict with org-mandated AMI / network configurations.
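The core idea, picking the cheapest shape that fits what is actually pending, can be sketched as follows. This is not Karpenter's algorithm: the `cheapest_fit` helper is hypothetical, the prices are placeholders rather than real quotes, and capacity is treated as fully allocatable, which real nodes are not.

```python
CATALOGUE = [
    # (instance type, vCPU, memory GiB, placeholder hourly price)
    ("t3.medium", 2, 4, 0.04),
    ("m5.large", 2, 8, 0.10),
    ("r5.xlarge", 4, 32, 0.25),
]


def cheapest_fit(pods, catalogue=CATALOGUE):
    """Cheapest single instance type that holds all pending pods, or None."""
    need_cpu = sum(p["cpu"] for p in pods)
    need_mem = sum(p["mem_gib"] for p in pods)
    fitting = [t for t in catalogue if t[1] >= need_cpu and t[2] >= need_mem]
    if not fitting:
        return None
    return min(fitting, key=lambda t: t[3])[0]


print(cheapest_fit([{"cpu": 1, "mem_gib": 30}]))
print(cheapest_fit([{"cpu": 0.25, "mem_gib": 0.5}]))
```

Contrast this with a fixed node group, where every pending pod gets the same pre-chosen shape regardless of what it asked for.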
ECS vs Kubernetes — scheduling comparison
| | ECS | Kubernetes |
|---|---|---|
| Scheduler | AWS-managed control plane | kube-scheduler (self-hosted or managed) |
| Placement model | Strategies (binpack/spread/random) + constraints | Predicates (filter) + priorities (score) |
| Resource model | CPU units + memory (hard reservations) | Requests (scheduled) + limits (enforced) |
| Affinity | Limited — placement constraints via attributes | Full pod / node affinity DSL + topology spread |
| Autoscaling | Capacity Providers + ASG, or Fargate (per-task) | Cluster Autoscaler or Karpenter |
| Preemption | No native preemption; relies on Capacity Providers | PodPriority + preemption built in |
| Extensibility | Limited — strategies are fixed | Scheduler framework, custom plugins, multi-scheduler |
| Sweet spot | AWS-shop, simpler ops, small/medium fleets | Complex placement, hybrid/multi-cloud, large fleets |
A pod's life through the scheduler — code-level
What actually happens, in order, when you kubectl apply -f pod.yaml:
1. kubectl POSTs to kube-apiserver
2. apiserver writes Pod to etcd with .spec.nodeName = ""
3. kube-scheduler's informer fires on the new Pod
4. PreFilter plugins compute state (e.g., NodeResourcesFit caches request totals)
5. Filter plugins evaluate every node:
   - NodeUnschedulable: is .spec.unschedulable == true?
   - NodeName: does .spec.nodeName match (if set)?
   - NodeAffinity: do node labels match nodeSelector / nodeAffinity?
   - NodeResourcesFit: cpu/mem requests <= allocatable - already-requested?
   - TaintToleration: does pod tolerate every NoSchedule taint?
   - VolumeBinding: are PVs available in this node's topology?
   - InterPodAffinity: do affinity / anti-affinity rules hold?
6. PreScore: prepare per-pod context for scoring
7. Score plugins return 0-100 per node:
   - NodeResourcesBalancedAllocation: prefer nodes with balanced cpu:mem fill
   - ImageLocality: prefer nodes that already have the container image
   - InterPodAffinity: prefer nodes near desired pods
   - TaintToleration: prefer nodes with fewer PreferNoSchedule taints
   - PodTopologySpread: prefer nodes that improve spread
8. Normalize + weight scores, pick highest
9. Reserve plugins claim resources on the chosen node
10. Permit plugin can delay or reject (rarely used)
11. scheduler issues a Bind (POST to the Pod's binding subresource, which sets .spec.nodeName)
12. kubelet on that node sees the Pod via its watch, pulls images, starts containers

Steps 4 through 8 run in roughly 5–20ms per pod on a 1000-node cluster, with caching tricks that mean only "dirty" nodes get re-evaluated. Step 11's bind is optimistic: it can fail if it loses a race with a conflicting update, in which case the scheduler retries the pod.
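Step 8 is where the per-plugin scores collapse into one winner. A minimal sketch, assuming each plugin's raw scores are rescaled so its best node gets 100 and then combined as a weighted sum; the raw scores and weights below are invented for illustration and do not reflect any default configuration.

```python
def normalize(raw):
    """Rescale one plugin's raw scores so its best node gets 100."""
    hi = max(raw.values())
    return {node: (100 * s // hi if hi else 0) for node, s in raw.items()}


def pick(plugin_scores, weights):
    """Weighted sum of normalized scores; ties go to the lowest node name."""
    totals = {}
    for plugin, raw in plugin_scores.items():
        for node, s in normalize(raw).items():
            totals[node] = totals.get(node, 0) + weights[plugin] * s
    return max(sorted(totals), key=totals.get), totals


plugin_scores = {
    "ImageLocality": {"n1": 40, "n2": 0},
    "NodeResourcesBalancedAllocation": {"n1": 50, "n2": 80},
}
equal = {"ImageLocality": 1, "NodeResourcesBalancedAllocation": 1}
balanced_heavy = {"ImageLocality": 1, "NodeResourcesBalancedAllocation": 3}

print(pick(plugin_scores, equal))
print(pick(plugin_scores, balanced_heavy))
```

Changing only the weights flips the winner from n1 to n2, which is why score weights are a tuning surface in their own right.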
Production failure modes
- Burst pending under deploy. A rollout creates 200 new pods in one second; the scheduler queues them; if any of them have anti-affinity rules, the scoring phase becomes O(pods × nodes × constraints) and latency spikes from milliseconds to seconds. Watch the scheduler_pending_pods and scheduler_e2e_scheduling_duration_seconds metrics (the latter is superseded by scheduler_scheduling_attempt_duration_seconds in newer releases).
- Missing resource requests. A team ships pods with no requests.cpu. The scheduler treats them as zero-CPU, packs them tightly, and at runtime they fight for cores. Fix at the LimitRange admission layer, not in code review.
- Fragmentation. Lots of mid-sized pods on a fleet of large nodes: each node has 1.5 cores free, but every pending pod needs 2. Total free capacity exists; no node has it contiguously. Cluster Autoscaler / Karpenter pick this up and scale.
- Hotspot nodes. Spread strategy interacts badly with cordoned / drained nodes: every pod avoids them, the rest get packed harder, one node turns red. The kubelet's Eviction Manager kicks in and starts evicting pods.
- Stuck pending pods. A PVC is bound to a zone-A volume; the pod has anti-affinity preventing it from landing on the only node in zone A; ergo unschedulable forever. The scheduler logs why ("no nodes available that match all of the predicates"); few teams read those logs until pages start firing.
- Priority inversion. A batch job has higher priority than a frontend service because someone forgot to set priorityClassName on the frontend Deployment, so the default class wins. Batch evicts frontend during peak. Always set explicit priorities on production workloads.
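The fragmentation case is plain arithmetic and worth seeing in numbers. A small sketch, with `fragmentation` as a hypothetical helper and a four-node fleet assumed for illustration:

```python
def fragmentation(free_per_node_m, pod_cpu_m):
    """Return (total free millicores, how many such pods actually fit)."""
    total_free = sum(free_per_node_m)
    placeable = sum(free // pod_cpu_m for free in free_per_node_m)
    return total_free, placeable


# Four nodes with 1.5 cores free each; every pending pod requests 2 cores.
print(fragmentation([1500, 1500, 1500, 1500], 2000))
# Same total free capacity, but consolidated onto fewer nodes.
print(fragmentation([1500, 1500, 3000], 2000))
```

Six cores are free in both fleets, yet the first cannot place a single pod; only the per-node figure matters to the filter phase.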
Real-world fleets
Borg (Google, 2015). The paper "Large-scale cluster management at Google with Borg" by Verma et al. describes ten years of running tens of thousands of machines through one scheduler. Kubernetes inherits Borg's DNA almost line for line — pods are jobs-with-tasks, kube-scheduler is borgmaster's scheduler, kubelet is borglet.
Airbnb on Kubernetes. Public engineering posts describe a fleet of ~7,000 nodes across multiple clusters, Karpenter for autoscaling, custom scheduler plugins for cost-aware placement and spot-vs-on-demand routing.
Adobe and Spotify. Both run multi-thousand-node clusters and have written about scheduler tuning: pre-filter caching, custom priority functions for compliance (workloads tagged with PII can only land on a certain class of nodes), and the operational complexity of multi-scheduler topologies.
Tuning knobs that matter
- Set requests honestly. The number the scheduler schedules on. Use VPA recommendations or actual usage data; don't copy-paste from someone else's manifest.
- Pick a default strategy. Binpack saves money on autoscaled fleets (fewer nodes, denser packing). Spread improves HA. Most production clusters want spread by zone + binpack within zone.
- Topology spread, not affinity. For "spread my replicas evenly", use topologySpreadConstraints. Anti-affinity is older, slower, and the rule semantics confuse people.
- PriorityClass everywhere. Set explicit priorities on every workload. Batch < default < production < critical. Document the ladder.
- PodDisruptionBudget on every Deployment. Otherwise autoscaler scale-down or node upgrades can take all your replicas down at once.
- Karpenter over CA if you can. Faster scale-up, better packing, less node-group bookkeeping. Migration is non-trivial but pays back within months on a fleet of meaningful size.
Further reading
- Verma et al., "Large-scale cluster management at Google with Borg" (EuroSys 2015). The paper Kubernetes descended from. Read sections 2 and 3 for the priority model and the scheduling algorithm.
- kube-scheduler concept docs. The Filter / Score split, plugin extension points, and the framework architecture.
- Scheduling framework reference. All the extension points with rationale.
- ECS task placement guide. Strategies, constraints, attribute syntax.
- Karpenter concepts. Provisioners, node pools, consolidation, the disruption controller.
- k8s rollout simulator. What happens after the scheduler places a pod — Deployment, ReplicaSet, rolling update.
- Pod eviction simulator. Pressure-based and priority-based eviction, kubelet's eviction manager.
- AWS containers codex. ECS, EKS, Fargate, and the trade-offs between them.