CPU Scheduler Simulator: four ways to share one core.
A CPU scheduler decides which ready job runs next on a core and for how long. Here five jobs share one CPU under four policies. FCFS runs them in arrival order to completion. RR rotates them in slices. Priority picks the most important one and starves the rest. CFS tracks how much CPU each one has used and always picks the one that's used least. Same workload, different stories.
| id | arrival | burst | rem | prio | nice | vrt | wait | done |
|---|---|---|---|---|---|---|---|---|
| A | 0 | 6 | 6 | 3 | 0 | 0.0 | 0 | — |
| B | 1 | 3 | 3 | 1 | -5 | 0.0 | 0 | — |
| C | 2 | 8 | 8 | 4 | 5 | 0.0 | 0 | — |
| D | 3 | 4 | 4 | 2 | 0 | 0.0 | 0 | — |
| E | 6 | 5 | 5 | 5 | 10 | 0.0 | 0 | — |
The jobs table is the workload — five tasks (A–E) with fixed arrival times and burst lengths, plus a priority and a nice value the relevant algorithms read. The Gantt strip below it is the single CPU's history: one coloured cell per tick, showing which job held the processor at that moment. The meta row tracks the running average wait, and under Round Robin you also get the live ready queue with the current slice counter.
Run FCFS first and watch the convoy effect: short job B sits idle for five ticks behind the long job A, dragging the average wait up. Switch to RR and the strip turns into stripes as jobs trade the CPU every quantum, and B finishes far sooner. The one to sit with is Priority versus CFS on the same five jobs. Priority lets the low-priority job E wait while higher ranks keep cutting in; CFS, picking whichever job has accumulated the least weighted runtime, never lets anyone starve. Same workload, opposite outcomes for the unlucky job.
Throughput, latency, fairness — pick two
Every scheduler in history is a different point on this triangle.
A scheduler exists because you have more runnable processes than CPUs. The question it answers — "which one next, and for how long?" — has no universally right answer. The trade is between three goals that conflict in pairs. Throughput wants long uninterrupted runs that keep caches hot and the pipeline full. Latency wants short bursts so interactive tasks respond quickly. Fairness wants no task ignored for too long. Optimise for any one and you lose the other two.
FCFS optimises throughput at the cost of everything else — once a long job is running,
short jobs queue behind it (the "convoy effect"). Round Robin optimises latency by
preempting every quantum ticks, but pays in context-switch overhead and loses
cache locality. Strict priority guarantees response time for important tasks and lets
everything else starve. CFS picks a fourth corner: weighted fairness, with throughput
recovered by tunable batch parameters and latency capped by a minimum sched granularity.
Every real-world scheduler — Windows NT, FreeBSD ULE, macOS XNU, mainframe lottery
schedulers — is a variant on the same dial-tuning exercise.
vruntime is the only number that matters
How Linux replaced fifteen heuristics with one tree.
Before CFS, Linux had the O(1) scheduler — separate active and expired arrays, dynamic
priority adjustment for interactivity, sleep-bonus heuristics, and a fistful of magic
constants tuned empirically. Ingo Molnár's CFS, merged in 2007, threw that out and
replaced it with a single data structure: a red-black tree of runnable tasks keyed by
vruntime, the virtual runtime each task has accumulated.
Each tick a task runs, its vruntime advances by delta_exec / weight, where
weight is derived from the task's nice value through a lookup table. A
nice-0 task has weight 1024; nice -5 has roughly 3,121; nice +5 has roughly 335. A
nice-5 task's vruntime climbs three times as fast as a nice-0 task's per tick of real
runtime, so the scheduler will pick the nice-0 task three times as often once they're
both equally "behind." The leftmost node in the tree is always next to run.
Starvation becomes impossible: every task accumulates vruntime monotonically when it
runs, so the longest-waiting task is always the smallest in the tree. Fairness becomes
tunable: cgroup cpu.weight scales the effective weight of an entire group,
letting you say "this container gets 4× the CPU of that one" with one number. The
downside, mostly cosmetic: deciding "is this task interactive?" is harder than under the
old heuristic-driven scheduler, so very latency-sensitive workloads sometimes still pick
SCHED_FIFO or SCHED_RR (the real-time classes) to bypass CFS entirely.
The bugs the scheduler causes
Three production-debugging patterns whose root cause lives here.
Priority inversion. Low-priority task holds a lock. High-priority task waits on the lock. Medium-priority task preempts the low-priority one and runs forever. The high-priority task is effectively blocked by the medium one. Fixed by priority inheritance (POSIX mutexes can request it) or priority ceilings. Mars Pathfinder famously rebooted itself in 1997 because of this.
Thundering herd at scheduler granularity. 1000 worker threads all sleep
on the same condition variable. The producer wakes "all" of them. They all wake, all race
for the lock, all but one lose, all go back to sleep. CPU spends most of its time on the
wake-and-sleep round trip. Fix: use pthread_cond_signal (wake one) or build
your own queue with explicit dispatch.
Noisy neighbour in containers. Two containers on the same node, both
using the default CFS class. One container is CPU-bound and well-behaved; the other goes
into an emergency state and spawns 10,000 threads. The CFS weights still favour each
cgroup equally, but the bad container's 10,000 threads collectively use far more CPU than
the good container's 4 threads do. Fix: cap with cpu.max (hard ceiling) not
just cpu.weight (relative share).