Scheduling
A modern Linux box runs tens of thousands of threads on a few dozen cores. The scheduler is the kernel subsystem that decides, at every preemption point, which one runs next, for how long, and on which CPU. It is mostly invisible until it isn't, and then it shapes tail latency, throughput, and power as much as any other piece of the kernel.
More runnable threads than CPUs.
The scheduling problem in one sentence: at any moment there are usually more threads in
the R state than there are CPUs to run them. The scheduler has to pick a
subset and rotate through the rest. The choices it makes optimise for a set of goals
that constantly conflict with each other.
Throughput: finish as much work per second as possible. Latency: let interactive and request-handling threads run soon after they wake. Fairness: every runnable thread gets a turn proportional to its weight. Energy: keep cores parked when load is low, race-to-idle when it's high. Cache and NUMA locality: keep a thread on the CPU whose L1/L2 and memory node it last touched. Real-time guarantees: when a thread declares a deadline, hit it.
Maximising any one degrades the others. A scheduler tuned purely for throughput will migrate aggressively and ignore wakeup latency; one tuned purely for latency will leave caches cold and waste cycles. The history of Linux scheduling is essentially the history of trying to land the right compromise as hardware and workloads change.
One framing helps before any of the algorithms. The unit the scheduler actually moves
around is not a process but a thread.
A process is the
container for memory and file descriptors; the threads inside it are the things that hold a
register set and a program counter and therefore the things that can be placed on a CPU.
On Linux the two are the same kind of kernel object, a task_struct, and the
scheduler treats a single-threaded process and one thread of a many-threaded process
identically. When this page says "task" it means one schedulable thread.
What the scheduler actually does.
Strip the subsystem down and the job is small. On each CPU, whenever the current thread can
no longer keep the core, the kernel calls schedule(). That function asks the
scheduling classes, in priority order, "do you have a runnable task for this CPU?" The first
class that says yes hands back a thread, the kernel saves the outgoing thread's registers,
loads the incoming one's, switches the page-table root if the address space changed, and
returns into the new thread. That swap is a context switch, and the choice
of which thread to return into is the whole question this page is about.
A thread gives up the CPU for one of two reasons. It either blocks on its own,
it called read() on a socket with no data, took a lock that was held, or slept
on a timer, or it is preempted, forced off by the kernel even though it still had
work to do. The first is voluntary and as old as multitasking. The second is what makes a
modern system feel responsive, and it depends on a piece of hardware that fires whether the
running thread likes it or not.
Preemptive vs cooperative, and the timer tick.
Early systems and a few embedded kernels are cooperative: a thread keeps the CPU until it calls a yield function or blocks. This is cheap and predictable, but one buggy loop with no yield freezes the whole machine, because nothing can take the core back. Classic Mac OS and Windows 3.x worked this way, and anyone who lived through them remembers the result. The same trade-off shows up inside language runtimes: a userspace scheduler that can only switch at function calls or allocations is cooperative by nature, which is exactly the problem the Go scheduler spent years solving before it gained asynchronous preemption.
Linux and every other general-purpose kernel are preemptive: the kernel can
stop a thread at almost any point and run a different one. The hardware that makes this
possible is a programmable timer that fires a periodic interrupt, the timer tick. On
each tick the kernel updates accounting, charges the running thread for the time it used, and
checks whether it should be preempted. If a higher-priority thread woke up, or the current
thread has used its slice, the tick sets a flag (TIF_NEED_RESCHED) and the kernel
calls schedule() on the way back to userspace.
The tick rate is set by CONFIG_HZ: 100, 250, or 1000 ticks per second on most
builds. A higher rate means finer-grained preemption and snappier interactivity, at the cost
of more interrupts and slightly worse throughput. Servers often run a lower HZ or
go tickless with nohz_full, which stops the periodic tick entirely on
CPUs that have exactly one runnable thread. There is nothing to switch to, so the interrupt
is pure overhead. That is why the low-latency tuning later on pairs isolcpus with
nohz_full: a dedicated core with no tick is a core that runs your thread without
ever being asked to reconsider.
Slices, quanta, and starvation.
A time slice (or quantum) is how long a thread is allowed to run before the scheduler considers handing the CPU elsewhere. Too short and the machine spends its cycles context-switching instead of doing work; too long and interactive threads wait behind a batch job and the system feels sticky. The whole reason CFS and EEVDF compute slices from a target latency rather than using a fixed number is that the right slice depends on how many threads are competing.
The failure mode at the other end is starvation: a runnable thread that never
gets the CPU because something else always outranks it. A strict-priority scheduler starves
low-priority work whenever high-priority work is available; a pure shortest-job-first scheme
starves long jobs when short ones keep arriving. Fair schedulers avoid starvation almost by
construction. A thread that hasn't run accumulates the least virtual time and therefore sorts
to the front. The real-time classes do not, which is exactly why Linux throttles runaway
real-time threads with sched_rt_runtime_us, covered below. Starvation is not an
exotic bug; it is the default outcome of any greedy policy that lacks an aging mechanism.
Priorities and nice.
Not every thread deserves an equal share, and the knob for saying so on normal tasks is the nice value, an integer from -20 (greediest) to +19 (most generous). The name is literal: a high nice value means the thread is being nice to its neighbours, asking for less. Nice maps to an internal weight on a roughly geometric scale. Each step of one nice level changes a thread's CPU share by about 10%, so a nice -5 thread gets roughly ten times the CPU of a nice +5 thread under contention, and a difference of one level barely registers when the machine is idle.
This is the same arithmetic the cgroup CPU controller uses on whole groups, just applied per
thread. It is also proportional, not absolute: nice only matters when threads are competing. A
nice +19 background job still gets the full CPU if nothing else wants it. That property is what
makes nice safe to hand out liberally. It cannot make an idle machine slower, only redistribute
a busy one. The companion ionice does the same job for disk I/O, and the two are
often set together for batch work that should yield on every axis.
Classic algorithms: round robin to MLFQ.
It helps to see the ideas CFS inherited. The simplest fair scheme is round robin: keep a queue, give the head thread one fixed quantum, then move it to the back and take the next. Every thread gets an equal turn and nothing starves, but a quantum that suits a batch job is far too coarse for an interactive one, and round robin has no way to tell them apart.
The classic answer is the multilevel feedback queue (MLFQ), which is the scheduler most operating-systems courses teach and the one Windows and the older Unix schedulers approximate. It keeps several queues at different priorities. New threads start high. A thread that uses its whole quantum is assumed to be CPU-bound and gets bumped down a level; a thread that blocks before its quantum is up is assumed to be interactive and stays high. The scheduler always runs the highest non-empty queue, round-robin within it. The effect is that I/O-bound, interactive threads float to the top and get served fast, while compute-heavy threads sink and run in the gaps.
MLFQ works but is full of tuning parameters: how many levels, what quantum each, how often to
reset a sunk thread back to the top so it doesn't starve. Get the gaming protection wrong and a
thread can keep itself high by blocking for a microsecond just before each quantum ends. CFS's
insight was to throw out the discrete levels and replace them with one continuous quantity,
vruntime, that captures the same intent (run whoever has had the least) without
any of the level-tuning. The classic algorithms are still worth knowing, because the goals they
were chasing are the goals every scheduler still trades against.
What a context switch costs.
Switching threads is not free, and the direct cost is the smaller half. Saving and restoring registers and running the scheduler is a few hundred nanoseconds. The expensive part is what the new thread does to the caches. The outgoing thread leaves the L1 and L2 full of its own data; the incoming thread starts cold and pays cache-miss latency, tens to hundreds of cycles per miss, until it has refilled them. If the switch also changes address space, the TLB may be flushed too, and the new thread re-walks page tables for its early accesses.
This is why the scheduler cares so much about where a thread runs and not only whether it runs. Keeping a thread on the CPU it last used means its data may still be warm; bouncing it to an idle core balances load but throws that warmth away. The whole migration policy described later is an attempt to weigh these costs against each other, and it is also why context-switch-heavy designs, a thread per connection or a lock everyone contends, perform worse than their CPU graphs suggest. The cycles are real but they hide in cache misses, not in the scheduler's own runtime.
CFS — the completely fair scheduler.
From kernel 2.6.23 (October 2007) through 6.5 the default
scheduling class for normal tasks was the Completely Fair Scheduler, written by
Ingo Molnár as a response to Con Kolivas's BFS work and the limits of the
O(1) scheduler before it. CFS models an idealised multitasking CPU where every runnable
thread gets exactly an equal share of the processor.
Each task carries a vruntime, a weighted count of how much CPU it has
consumed. When a task runs for some wall time Δ, its vruntime
advances by Δ · NICE_0_LOAD / task.weight, so a task with a heavier weight
(lower nice value) accumulates virtual time more slowly. The scheduler keeps all runnable
tasks on a per-CPU red-black tree sorted by vruntime, and
the next task to run is always the leftmost, the one that has had the least so far.
Time slices aren't fixed. A target scheduling latency (default 6 ms, scaled by
sched_latency_ns) is divided among all runnable tasks weighted by their nice
values; the result is each task's slice. With three equal-weight tasks the slice is 2 ms;
with thirty it's 200 µs (floored at sched_min_granularity_ns, default 750 µs,
to avoid context-switch storms).
EEVDF — the 6.6+ replacement.
In kernel 6.6 (October 2023) Peter Zijlstra replaced CFS's selection logic with EEVDF, Earliest Eligible Virtual Deadline First, a scheme first proposed by Stoica and Abdel-Wahab in 1996. The change wasn't a rewrite of the whole subsystem. Task groups, runqueues, load tracking, and the red-black tree all stayed. What changed was how the next task is picked.
Under EEVDF each task has a virtual deadline equal to its eligible time plus its
requested slice divided by its weight. The scheduler picks the eligible task with the
earliest deadline. Tasks that want short slices get short deadlines and run sooner; tasks
that want long slices wait longer but run for longer. This gives the system a single knob
(sched_attr.sched_runtime) for trading latency against context-switch cost,
something CFS lacked.
vruntime,
which meant a chatty server thread that woke briefly and slept again would compete on
equal terms with a long-running batch task. To keep wakeup latency low CFS bolted on
heuristics: WAKEUP_PREEMPT, GENTLE_FAIR_SLEEPERS,
min_vruntime clamps, that interacted in subtle ways and made tuning a black
art. EEVDF makes the latency-vs-throughput trade explicit per task. Brendan Gregg's
post-6.6 analyses show wakeup latency improvements of 10–30% on Netflix's frontends.Real-time classes — FIFO, RR, DEADLINE.
Above CFS/EEVDF sit the real-time scheduling classes. They are strict: any runnable RT task preempts any normal task instantly, regardless of weights. Linux exposes three:
| Class | Policy | Behaviour |
|---|---|---|
SCHED_FIFO | 1–99 priority | Runs until it blocks or yields. Same-priority tasks queue FIFO. |
SCHED_RR | 1–99 priority | Like FIFO but same-priority tasks round-robin every sched_rr_timeslice_ms (default 100 ms). |
SCHED_DEADLINE | (runtime, deadline, period) | EDF — admission-controlled, gives each task a CPU budget per period. |
SCHED_FIFO and SCHED_RR are blunt instruments. A FIFO thread
that never yields will starve every normal task on its CPU forever. The kernel's
sched_rt_runtime_us throttle (default 950 ms per 1 s) exists exactly because
people kept shipping FIFO threads with infinite loops. SCHED_DEADLINE, added
in 3.14, is the modern answer: the task declares a runtime and deadline, the kernel admits
it only if it fits, and it gets exactly that much CPU per period, no more, no less.
The classic real-time pitfall is priority inversion: a high-priority
task waits on a lock held by a low-priority task that's been preempted by a medium one.
Linux's rt_mutex implements priority inheritance to break the cycle, which
is why PTHREAD_PRIO_INHERIT mutexes are mandatory in any serious RT app.
Multi-core, NUMA, and migration.
Linux runs one runqueue per CPU. Wakeups go to a CPU chosen by
select_task_rq_fair(), which considers idleness, cache topology, the waker's
CPU, and NUMA distance. Periodically the load balancer (load_balance()) walks
up the scheduling-domain hierarchy (SMT siblings → core complex → LLC → NUMA node) and
moves tasks from busy queues to idle ones.
The central trade-off: migrating a task balances load but costs you the L1/L2 cache it built up on its old CPU, plus a TLB flush, plus possibly remote-memory access if the migration crosses NUMA nodes. On a modern Xeon a cold L2 can cost hundreds of cycles per miss for the first millisecond. CFS/EEVDF therefore prefer not to migrate when the imbalance is small, and the kernel tracks cache hot tasks (recently run on a CPU) and prefers to leave them in place.
taskset,
sched_setaffinity(), or cpuset cgroups eliminates migration
cost but disables the load balancer for that task. If its CPU is busy with kernel work
or another pinned thread, it just waits. Pin only when you can give the thread the CPU
exclusively (isolcpus + nohz_full), or for short-lived bursts
where cache locality dominates. For general server workloads the default load balancer
beats hand-tuning almost every time.On NUMA systems the scheduler also considers memory locality. numa_balancing
(enabled by default on multi-node hosts since 3.13) periodically samples a task's pages,
records which node they live on, and migrates either the task or its pages to keep them
co-located. It's the reason numactl --membind sometimes makes things slower:
you've pinned the memory but not the thread, so the scheduler keeps moving the thread
away from it.
cgroup v2 — how containers get their share.
Above the per-task scheduler sits the cgroup v2 CPU controller. It implements
weighted-fair-queueing across cgroups: every cgroup has a cpu.weight (default
100, range 1–10 000), and within a parent, children share CPU in proportion to their
weights. It's the same maths as nice values, applied to groups of tasks.
Hard ceilings come from cpu.max, a (quota, period) pair. 200000
100000 means the cgroup gets at most 200 ms of CPU per 100 ms wall window, i.e.
2 CPUs worth. This is what Kubernetes cpu: "2" compiles to. The catch is
throttling: if the cgroup uses its quota in the first 40 ms of the period it sleeps for
60 ms, and tail latency spikes accordingly. See the
containers deep-dive for the full chain from a Pod spec to cpu.max.
# cgroup v2 — weighted fair share across two services
$ cd /sys/fs/cgroup/myservice
$ cat cpu.weight
100
# Give web 4x the CPU share of batch under contention
$ echo 400 > /sys/fs/cgroup/web/cpu.weight
$ echo 100 > /sys/fs/cgroup/batch/cpu.weight
# Hard cap web at 2 CPUs (200 ms per 100 ms)
$ echo "200000 100000" > /sys/fs/cgroup/web/cpu.max
# Throttling stats — if nr_throttled is climbing, you're capped
$ cat /sys/fs/cgroup/web/cpu.stat
usage_usec 4523109832
nr_periods 9842
nr_throttled 312
throttled_usec 18204000Latency-aware knobs.
Linux exposes several lower-key policies for cases where the default isn't quite right:
SCHED_BATCH— like normal but the scheduler assumes the task is non-interactive and won't try to give it a wakeup latency boost. Good for background analytics jobs that shouldn't compete with request-handling threads.SCHED_IDLE— lowest possible priority, runs only when nothing else wants the CPU. Use for opportunistic work like indexers and prefetchers.isolcpus=2-7at boot removes CPUs 2–7 from the general scheduler's domain; nothing runs there unless explicitly pinned. Combined withnohz_full=2-7andrcu_nocbs=2-7you get tickless cores — the basis of every low-latency trading stack and DPDK pipeline.irqaffinityin/proc/irq/N/smp_affinitypins hardware interrupts to specific CPUs. NIC RSS queues usually want their IRQs on the same CPUs as the userspace threads consuming them.- Kernel preemption modes —
PREEMPT_NONE,PREEMPT_VOLUNTARY,PREEMPT,PREEMPT_RT. Mainline 6.x shipsPREEMPT_DYNAMIC, switchable at boot.PREEMPT_RT(merged piecewise from 5.x onwards, mostly complete in 6.12) makes nearly every kernel codepath preemptible and turns spinlocks into sleeping mutexes.
Other operating systems.
Schedulers diverge sharply across kernels. Windows uses a priority-based
SMP scheduler with 32 levels (0–31, the top 16 reserved for real-time) and short-term
priority boosts on I/O completion, foreground focus, and starvation, a heuristic system
that's existed largely unchanged since NT 3.5. macOS XNU is a hybrid:
the Mach layer schedules kernel threads, BSD wraps them as processes, and a layered
scheduler implements Apple's Quality of Service classes
(USER_INTERACTIVE, UTILITY, BACKGROUND) on top.
FreeBSD's ULE scheduler, replaced 4BSD as default in 7.1, uses two per-CPU runqueues (current and next) plus an interactivity score that promotes I/O-bound threads, conceptually closer to early CFS than to EEVDF. BeOS/Haiku famously implemented per-thread pervasive multithreading and a scheduler optimised for desktop responsiveness; many of its ideas (per-CPU runqueues, cheap context switches) eventually found their way into Linux. Brendan Gregg's Systems Performance chapter 6 has a side-by-side comparison if you want the full picture.
Debugging scheduling.
Most scheduling pathologies show up as either uneven load (some CPUs pegged, others idle),
migration storms (high sched.migrations), or runqueue contention (long
wakeup latency). The standard kit:
perf sched record+perf sched latency— per-task wakeup and on-CPU times.perf sched mapdraws a per-CPU timeline.schedviz(Google) andkernelshark— visual ftrace timelines, useful for finding the one thread that's blocking everything else.chrt— read or set scheduling class and priority for a PID./proc/sched_debug— the kitchen sink. Per-CPU runqueue contents, vruntime, load weights, group hierarchy.bpftraceone-liners on thesched:tracepoints —sched_switch,sched_wakeup,sched_migrate_task.
# Move PID 1234 to real-time round-robin priority 50
$ sudo chrt -r -p 50 1234
# Inspect current policy
$ chrt -p 1234
pid 1234's current scheduling policy: SCHED_RR
pid 1234's current scheduling priority: 50
# Run a command niced down so it loses every contention
$ nice -n 19 ./batch-job
# Per-CPU runqueue depths and migration counts
$ grep -E 'cpu#|nr_running|nr_migrations' /proc/sched_debug | head -40
# Live wakeup-latency histogram (eBPF)
$ sudo bpftrace -e 'tracepoint:sched:sched_wakeup { @[comm] = hist(nsecs - args->prev_state); }'Production mistakes.
The same four or five patterns account for most scheduling-related incidents:
- Oversubscribed containers with no CPU limits. A noisy neighbour with
cpu.weight = 100can hog every CPU on the node and starve everything else of latency, even thoughtopshows no throttling. Always set both weights and caps on shared nodes. - SCHED_FIFO threads that never yield. An infinite loop at FIFO 99
will lock up a CPU until
sched_rt_runtime_usthrottles it, and if the operator turned the throttle off ("for latency"), the only way out is the watchdog. Always pair RT priority with a sleep or condition variable. cpuset.cpus.partitionmisconfig. cgroup v2 partitions let you carve out exclusive CPU sets, but a typo'drootpartition can steal CPUs from the system root and brick everything not in the partition. Validate withcat /sys/fs/cgroup/cpuset.cpus.effectivebefore declaring victory.- NIC IRQs all on CPU 0. The default IRQ affinity on most distros
pins every NIC queue to CPU 0; under load that one CPU softirq-spins at 100% while
the others coast. Fix with
set_irq_affinity.shfrom the NIC vendor or theirqbalancedaemon's hints. - Pinning without isolation. Pinning a thread to CPU 5 doesn't stop
the kernel from scheduling other tasks there. For dedicated cores you need
isolcpus=5at boot or acpusetcgroup withcpuset.cpus.partition=isolated.
Further reading.
- Brendan Gregg — EEVDF and Linux scheduler analysis — the most accessible writeup of the 6.6 transition, with flame graphs.
- Lozi et al. — The Linux Scheduler: A Decade of Wasted Cores (EuroSys 2016) — the paper that documented four real CFS bugs costing 13–24% of throughput, and prompted years of fixes.
- Stoica & Abdel-Wahab — Earliest Eligible Virtual Deadline First (1996) — the proposal Peter Zijlstra reached for when CFS ran out of road.
- Jeff Roberson — ULE: A Modern Scheduler for FreeBSD — the design doc for FreeBSD's default scheduler. Useful contrast with CFS.
- kernel.org — CFS design document and the SCHED_DEADLINE document — the source of truth for what the kernel actually does.
- LWN — An EEVDF CPU scheduler for Linux — Jonathan Corbet's pre-merge writeup of EEVDF, with worked examples.
- Semicolony — Thread pools — how userspace adapts to whatever the kernel scheduler is doing.
- Semicolony — Containers — the cgroup v2 CPU controller from the container side.