03 / 10

Internals / 03

Scheduling

A modern Linux box runs tens of thousands of threads on a few dozen cores. The scheduler is the kernel subsystem that decides, at every preemption point, which one runs next, for how long, and on which CPU. It is mostly invisible until it isn't, and then it shapes tail latency, throughput, and power as much as any other piece of the kernel.

More runnable threads than CPUs.

The scheduling problem in one sentence: at any moment there are usually more threads in the R state than there are CPUs to run them. The scheduler has to pick a subset and rotate through the rest. The choices it makes optimise for a set of goals that constantly conflict with each other.

Throughput: finish as much work per second as possible. Latency: let interactive and request-handling threads run soon after they wake. Fairness: every runnable thread gets a turn proportional to its weight. Energy: keep cores parked when load is low, race-to-idle when it's high. Cache and NUMA locality: keep a thread on the CPU whose L1/L2 and memory node it last touched. Real-time guarantees: when a thread declares a deadline, hit it.

Maximising any one degrades the others. A scheduler tuned purely for throughput will migrate aggressively and ignore wakeup latency; one tuned purely for latency will leave caches cold and waste cycles. The history of Linux scheduling is essentially the history of trying to land the right compromise as hardware and workloads change.

One framing helps before any of the algorithms. The unit the scheduler actually moves around is not a process but a thread. A process is the container for memory and file descriptors; the threads inside it are the things that hold a register set and a program counter and therefore the things that can be placed on a CPU. On Linux the two are the same kind of kernel object, a task_struct, and the scheduler treats a single-threaded process and one thread of a many-threaded process identically. When this page says "task" it means one schedulable thread.

What the scheduler actually does.

Strip the subsystem down and the job is small. On each CPU, whenever the current thread can no longer keep the core, the kernel calls schedule(). That function asks the scheduling classes, in priority order, "do you have a runnable task for this CPU?" The first class that says yes hands back a thread, the kernel saves the outgoing thread's registers, loads the incoming one's, switches the page-table root if the address space changed, and returns into the new thread. That swap is a context switch, and the choice of which thread to return into is the whole question this page is about.

A thread gives up the CPU for one of two reasons. It either blocks on its own, it called read() on a socket with no data, took a lock that was held, or slept on a timer, or it is preempted, forced off by the kernel even though it still had work to do. The first is voluntary and as old as multitasking. The second is what makes a modern system feel responsive, and it depends on a piece of hardware that fires whether the running thread likes it or not.

Preemption in one picture. The timer tick interrupts a running thread, the kernel runs the scheduler, and a different thread is loaded. Voluntary blocking takes the same kernel path without the timer.

Preemptive vs cooperative, and the timer tick.

Early systems and a few embedded kernels are cooperative: a thread keeps the CPU until it calls a yield function or blocks. This is cheap and predictable, but one buggy loop with no yield freezes the whole machine, because nothing can take the core back. Classic Mac OS and Windows 3.x worked this way, and anyone who lived through them remembers the result. The same trade-off shows up inside language runtimes: a userspace scheduler that can only switch at function calls or allocations is cooperative by nature, which is exactly the problem the Go scheduler spent years solving before it gained asynchronous preemption.

Linux and every other general-purpose kernel are preemptive: the kernel can stop a thread at almost any point and run a different one. The hardware that makes this possible is a programmable timer that fires a periodic interrupt, the timer tick. On each tick the kernel updates accounting, charges the running thread for the time it used, and checks whether it should be preempted. If a higher-priority thread woke up, or the current thread has used its slice, the tick sets a flag (TIF_NEED_RESCHED) and the kernel calls schedule() on the way back to userspace.

The tick rate is set by CONFIG_HZ: 100, 250, or 1000 ticks per second on most builds. A higher rate means finer-grained preemption and snappier interactivity, at the cost of more interrupts and slightly worse throughput. Servers often run a lower HZ or go tickless with nohz_full, which stops the periodic tick entirely on CPUs that have exactly one runnable thread. There is nothing to switch to, so the interrupt is pure overhead. That is why the low-latency tuning later on pairs isolcpus with nohz_full: a dedicated core with no tick is a core that runs your thread without ever being asked to reconsider.

Slices, quanta, and starvation.

A time slice (or quantum) is how long a thread is allowed to run before the scheduler considers handing the CPU elsewhere. Too short and the machine spends its cycles context-switching instead of doing work; too long and interactive threads wait behind a batch job and the system feels sticky. The whole reason CFS and EEVDF compute slices from a target latency rather than using a fixed number is that the right slice depends on how many threads are competing.

The failure mode at the other end is starvation: a runnable thread that never gets the CPU because something else always outranks it. A strict-priority scheduler starves low-priority work whenever high-priority work is available; a pure shortest-job-first scheme starves long jobs when short ones keep arriving. Fair schedulers avoid starvation almost by construction. A thread that hasn't run accumulates the least virtual time and therefore sorts to the front. The real-time classes do not, which is exactly why Linux throttles runaway real-time threads with sched_rt_runtime_us, covered below. Starvation is not an exotic bug; it is the default outcome of any greedy policy that lacks an aging mechanism.

Priorities and nice.

Not every thread deserves an equal share, and the knob for saying so on normal tasks is the nice value, an integer from -20 (greediest) to +19 (most generous). The name is literal: a high nice value means the thread is being nice to its neighbours, asking for less. Nice maps to an internal weight on a roughly geometric scale. Each step of one nice level changes a thread's CPU share by about 10%, so a nice -5 thread gets roughly ten times the CPU of a nice +5 thread under contention, and a difference of one level barely registers when the machine is idle.

This is the same arithmetic the cgroup CPU controller uses on whole groups, just applied per thread. It is also proportional, not absolute: nice only matters when threads are competing. A nice +19 background job still gets the full CPU if nothing else wants it. That property is what makes nice safe to hand out liberally. It cannot make an idle machine slower, only redistribute a busy one. The companion ionice does the same job for disk I/O, and the two are often set together for batch work that should yield on every axis.

Classic algorithms: round robin to MLFQ.

It helps to see the ideas CFS inherited. The simplest fair scheme is round robin: keep a queue, give the head thread one fixed quantum, then move it to the back and take the next. Every thread gets an equal turn and nothing starves, but a quantum that suits a batch job is far too coarse for an interactive one, and round robin has no way to tell them apart.

The classic answer is the multilevel feedback queue (MLFQ), which is the scheduler most operating-systems courses teach and the one Windows and the older Unix schedulers approximate. It keeps several queues at different priorities. New threads start high. A thread that uses its whole quantum is assumed to be CPU-bound and gets bumped down a level; a thread that blocks before its quantum is up is assumed to be interactive and stays high. The scheduler always runs the highest non-empty queue, round-robin within it. The effect is that I/O-bound, interactive threads float to the top and get served fast, while compute-heavy threads sink and run in the gaps.

A multilevel feedback queue. The scheduler runs the highest non-empty level; threads that burn a full quantum sink, threads that block early stay near the top. CFS replaced the discrete levels with a single continuous measure, vruntime.

MLFQ works but is full of tuning parameters: how many levels, what quantum each, how often to reset a sunk thread back to the top so it doesn't starve. Get the gaming protection wrong and a thread can keep itself high by blocking for a microsecond just before each quantum ends. CFS's insight was to throw out the discrete levels and replace them with one continuous quantity, vruntime, that captures the same intent (run whoever has had the least) without any of the level-tuning. The classic algorithms are still worth knowing, because the goals they were chasing are the goals every scheduler still trades against.

What a context switch costs.

Switching threads is not free, and the direct cost is the smaller half. Saving and restoring registers and running the scheduler is a few hundred nanoseconds. The expensive part is what the new thread does to the caches. The outgoing thread leaves the L1 and L2 full of its own data; the incoming thread starts cold and pays cache-miss latency, tens to hundreds of cycles per miss, until it has refilled them. If the switch also changes address space, the TLB may be flushed too, and the new thread re-walks page tables for its early accesses.

This is why the scheduler cares so much about where a thread runs and not only whether it runs. Keeping a thread on the CPU it last used means its data may still be warm; bouncing it to an idle core balances load but throws that warmth away. The whole migration policy described later is an attempt to weigh these costs against each other, and it is also why context-switch-heavy designs, a thread per connection or a lock everyone contends, perform worse than their CPU graphs suggest. The cycles are real but they hide in cache misses, not in the scheduler's own runtime.

CFS — the completely fair scheduler.

From kernel 2.6.23 (October 2007) through 6.5 the default scheduling class for normal tasks was the Completely Fair Scheduler, written by Ingo Molnár as a response to Con Kolivas's BFS work and the limits of the O(1) scheduler before it. CFS models an idealised multitasking CPU where every runnable thread gets exactly an equal share of the processor.

Each task carries a vruntime, a weighted count of how much CPU it has consumed. When a task runs for some wall time Δ, its vruntime advances by Δ · NICE_0_LOAD / task.weight, so a task with a heavier weight (lower nice value) accumulates virtual time more slowly. The scheduler keeps all runnable tasks on a per-CPU red-black tree sorted by vruntime, and the next task to run is always the leftmost, the one that has had the least so far.

Time slices aren't fixed. A target scheduling latency (default 6 ms, scaled by sched_latency_ns) is divided among all runnable tasks weighted by their nice values; the result is each task's slice. With three equal-weight tasks the slice is 2 ms; with thirty it's 200 µs (floored at sched_min_granularity_ns, default 750 µs, to avoid context-switch storms).

EEVDF — the 6.6+ replacement.

In kernel 6.6 (October 2023) Peter Zijlstra replaced CFS's selection logic with EEVDF, Earliest Eligible Virtual Deadline First, a scheme first proposed by Stoica and Abdel-Wahab in 1996. The change wasn't a rewrite of the whole subsystem. Task groups, runqueues, load tracking, and the red-black tree all stayed. What changed was how the next task is picked.

Under EEVDF each task has a virtual deadline equal to its eligible time plus its requested slice divided by its weight. The scheduler picks the eligible task with the earliest deadline. Tasks that want short slices get short deadlines and run sooner; tasks that want long slices wait longer but run for longer. This gives the system a single knob (sched_attr.sched_runtime) for trading latency against context-switch cost, something CFS lacked.

Why CFS was replaced. CFS picked strictly by lowest vruntime, which meant a chatty server thread that woke briefly and slept again would compete on equal terms with a long-running batch task. To keep wakeup latency low CFS bolted on heuristics: WAKEUP_PREEMPT, GENTLE_FAIR_SLEEPERS, min_vruntime clamps, that interacted in subtle ways and made tuning a black art. EEVDF makes the latency-vs-throughput trade explicit per task. Brendan Gregg's post-6.6 analyses show wakeup latency improvements of 10–30% on Netflix's frontends.

Real-time classes — FIFO, RR, DEADLINE.

Above CFS/EEVDF sit the real-time scheduling classes. They are strict: any runnable RT task preempts any normal task instantly, regardless of weights. Linux exposes three:

Class	Policy	Behaviour
`SCHED_FIFO`	1–99 priority	Runs until it blocks or yields. Same-priority tasks queue FIFO.
`SCHED_RR`	1–99 priority	Like FIFO but same-priority tasks round-robin every `sched_rr_timeslice_ms` (default 100 ms).
`SCHED_DEADLINE`	(runtime, deadline, period)	EDF — admission-controlled, gives each task a CPU budget per period.

SCHED_FIFO and SCHED_RR are blunt instruments. A FIFO thread that never yields will starve every normal task on its CPU forever. The kernel's sched_rt_runtime_us throttle (default 950 ms per 1 s) exists exactly because people kept shipping FIFO threads with infinite loops. SCHED_DEADLINE, added in 3.14, is the modern answer: the task declares a runtime and deadline, the kernel admits it only if it fits, and it gets exactly that much CPU per period, no more, no less.

The classic real-time pitfall is priority inversion: a high-priority task waits on a lock held by a low-priority task that's been preempted by a medium one. Linux's rt_mutex implements priority inheritance to break the cycle, which is why PTHREAD_PRIO_INHERIT mutexes are mandatory in any serious RT app.

Multi-core, NUMA, and migration.

Linux runs one runqueue per CPU. Wakeups go to a CPU chosen by select_task_rq_fair(), which considers idleness, cache topology, the waker's CPU, and NUMA distance. Periodically the load balancer (load_balance()) walks up the scheduling-domain hierarchy (SMT siblings → core complex → LLC → NUMA node) and moves tasks from busy queues to idle ones.

The central trade-off: migrating a task balances load but costs you the L1/L2 cache it built up on its old CPU, plus a TLB flush, plus possibly remote-memory access if the migration crosses NUMA nodes. On a modern Xeon a cold L2 can cost hundreds of cycles per miss for the first millisecond. CFS/EEVDF therefore prefer not to migrate when the imbalance is small, and the kernel tracks cache hot tasks (recently run on a CPU) and prefers to leave them in place.

CPU pinning vs migration. Pinning with taskset, sched_setaffinity(), or cpuset cgroups eliminates migration cost but disables the load balancer for that task. If its CPU is busy with kernel work or another pinned thread, it just waits. Pin only when you can give the thread the CPU exclusively (isolcpus + nohz_full), or for short-lived bursts where cache locality dominates. For general server workloads the default load balancer beats hand-tuning almost every time.

On NUMA systems the scheduler also considers memory locality. numa_balancing (enabled by default on multi-node hosts since 3.13) periodically samples a task's pages, records which node they live on, and migrates either the task or its pages to keep them co-located. It's the reason numactl --membind sometimes makes things slower: you've pinned the memory but not the thread, so the scheduler keeps moving the thread away from it.

cgroup v2 — how containers get their share.

Above the per-task scheduler sits the cgroup v2 CPU controller. It implements weighted-fair-queueing across cgroups: every cgroup has a cpu.weight (default 100, range 1–10 000), and within a parent, children share CPU in proportion to their weights. It's the same maths as nice values, applied to groups of tasks.

Hard ceilings come from cpu.max, a (quota, period) pair. 200000 100000 means the cgroup gets at most 200 ms of CPU per 100 ms wall window, i.e. 2 CPUs worth. This is what Kubernetes cpu: "2" compiles to. The catch is throttling: if the cgroup uses its quota in the first 40 ms of the period it sleeps for 60 ms, and tail latency spikes accordingly. See the containers deep-dive for the full chain from a Pod spec to cpu.max.

# cgroup v2 — weighted fair share across two services
$ cd /sys/fs/cgroup/myservice
$ cat cpu.weight
100

# Give web 4x the CPU share of batch under contention
$ echo 400 > /sys/fs/cgroup/web/cpu.weight
$ echo 100 > /sys/fs/cgroup/batch/cpu.weight

# Hard cap web at 2 CPUs (200 ms per 100 ms)
$ echo "200000 100000" > /sys/fs/cgroup/web/cpu.max

# Throttling stats — if nr_throttled is climbing, you're capped
$ cat /sys/fs/cgroup/web/cpu.stat
usage_usec 4523109832
nr_periods 9842
nr_throttled 312
throttled_usec 18204000

Latency-aware knobs.

Linux exposes several lower-key policies for cases where the default isn't quite right:

SCHED_BATCH — like normal but the scheduler assumes the task is non-interactive and won't try to give it a wakeup latency boost. Good for background analytics jobs that shouldn't compete with request-handling threads.
SCHED_IDLE — lowest possible priority, runs only when nothing else wants the CPU. Use for opportunistic work like indexers and prefetchers.
isolcpus=2-7 at boot removes CPUs 2–7 from the general scheduler's domain; nothing runs there unless explicitly pinned. Combined with nohz_full=2-7 and rcu_nocbs=2-7 you get tickless cores — the basis of every low-latency trading stack and DPDK pipeline.
irqaffinity in /proc/irq/N/smp_affinity pins hardware interrupts to specific CPUs. NIC RSS queues usually want their IRQs on the same CPUs as the userspace threads consuming them.
Kernel preemption modes — PREEMPT_NONE, PREEMPT_VOLUNTARY, PREEMPT, PREEMPT_RT. Mainline 6.x ships PREEMPT_DYNAMIC, switchable at boot. PREEMPT_RT (merged piecewise from 5.x onwards, mostly complete in 6.12) makes nearly every kernel codepath preemptible and turns spinlocks into sleeping mutexes.

Other operating systems.

Schedulers diverge sharply across kernels. Windows uses a priority-based SMP scheduler with 32 levels (0–31, the top 16 reserved for real-time) and short-term priority boosts on I/O completion, foreground focus, and starvation, a heuristic system that's existed largely unchanged since NT 3.5. macOS XNU is a hybrid: the Mach layer schedules kernel threads, BSD wraps them as processes, and a layered scheduler implements Apple's Quality of Service classes (USER_INTERACTIVE, UTILITY, BACKGROUND) on top.

FreeBSD's ULE scheduler, replaced 4BSD as default in 7.1, uses two per-CPU runqueues (current and next) plus an interactivity score that promotes I/O-bound threads, conceptually closer to early CFS than to EEVDF. BeOS/Haiku famously implemented per-thread pervasive multithreading and a scheduler optimised for desktop responsiveness; many of its ideas (per-CPU runqueues, cheap context switches) eventually found their way into Linux. Brendan Gregg's Systems Performance chapter 6 has a side-by-side comparison if you want the full picture.

Debugging scheduling.

Most scheduling pathologies show up as either uneven load (some CPUs pegged, others idle), migration storms (high sched.migrations), or runqueue contention (long wakeup latency). The standard kit:

perf sched record + perf sched latency — per-task wakeup and on-CPU times. perf sched map draws a per-CPU timeline.
schedviz (Google) and kernelshark — visual ftrace timelines, useful for finding the one thread that's blocking everything else.
chrt — read or set scheduling class and priority for a PID.
/proc/sched_debug — the kitchen sink. Per-CPU runqueue contents, vruntime, load weights, group hierarchy.
bpftrace one-liners on the sched: tracepoints — sched_switch, sched_wakeup, sched_migrate_task.

# Move PID 1234 to real-time round-robin priority 50
$ sudo chrt -r -p 50 1234

# Inspect current policy
$ chrt -p 1234
pid 1234's current scheduling policy: SCHED_RR
pid 1234's current scheduling priority: 50

# Run a command niced down so it loses every contention
$ nice -n 19 ./batch-job

# Per-CPU runqueue depths and migration counts
$ grep -E 'cpu#|nr_running|nr_migrations' /proc/sched_debug | head -40

# Live wakeup-latency histogram (eBPF)
$ sudo bpftrace -e 'tracepoint:sched:sched_wakeup { @[comm] = hist(nsecs - args->prev_state); }'

Production mistakes.

The same four or five patterns account for most scheduling-related incidents:

Oversubscribed containers with no CPU limits. A noisy neighbour with cpu.weight = 100 can hog every CPU on the node and starve everything else of latency, even though top shows no throttling. Always set both weights and caps on shared nodes.
SCHED_FIFO threads that never yield. An infinite loop at FIFO 99 will lock up a CPU until sched_rt_runtime_us throttles it, and if the operator turned the throttle off ("for latency"), the only way out is the watchdog. Always pair RT priority with a sleep or condition variable.
cpuset.cpus.partition misconfig. cgroup v2 partitions let you carve out exclusive CPU sets, but a typo'd root partition can steal CPUs from the system root and brick everything not in the partition. Validate with cat /sys/fs/cgroup/cpuset.cpus.effective before declaring victory.
NIC IRQs all on CPU 0. The default IRQ affinity on most distros pins every NIC queue to CPU 0; under load that one CPU softirq-spins at 100% while the others coast. Fix with set_irq_affinity.sh from the NIC vendor or the irqbalance daemon's hints.
Pinning without isolation. Pinning a thread to CPU 5 doesn't stop the kernel from scheduling other tasks there. For dedicated cores you need isolcpus=5 at boot or a cpuset cgroup with cpuset.cpus.partition=isolated.

Scheduling

More runnable threads than CPUs.

What the scheduler actually does.

Preemptive vs cooperative, and the timer tick.

Slices, quanta, and starvation.

Priorities and nice.

Classic algorithms: round robin to MLFQ.

What a context switch costs.

CFS — the completely fair scheduler.

EEVDF — the 6.6+ replacement.

Real-time classes — FIFO, RR, DEADLINE.

Multi-core, NUMA, and migration.

cgroup v2 — how containers get their share.

Latency-aware knobs.

Other operating systems.

Debugging scheduling.

Production mistakes.

Further reading.

04 — Memory management