Day-0 → Month-5 · curriculum
Study path · Operating systems

Operating systems,
from the ground up.

An operating system is a thin layer that lies to your code. It pretends each program owns the whole machine: infinite memory, dedicated CPUs, exclusive disks. Each section here takes one of those lies apart, shows the truth underneath, and explains why the abstraction was worth it. The mental models, OSTEP, and the labs that get you there.


Why the kernel matters.

Three problems, repeated forever. Sharing: many programs, one set of CPUs, one slab of RAM, one disk. The OS multiplexes them across time and space. Isolation: each program must believe it owns the machine, and one program's bug must not corrupt another. Abstraction: programs should not have to know whether the disk is a platter, an SSD, or a network filesystem. The file descriptor is the unifying layer.

Unix solved all three with a small set of primitives — process, file, fork, exec, the stream of bytes. Fifty-three years later the same primitives run hyperscale clouds. Linux added namespaces and cgroups for stronger isolation; io_uring and eBPF for cheaper abstraction; PREEMPT_RT for harder real-time. The shape stayed.

Read the kernel and you'll think differently about your code. Most of what you can change in user-space looks small once you've seen what the kernel does on every syscall. Profilers stop being mysterious. Latency budgets stop being arbitrary. "The OS is slow" becomes "I am holding the OS wrong".

Twelve mental models.

Twelve concepts cover ~95% of OS surface. Get these in your bones in the first month. Every kernel feature you meet (containers, eBPF, io_uring, KVM) is a recombination of them.

01 Process Day-zero

A program in execution: an address space, one or more threads, file descriptors, identity, and accounting. The OS unit of isolation. fork() creates one; exec() replaces its image; the PID stays.

02 Thread Day-zero

A flow of execution sharing the address space with siblings. Cheap to create (10–50 µs), but raises the contract from "this code runs" to "this code interleaves". 1:1 (Linux) vs M:N (Go, Erlang).

03 Virtual memory Day-zero

Each process sees a private 128-TB address space. The MMU translates page-by-page via four-level page tables, cached in the TLB. Demand-paged: pages aren't real RAM until you touch them.

04 Page table & TLB Practitioner

CR3 → PML4 → PDPT → PD → PT → frame. A TLB miss costs four memory accesses for the walk, so translations are cached in a small on-chip TLB. PCIDs let TLB entries survive process switches; huge pages let one entry cover 2 MB.

05 Scheduling Practitioner

Many runnable threads, few CPUs. Linux ran the O(1) scheduler 2003–07, CFS 2007–23, EEVDF since (merged in kernel 6.6). CFS and EEVDF track per-task virtual runtime; all three balance per-CPU runqueues. Real-time classes (FIFO, RR, DEADLINE) sit above.

06 Syscall boundary Practitioner

The single doorway from ring 3 to ring 0. ~100–1000 ns each on modern x86 via the SYSCALL instruction. The vDSO maps a small piece of kernel code and read-only data into user-space, so clock_gettime runs without a mode switch and can be up to 50× faster.

07 File descriptor Day-zero

A small integer indexing into the per-process FD table. Files, sockets, pipes, devices, signalfd, eventfd, epollfd — all reach you through one. The most successful abstraction in Unix.

08 Page cache Practitioner

Files live in RAM until the kernel evicts them. read() copies from cache; write() dirties cache pages flushed later. mmap maps cache pages directly. Tuned via vm.dirty_ratio + vm.swappiness.

09 Synchronization Practitioner

Mutexes (futex-backed), condition variables, semaphores, atomics with memory ordering. Race conditions, deadlocks, livelocks, priority inversion — the four hazards. Lock ordering or try-with-backoff.

10 IPC Operator

Pipes, FIFOs, Unix domain sockets (with SCM_RIGHTS for FD passing!), shared memory, signalfd / eventfd, POSIX message queues. Cross-host = TCP / QUIC / NATS / Kafka.

11 epoll & io_uring Operator

epoll: register FDs once, get back only the ready ones — O(active). io_uring: submit hundreds of operations through a shared-memory ring, harvest completions later. The substrate of every modern high-throughput server.

12 Namespaces & cgroups Operator

Namespaces give each container its own view of PIDs, mounts, network, UTS, IPC, users. cgroups v2 enforces CPU, memory, and IO quotas hierarchically. Together: the building blocks of every container runtime.

Day zero — first hour.

One hour. Read OSTEP chapters 4 (the abstraction: process), 13 (address spaces), and 26 (concurrency, an introduction). Then strace a simple program, watch every syscall, and follow one major page fault. The bar is muscle: read the right OSTEP chapters, then watch the kernel react in real time.

# 1. Read OSTEP ch. 4, 13, 26 (≈ 60 minutes)
#    https://pages.cs.wisc.edu/~remzi/OSTEP/

# 2. Pick a tiny C program (or write one)
cat > hello.c <<'EOF'
#include <stdio.h>
#include <unistd.h>
int main(void) { printf("hi from pid %d\n", getpid()); sleep(1); return 0; }
EOF
cc hello.c -o hello

# 3. Watch every syscall
strace -e trace=execve,openat,mmap,brk,write,exit_group ./hello

# 4. Watch page faults on a heavier program
/usr/bin/time -v ./your-actual-program 2>&1 | grep -E 'page faults|context switches'

# 5. Read your own /proc — pick a long-running process (e.g. your shell)
cat /proc/$$/status     # state, threads, RSS, VSZ
cat /proc/$$/maps       # virtual address layout
cat /proc/$$/limits     # rlimits

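# 6. Trace a system-wide event for 30s with bpftrace (optional)
#    The interval probe ends the trace; without it bpftrace runs until Ctrl-C.
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat
  { printf("%s -> %s\n", comm, str(args->filename)); }
  interval:s:30 { exit(); }'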

Done. You have read the right three chapters, watched a real syscall trace, inspected /proc, and (optionally) used eBPF to watch the system in flight. Everything below builds from here.

Week 1 to Month 3 — pick a track.

After the first hour you can read OS writing without bouncing off it. Spend the next three months on one track at a time, depth-first. Don't try to learn schedulers and file systems in the same fortnight. Pick the one that maps to your job and finish it.

The process model

fork, exec, wait, signals, zombies, orphans, setsid. Read Stevens APUE chapters 7–10; trace a shell pipeline with strace. Build a tiny shell as a weekend project — once you have it, fork/exec/dup2/wait stop being abstractions.

Memory & the MMU

Address spaces, page tables, the TLB, demand paging, COW, swap, NUMA. Read OSTEP "Virtualization" part. Inspect /proc/PID/maps, /proc/PID/smaps, /proc/PID/pagemap on a running process. Watch the TLB hit rate with perf stat -e dTLB-loads,dTLB-load-misses.

Schedulers

CFS, EEVDF, real-time classes, cgroup throttling. Read the kernel's sched-design-CFS doc; profile a Kubernetes pod under quota. The classic production pathology is CFS throttling at the cgroup boundary; the fix is rarely "more CPU".

File systems & I/O

Inodes, dentries, the page cache, fsync, journaling vs COW (ext4 vs ZFS/btrfs). Read the LFS paper; mount a few filesystems and compare. fio for benchmarks, strace -e openat for visibility.

Concurrency primitives

Mutexes, condvars, semaphores, atomics, memory ordering, futex. Read the C++20 atomics reference; read Paul McKenney's Is Parallel Programming Hard. Build an SPSC ring buffer: once it's correct, you understand acquire/release.

Networking inside the kernel

Sockets, TCP state machine, epoll, io_uring, eBPF, XDP, DPDK. Read Beej's sockets guide as a refresher; then the kernel networking docs and one Cloudflare engineering post on XDP.

Containers & cgroups

Namespaces (PID, mount, net, UTS, IPC, user, cgroup, time); cgroups v2 (CPU, memory, IO). Read Liz Rice's "Containers from Scratch" talk; build your own container with unshare / clone / pivot_root.


Books worth reading.

2018 · free online
Arpaci-Dusseau — Operating Systems: Three Easy Pieces (OSTEP)

The right modern OS textbook. Three parts — virtualization (CPU and memory), concurrency, persistence. Free online; print-friendly PDF; the homework is real. Start here.

2010 · Addison-Wesley
Robert Love — Linux Kernel Development (3rd ed.)

The most readable book on the Linux kernel. Slightly dated (kernel 2.6) but the core abstractions haven't moved. The book that turns "I know syscalls" into "I read kernel source for fun".

2013 · Addison-Wesley
Stevens, Rago — Advanced Programming in the UNIX Environment (3rd ed.)

APUE. The C-and-Unix bible. Files, processes, signals, threads, terminals, sockets — at the level of "here is the syscall, here is the man page, here are five real examples". Reach for it whenever a syscall surprises you.

2020 · Addison-Wesley
Brendan Gregg — Systems Performance (2nd ed.)

The methodology book for performance work on Linux. USE method, RED method, flame graphs, off-CPU analysis. The book that turns "the server is slow" into a finite checklist.

2021 · self-published
Paul E. McKenney — Is Parallel Programming Hard, And, If So, What Can You Do About It?

McKenney's perfbook. Free online. The most rigorous treatment of concurrency outside academic textbooks — RCU, hazard pointers, memory ordering, locklessness. Re-read every year.

2010 · No Starch Press
Kerrisk — The Linux Programming Interface

TLPI. Michael Kerrisk — maintainer of the Linux man-pages project — writes the book on Linux syscalls. Encyclopaedic; pairs with APUE for "Linux specifically" detail.

2022 · Pearson
Tanenbaum, Bos — Modern Operating Systems (5th ed.)

The canonical undergraduate textbook. Drier than OSTEP but broader — distributed OS, security, virtualization, mobile OS. Reach for it when OSTEP's cheerful tone wears thin.

Honourable mentions: Understanding the Linux Kernel (Bovet, Cesati — older but still excellent on the VM subsystem); Linux System Programming (Robert Love — the user-space companion to LKD); The Design of the UNIX Operating System (Bach — historical but still beautifully clear on the original architecture).

Courses and references.


Papers worth reading.

Twelve papers, roughly 1965 → 2020. Read them in order. The field is dense and citation-heavy. Most are 10–25 pages.

  1. 1965 · Dennis & Van Horn
    Programming Semantics for Multiprogrammed Computations

    The paper that gave us the process abstraction. Read it for the historical frame: "what is a process" was a research question once. Twenty pages.

  2. 1973 · Lampson
    A Note on the Confinement Problem

    Lampson's covert-channel paper. The first formal treatment of "what does the OS keep secret, and from whom". The vocabulary of seven covert-channel categories is still the right one in 2026.

  3. 1974 · Ritchie & Thompson
    The UNIX Time-Sharing System

    The Bell Labs CACM paper. Twelve pages explaining files, processes, the shell, the directory tree, pipes. Every Unix design decision in the field traces back to this; read it before any other OS paper.

  4. 1992 · Rosenblum & Ousterhout
    The Design and Implementation of a Log-Structured File System

    LFS — write the entire disk as a circular log. The architectural ancestor of every modern flash-aware filesystem (F2FS, the FTL inside SSDs themselves). Read it for the disk-as-log mental model.

  5. 1994 · Bonwick
    The Slab Allocator: An Object-Caching Kernel Memory Allocator

    Bonwick's slab allocator from Solaris. The Linux SLUB allocator descends directly. Read for the "object pools matter" insight that drives every kernel memory allocator since.

  6. 1994 · Cao, Felten, Li
    Implementation and Performance of Application-Controlled File Caching

    The classic paper on letting applications hint at the page cache. madvise, posix_fadvise — they all trace here. Read before tuning vm.* anywhere in production.

  7. 1993 · Satyanarayanan et al
    Lightweight Recoverable Virtual Memory

    The mmap-and-checkpoint pattern. The intellectual ancestor of every modern persistent-memory and copy-on-write database. Worth reading before reaching for kernel byte-addressable persistent memory APIs.

  8. 2010 · McKenney
    Memory Barriers: A Hardware View for Software Hackers

    McKenney is the author of RCU. This paper explains memory barriers, store buffers, and cache coherence from the hardware side. Read it once and the C++/Rust acquire/release vocabulary stops being mysterious.

  9. 2014 · Anderson, Dahlin
    Operating Systems: Principles & Practice (the OS:PP textbook)

    Not a paper, but the text. Modern OS textbook from Tom Anderson; pairs with OSTEP. Particularly strong on synchronisation and threads. Free PDF chapters online.

  10. 2017 · Bonifaci, Brandenburg, Stiller, Wieder
    Counting on Fast Userspace Mutexes (futex revisited)

    A revisit of futex semantics with priority inheritance. If you operate latency-critical code, this is required reading; if you don't, it's a beautiful narrow paper to feel smart about.

  11. 2019 · Axboe et al
    io_uring: An Introduction (kernel docs)

    Jens Axboe's introduction to io_uring. Submission and completion rings, polling mode, fixed buffers. The substrate every new high-throughput Linux server in 2020+ is built on.

  12. 2020 · Gregg
    BPF Performance Tools

    Brendan Gregg's comprehensive book on eBPF for production observability. opensnoop, execsnoop, tcpconnect, profile — every diagnostic tool reduced to a one-liner. Pair with his earlier Systems Performance book.

Going further: Lampson’s "Hints for Computer System Design"; the Mach paper (Accetta et al, 1986); the seL4 verification paper (Klein et al, 2009); the Singularity OS papers from MSR; Solaris’ DTrace (Cantrill, 2004) — the conceptual ancestor of eBPF.

Talks worth watching.

Hands-on tools.

Theory without something you can run is fragile. Each of these is a manageable way to make the kernel push back when you make a mistake.

Environment | Cost | Best for
xv6 | Free, open-source | MIT 6.S081’s teaching Unix. ~10k lines of C. Compile, boot in QEMU, modify the scheduler / shell / file system. The most direct way to feel a kernel respond to your changes.
strace + ltrace | Free | strace traces syscalls; ltrace traces library calls. The tools every Linux engineer reaches for first when "why is this program slow / failing / weird". Read the man page once; you’ll use them forever.
perf | Free, in-tree | The kernel’s native profiler. perf top for live; perf record + perf report for post-hoc; perf stat for hardware counters. The tool that turns "the box is slow" into "this function is the bottleneck".
bpftrace + BCC | Free, open-source | eBPF for the rest of us. Brendan Gregg’s tools: opensnoop, execsnoop, tcpconnect, runqlat. Production-safe, zero-instrumentation observability. Run as root; learn five tools; replace half your debugging.
QEMU + your own kernel | Free | Build a custom kernel from the source tree (make defconfig + make -j$(nproc)); boot it in QEMU. The "I made one change and watched it work" loop is half a day of setup and infinite reps after.
cyclictest + stress-ng | Free, open-source | Latency benchmarks for real-time scheduling. cyclictest measures interrupt latency; stress-ng generates load. The right tools to evaluate PREEMPT_RT, isolated CPUs, or any "is this kernel real-time enough" claim.

Latency, at a glance.

Twelve numbers, calibrated for modern hardware. Print this and tape it next to the monitor. The ones that surprise people most: a syscall costs ~100× a function call, a cache miss that goes all the way to main memory costs ~100× an L1 hit, and a major page fault costs ~10,000× a minor one.

Operation | Latency | Notes
L1 cache reference | ~1 ns | Cache hit, no stall.
Branch misprediction | ~3 ns | Pipeline flush; small but in tight loops it dominates.
L2 cache reference | ~4 ns | Still on-chip.
Mutex lock/unlock (uncontended) | ~25 ns | Modern Linux futex fast path.
Main memory reference | ~100 ns | Hundreds of cycles; feels free, isn't.
Empty syscall (getpid) | ~100–300 ns | The mode-switch cost; vDSO bypasses it for clock_gettime.
Context switch (same process) | ~1 µs | No CR3 change; cheap.
Context switch (cross-process) | ~3–5 µs | TLB flush amortised by PCID.
NVMe random read (4 KB) | ~10–100 µs | Fast SSDs; one to two orders of magnitude above a context switch.
Local network round trip | ~50 µs – 1 ms | Same datacenter, modern NICs.
Page fault from disk (major) | ~ms | Major page fault; counted in the majflt field of /proc/PID/stat.
Cross-region network | ~30–300 ms | Architecturally expensive; design around.

Numbers are order-of-magnitude on a 2024-class x86 server. Always measure on your own hardware. Jeff Dean’s "Latency Numbers Every Programmer Should Know", kept current in Colin Scott's interactive version, is the canonical scaffold.

Common mistakes.

Patterns every team writes at least once. Read these now and you'll recognise the shape later, when something on-call is misbehaving and the dashboard is no help.

Forgetting that fork() is COW
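A 4 GB process forks; the engineer thinks "we just doubled RAM". No: parent and child share pages until one of them writes. Fork copies page tables, not pages, so it costs milliseconds rather than seconds. The real cost comes afterwards: every shared page that either side writes before the child calls exec() gets copied, so the actual mistake is calling exec() too late in a process that keeps dirtying memory.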
Treating threads as free
A thread is 8 MB of virtual address space, ~16 KB of kernel state, plus context-switch cost. 100k threads on Linux works (it's designed for it), but an unbounded thread pool grows under load until memory or the scheduler gives out. Cap the pool; queue the rest.
Blocking the runtime's event loop
A synchronous read inside an async runtime (Node, Tokio, Go without proper isolation) can stall the whole event loop. Either move the work to a worker pool or use the runtime's blocking-syscall offload.
Ignoring CFS throttling
Kubernetes CPU limits are enforced by cgroup CFS bandwidth control. A burst hits the quota; the cgroup is throttled for the rest of the period (default 100 ms). Symptoms: tail-latency spikes correlated with the period. Fix: raise the limit, remove it, or use SCHED_DEADLINE.
fsync paranoia (or lack thereof)
A successful write() puts bytes in the page cache, not on disk. fsync flushes them. fsync on the parent directory after rename is required for the rename to survive a crash. Most "we lost data after a power cut" bugs trace here.
Misusing /dev/random vs getrandom
Old code reads /dev/random and blocks at boot waiting for entropy. Modern code calls getrandom(2) (default flags) — it does the right thing on every kernel since 3.17. Don't ship blocking entropy reads in 2026.
Storing config in env without size limits
execve limits each argv or envp string to 128 KB (MAX_ARG_STRLEN) and the combined size to a quarter of the stack rlimit (2 MB with the default 8 MB stack). Container orchestrators happily inject a 200 KB env var, and your fork/exec starts failing with E2BIG. Cap env-var sizes in the config layer.
Treating mmap as faster I/O
mmap looks like memory; underneath it's page faults that issue disk I/O. For sequential reads of large files, read() with a sane buffer is often faster (no fault-per-page overhead). Profile.
Writing to /tmp without considering tmpfs
/tmp is RAM-backed (tmpfs) on many distros now. A 50 GB write to /tmp can exhaust memory and OOM the box. Use /var/tmp or an explicit on-disk path for large temp files.
Reaching for a thread when a process would be safer
A bug in a thread takes down its siblings; a bug in a process takes down only itself. For untrusted code, hot-reload, or "I want a hard isolation boundary", processes are the right tool — even at their higher cost.

Quick test.

Ten cards: the questions interviewers ask, the things that bite operators in production, and the trivia that separates "I run Linux" from "I understand it".

Card 1 of 10
Why does fork() of a 4 GB process cost milliseconds, not seconds?
Suggested sequences

Reading progressions

Three ordered paths through this material. Pick the one that matches where you are.

Path 01 · Processes
Processes, threads & scheduling

How the OS multiplexes CPU time between tasks, from process lifecycle to thread pools.

  1. Thread Pools — OS thread model
  2. Event Loops — single-threaded alternative
  3. Goroutine Scheduler Simulator ↗
  4. CPU Cache Simulator ↗
Path 02 · Memory
Virtual memory & allocation

From the MMU to malloc: how memory is virtualised, managed, and eventually collected.

  1. Memory Allocation — allocator internals
  2. Garbage Collection — GC algorithms
  3. CPU Cache Simulator ↗
  4. Go Channels — stack vs heap
Path 03 · I/O
I/O, storage & networking

How the kernel handles blocking I/O, file systems, and the interface to the network stack.

  1. Ring Buffer — kernel I/O queues
  2. WAL — durable sequential write
  3. WebSockets — long-lived I/O
  4. The Networking Stack — codex

What's next.

Operating systems reward re-reading. OSTEP, read on day 30 and again on day 300, will give you different things. So will Robert Love’s LKD. So will every Brendan Gregg talk. The field is not large. It is dense, and it has been compounding for fifty years.

Pick one real kernel subsystem and read its source for an afternoon. The Linux scheduler (kernel/sched/), the page-cache (mm/filemap.c), the futex implementation (kernel/futex/) are all open. Pair what you read with the paper that inspired it. Then come back to your own code, your own profiles, your own slow paths. You will rewrite some of them.