How Docker Actually Works

A "container" is not a thing. It is a regular Linux process the kernel started with clone() and a specific set of namespace flags, dropped into a cgroup, and given an overlayfs root. Click run to watch one being built, primitive by primitive. Then toggle the primitives off and watch the isolation collapse.

[Interactive demo. Initial state: 6 of 7 namespaces on (user namespace off); an overlayfs stack of 4 read-only lowerdirs (debian:bookworm-slim, 74 MB; python:3.12 runtime, 128 MB; pip install -r req.txt, 212 MB; COPY ./app /app, 6 MB) under 1 writable upperdir (copy-up on write; lost on container delete); memory at 24 of 100 MB; CPU at 8% against a 20% quota; leak panel: with the user namespace off, UID 0 inside is UID 0 outside, so an escape is root on the host.]

What you're looking at

The growing list of steps is one container being assembled in four phases: stack the read-only overlayfs layers under a writable upper layer, call clone() with the namespace flags you left on, write the cgroup limits, then drop capabilities and execve the entrypoint. The chips above are the seven namespaces; the layer stack shows the image; the status bars track memory and CPU against their cgroup ceilings; the leak panel lists whatever isolation you switched off.

Run it once with all seven namespaces on, then turn off pid and run again — the leak panel warns that the container can now see every process on the host. Click +32M mem until the bar hits the limit and the status flips to oom-killed, even though the host has plenty of RAM free; that kill comes from the cgroup, not the machine. The point that should land is how little a container really is: switch the mode to chroot and almost nothing is isolated but the filesystem root, and switch to vm and the whole namespace row goes away because a second kernel is doing the work instead.


Containers are processes with extra flags

Once you know which kernel features Docker is stitching together, the rest of the ecosystem stops looking like magic.

Mechanically, docker run resolves to a short sequence of kernel calls. A clone() with CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWCGROUP (and optionally CLONE_NEWUSER) creates a child task in a fresh set of Linux namespaces. A few write()s to files under /sys/fs/cgroup/<name> put that task under memory, CPU, IO, and PID limits. A mount() of type overlay (what mount -t overlay does) stacks the image layers underneath a writable upper directory, and pivot_root makes the result the task's new /. A seccomp BPF filter is loaded, a chunk of capabilities is dropped, and finally execve() replaces the process image with the entrypoint.
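As a rough illustration of the first step, the sketch below builds the flags argument that would be passed to clone(). The numeric values are the CLONE_NEW* constants from <linux/sched.h>; the helper names clone_flags and shared_with_host are invented for this example and are not part of any runtime. It only computes the bitmask and does not create a process.

```python
# Flag values as defined in <linux/sched.h>.
CLONE_FLAGS = {
    "mnt":    0x00020000,  # CLONE_NEWNS
    "cgroup": 0x02000000,  # CLONE_NEWCGROUP
    "uts":    0x04000000,  # CLONE_NEWUTS
    "ipc":    0x08000000,  # CLONE_NEWIPC
    "user":   0x10000000,  # CLONE_NEWUSER
    "pid":    0x20000000,  # CLONE_NEWPID
    "net":    0x40000000,  # CLONE_NEWNET
}

def clone_flags(enabled):
    """OR together the CLONE_NEW* bits for the requested namespaces."""
    flags = 0
    for name in enabled:
        flags |= CLONE_FLAGS[name]
    return flags

def shared_with_host(enabled):
    """Namespaces left off: the child keeps sharing these with its parent."""
    return sorted(set(CLONE_FLAGS) - set(enabled))

# The six namespaces listed in the text, with the user namespace left off.
default = ["pid", "net", "mnt", "uts", "ipc", "cgroup"]
print(hex(clone_flags(default)))   # 0x6e020000
print(shared_with_host(default))   # ['user']
```

Every name returned by shared_with_host is a place where the new task sees exactly what the host sees.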

Every piece is years older than Docker. Plan 9 had per-process mount namespaces in the early 1990s. Linux added the namespaces Docker uses one at a time between 2002 (mnt) and 2016 (cgroup ns). Google's cgroups patches landed in 2007. AUFS, then OverlayFS (mainline in 3.18, 2014), gave you the layer cache. Docker's contribution was packaging (a CLI, an image format with content-addressable layers, and a registry protocol) on top of LXC, then on top of its own runtime libcontainer, which became runc, which became the reference implementation of the OCI runtime standard. containerd and CRI-O are two implementations of the same idea sitting underneath Docker and Kubernetes respectively. Podman skips the daemon entirely and invokes an OCI runtime such as runc or crun directly.

Hold the picture: the kernel does the work; the runtime is glue. If you understand namespaces and cgroups, you understand docker, nerdctl, podman, systemd-nspawn, and most of what Kubernetes does to a node. They are all different command-line front-ends to the same kernel primitives.


Overlayfs and the layer cache

Why your Dockerfile rebuilds the world when you change a comment in app.py.

An image is not a tarball. It's a manifest pointing at an ordered list of layer tarballs, each addressed by the SHA-256 of its content. docker pull fetches the layers it doesn't already have and dumps each one into /var/lib/docker/overlay2/. To start the container, the runtime stacks those layers as lowerdir= entries in a single mount -t overlay call, with a fresh upperdir on top for the container's writes. Read a file: overlayfs walks layers top-to-bottom and returns the first hit. Write a file: overlayfs copies it up to the upper layer (copy-up) and writes there. Delete a file in a lower layer: overlayfs records a whiteout in the upper, hiding it.
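The three rules above (read top-down, copy-up on write, whiteout on delete) can be modeled in a few lines. This is a toy in-memory model, not how the kernel implements overlayfs: layers are plain dicts, and the Overlay class and WHITEOUT marker are names made up for the sketch.

```python
WHITEOUT = object()  # stand-in for an overlayfs whiteout entry

class Overlay:
    def __init__(self, lowers):
        # lowers[0] is the topmost read-only layer, like the first lowerdir= entry.
        self.lowers = lowers
        self.upper = {}  # the container's writable layer

    def read(self, path):
        # Walk upper first, then the lower layers top to bottom; first hit wins.
        for layer in [self.upper, *self.lowers]:
            if path in layer:
                if layer[path] is WHITEOUT:
                    raise FileNotFoundError(path)
                return layer[path]
        raise FileNotFoundError(path)

    def append(self, path, data):
        # Copy-up: the whole file lands in upper, then the change is applied there.
        current = self.read(path)
        self.upper[path] = current + data

    def delete(self, path):
        self.read(path)  # raises if the file is not visible
        if any(path in layer for layer in self.lowers):
            self.upper[path] = WHITEOUT  # hide the lower copy, never modify it
        else:
            del self.upper[path]

base = {"/etc/os-release": "debian", "/bin/sh": "dash"}
app = {"/app/main.py": "print('hi')"}
fs = Overlay([app, base])

fs.append("/etc/os-release", " 12")
fs.delete("/bin/sh")
print(fs.read("/etc/os-release"))  # debian 12
print(base["/etc/os-release"])     # debian
```

The lower dicts are never mutated, which is what lets many containers share the same image layers.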

Each Dockerfile instruction that changes the filesystem (RUN, COPY, ADD) produces one layer; the others only add metadata. The build cache keys a layer on the instruction text and the SHA of its inputs. Change the comment in app.py, and the COPY . /app layer's input hash changes, which busts the cache, which busts every layer after it. The standard fix (COPY package.json /app/, then RUN npm install, then COPY . /app) exists exactly to keep the expensive layer (npm install, which can download hundreds of megabytes) earlier in the Dockerfile, and therefore below the cheap layer (the source copy, which changes every commit) in the stack, so the expensive layer keeps cache-hitting. This isn't a Docker quirk; it's content-addressable storage with append-only semantics.
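The cascade is easy to reproduce with a simplified model in which each step's cache key is a hash of its parent's key, its instruction text, and the digest of its input files. Real builders compute their keys differently in detail; cache_keys and rebuilt are hypothetical helpers written for this sketch, and the file contents are made up.

```python
import hashlib

def sha(text):
    return hashlib.sha256(text.encode()).hexdigest()

def cache_keys(steps, files):
    """steps: list of (instruction, [input paths]); files: path -> content."""
    keys = []
    parent = ""
    for instruction, inputs in steps:
        input_digest = sha("".join(sha(files[p]) for p in sorted(inputs)))
        # Chaining on the parent key is what makes a miss cascade downward.
        parent = sha(parent + instruction + input_digest)
        keys.append(parent)
    return keys

def rebuilt(steps, before, after):
    """Instructions whose cache key differs between two source trees."""
    old, new = cache_keys(steps, before), cache_keys(steps, after)
    return [instr for (instr, _), a, b in zip(steps, old, new) if a != b]

naive = [
    ("COPY . /app", ["package.json", "app.py"]),
    ("RUN npm install", []),
]
ordered = [
    ("COPY package.json /app/", ["package.json"]),
    ("RUN npm install", []),
    ("COPY . /app", ["package.json", "app.py"]),
]

before = {"package.json": "{}", "app.py": "print(1)"}
after = {**before, "app.py": "print(1)  # a comment"}

print(rebuilt(naive, before, after))    # ['COPY . /app', 'RUN npm install']
print(rebuilt(ordered, before, after))  # ['COPY . /app']
```

In the naive ordering the comment change invalidates npm install as well; in the reordered one only the final source copy is rebuilt.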

There are real limits. The number of lower directories in one overlay mount is capped (Docker's overlay2 driver supports at most 128 lower layers), which is one reason very long chains of layers get consolidated or squashed. Performance on writes to large files is poor, because the whole file gets copied up on first write, so databases inside containers should usually live on a volume or bind-mounted host directory, not the container filesystem. And the content-addressable design means that two images that FROM debian:bookworm share the base layer on disk and in the page cache for free, which is why running fifty containers from related images costs little extra disk or page cache for the files they have in common.


cgroups are the new ulimit

Resource limits stopped being per-process in 2007 and almost nobody noticed.

Before cgroups, the kernel's resource-limit primitive was setrlimit(2): one set of caps per process, inherited across fork. Useful for "stop this PHP script if it eats a gigabyte"; useless for "this Postgres cluster, all 14 worker processes, must collectively stay under 8 GB." cgroups (control groups, originally called "process containers" at Google) fixed that by letting you put N processes into a named group with shared limits on memory, CPU bandwidth, IO bandwidth, and PIDs. Network bandwidth is not limited by a cgroup controller; cgroups can only be used to classify traffic for other tools to shape. Every container runtime puts each container in its own cgroup and writes the limits there.

cgroups v1 had separate hierarchies for each controller (memory, cpu, blkio, etc.), which made consistent accounting impossible: a process could be in different positions in each tree. v2 unified them into a single hierarchy and is what every modern distro defaults to. When you see memory.max=100M on a cgroup and your container's RSS plus page cache reaches that line, the kernel first tries to reclaim memory from inside the cgroup (dropping clean page cache, for example); if that does not free enough, the cgroup OOM-killer fires and shoots a process inside that cgroup. Crucially, the host's OOM-killer is untouched; the host has plenty of RAM, but your container's process has accounting that crossed its line. This is why "my container got OOM-killed but the node has 80 GB free" is not a paradox.
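The accounting can be sketched as a toy model. It assumes, as the kernel generally does, that reclaimable page cache is dropped before anything is killed. The MemoryCgroup class is invented for illustration, and its victim choice (largest consumer) is a simplification of the kernel's real OOM scoring.

```python
M = 1024 * 1024

class MemoryCgroup:
    def __init__(self, memory_max):
        self.memory_max = memory_max
        self.anon = {}       # pid -> non-reclaimable bytes (simplified RSS)
        self.page_cache = 0  # reclaimable bytes charged to the group
        self.killed = []

    def usage(self):
        return sum(self.anon.values()) + self.page_cache

    def charge(self, pid, nbytes, cache=False):
        if cache:
            self.page_cache += nbytes
        else:
            self.anon[pid] = self.anon.get(pid, 0) + nbytes
        self._enforce()

    def _enforce(self):
        over = self.usage() - self.memory_max
        if over <= 0:
            return
        # Step 1: reclaim page cache charged to this group.
        self.page_cache -= min(over, self.page_cache)
        # Step 2: still over the limit, so kill inside the group.
        # Host free memory never enters the decision.
        while self.usage() > self.memory_max and self.anon:
            victim = max(self.anon, key=self.anon.get)
            del self.anon[victim]
            self.killed.append(victim)

cg = MemoryCgroup(100 * M)
cg.charge(1, 24 * M)
cg.charge(1, 40 * M, cache=True)
cg.charge(2, 50 * M)   # 114M requested: 14M of cache is reclaimed, nobody dies
print(cg.killed)       # []
cg.charge(2, 60 * M)   # cache is exhausted, pid 2 is the largest consumer
print(cg.killed)       # [2]
```

Note that the limit is shared: pid 1 and pid 2 are charged against the same 100M, which is exactly what a per-process rlimit cannot express.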

Kubernetes layers another cgroup tree on top: a pod becomes a parent cgroup, each container becomes a sub-cgroup, and the kubelet's QoS classes (Guaranteed, Burstable, BestEffort) map to different cgroup parents, with the container processes given different oom_score_adj values so the right pods get killed first under node pressure. Per-container memory limits are still the same kernel primitive, memory.max. CPU limits go through the CFS bandwidth controller: cpu.max "20000 100000" means 20 ms of CPU per 100 ms period. Try to use more than the quota within a period and you get throttled, which shows up as throttled_usec in cpu.stat and as tail-latency spikes in your service.
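Both files are plain text, so checking whether a service is being throttled takes very little code. The sketch below parses strings in the cgroup v2 cpu.max and cpu.stat formats; the sample numbers are made up for illustration, and the function names are this example's own. On a real host the text would come from reading the files under the container's cgroup directory.

```python
def parse_cpu_max(text):
    """Return the CPU ceiling as a fraction of one core, or None if unlimited."""
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)

def parse_cpu_stat(text):
    """Turn 'key value' lines into a dict of integers."""
    return {key: int(value)
            for key, value in (line.split() for line in text.strip().splitlines())}

def throttled_fraction(stat):
    """Share of enforcement periods in which the group hit its quota."""
    if stat["nr_periods"] == 0:
        return 0.0
    return stat["nr_throttled"] / stat["nr_periods"]

# Illustrative contents, not measurements from a real system.
sample_stat = """
usage_usec 900000
user_usec 700000
system_usec 200000
nr_periods 50
nr_throttled 10
throttled_usec 350000
"""

print(parse_cpu_max("20000 100000"))  # 0.2
print(parse_cpu_max("max 100000"))    # None
stat = parse_cpu_stat(sample_stat)
print(throttled_fraction(stat))       # 0.2
```

A throttled fraction that climbs while average CPU usage looks modest is the usual signature of the tail-latency problem described above.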

What containers don't isolate

The list of things namespaces don't cover is longer than the list of things they do.

Containers share the kernel. They share the page cache (a hot file in one container is hot for every container). They share the system clock: wall-clock time (CLOCK_REALTIME) is not namespaced at all, so a container cannot have its own idea of what time it is, and none of the seven namespaces here touches CLOCK_MONOTONIC either. They share the kernel's random pool. By default they share the host's UID space, which is why container UID 0 is host UID 0 unless you turn on user namespaces. User namespaces are off in Docker by default because the UID-mapping semantics break volume mounts and a non-trivial fraction of container images assume UID 0 means root-as-root.
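The UID point can be made concrete with the mapping format the kernel exposes in /proc/<pid>/uid_map, where each line is "inside-start outside-start length". The sketch below translates a container UID to a host UID under a given map. The function name to_host_uid is made up for this example, and the 100000/65536 range is just an illustrative subordinate-ID allocation.

```python
def to_host_uid(uid, uid_map):
    """Translate a UID seen inside a namespace to the host UID, or None if unmapped."""
    for line in uid_map.strip().splitlines():
        inside, outside, length = (int(field) for field in line.split())
        if inside <= uid < inside + length:
            return outside + (uid - inside)
    return None

# No user namespace: the identity map of the initial namespace.
HOST_MAP = "0 0 4294967295"
# With a user namespace: a hypothetical remap onto an unprivileged host range.
REMAPPED = "0 100000 65536"

print(to_host_uid(0, HOST_MAP))     # 0
print(to_host_uid(0, REMAPPED))     # 100000
print(to_host_uid(1000, REMAPPED))  # 101000
print(to_host_uid(70000, REMAPPED)) # None
```

With the identity map, root in the container is root on the host; with the remap, the same process escaping lands as an unprivileged host user, which is also why files on a host volume suddenly appear to belong to someone else.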

Container escape is almost always a kernel bug, a runtime bug, or a misconfiguration. Dirty Pipe (CVE-2022-0847) let an unprivileged process overwrite the cached contents of files it could only read, which inside a container includes files in shared image layers and anything from the host mounted read-only. The runc CVE-2019-5736 let a container overwrite the runc binary on the host and gain root there the next time runc was invoked. --privileged containers get the host's device nodes and can mount the host filesystem inside themselves; never run untrusted workloads privileged. Mounting the Docker socket inside a container is, in security terms, handing that container root on the host: the daemon has root and the socket is the API.

The response to "the kernel attack surface is too big" comes in two shapes. gVisor (Google, 2018) reimplements much of the Linux syscall surface in a userspace process called Sentry, so the container's syscalls hit Sentry rather than the host kernel: fewer privileged code paths, at a real performance cost. Kata Containers and Firecracker (AWS, the engine under Lambda and Fargate) wrap each container in a stripped-down VM with its own tiny kernel: hardware-enforced isolation, a boot time that Firecracker advertises as roughly 125 ms, and a fit for the multi-tenant serverless threat model. The trade is the same one VMs always made: more memory, slower start, but the blast radius of a kernel CVE shrinks back to one tenant.
