What is the difference between a container and a virtual machine?

A VM runs a complete guest operating system on top of a hypervisor; a container runs as a process on the host operating system, isolated by kernel features (namespaces, cgroups). VMs are heavier (gigabytes of RAM, minutes to start) but isolate at the hardware boundary. Containers are lighter (megabytes, milliseconds to start) but share the host kernel — a kernel exploit affects all containers on the host.

OCI (Open Container Initiative) is the standard that defines what a container image is and what a container runtime should do. Three artifacts: the image spec (the layered tarball format), the runtime spec (config.json + rootfs), and the distribution spec (how registries serve images). Docker, Podman, containerd, and CRI-O all implement OCI.

How a container is just a process with its own view of the system.

Not a virtual machine, and not a special kind of process either: a regular Linux process with eight kinds of namespace and a few cgroup limits. The illusion of isolation is built from primitives the kernel has had for years.

What is a container?

Two kernel features, that's it.

A container is a process running on a host kernel, isolated from other processes by Linux namespaces (PID, network, mount, UTS, IPC, user, cgroup, time) and resource-bounded by cgroups. Docker (2013), then the OCI standard (2015), made containers ubiquitous; Kubernetes made them schedulable. Containers are not VMs — they share the host kernel.

A container is a regular Linux process. It runs on the same kernel as everything else on the host. What makes it look isolated is two kernel features applied to that process at startup: namespaces (which kernel state the process can see) and cgroups (which resources it may consume).

That's the whole core. The fancy parts — Docker, containerd, OCI — are conventions about how to set up the namespaces, package the filesystem, and ship the image. Underneath, every "container runtime" is calling the same handful of clone(), unshare(), and setns() syscalls.

Namespaces: eight ways a process is isolated

Eight axes of isolation.

Each namespace isolates one kernel resource. The net namespace, for example, changes the following inside the container.

net namespace

network stack — own interfaces, own routing table, own iptables; veth pair connects to host bridge

unshare --net bash # new net namespace, fresh shell
# inside, the kernel reports a new view of net state
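
One way to see which namespaces a process lives in is to read the symlinks under /proc/<pid>/ns, whose targets look like net:[4026531840]. The sketch below is a minimal illustration, assuming a Linux host with /proc mounted; the helper names parse_ns_link and namespaces_of are made up for this example.

```python
import os
import re

# A namespace symlink target has the form "<kind>:[<inode>]".
NS_LINK = re.compile(r"^(?P<kind>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target: str) -> tuple[str, int]:
    """Split a target such as 'net:[4026531840]' into ('net', 4026531840)."""
    match = NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match["kind"], int(match["inode"])


def namespaces_of(pid: str = "self") -> dict[str, int]:
    """Map each entry in /proc/<pid>/ns to its namespace inode (Linux only)."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for name in sorted(os.listdir(ns_dir)):
        _kind, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        result[name] = inode
    return result
```

Two processes share a namespace exactly when the inode numbers match, so comparing namespaces_of() for a shell on the host with namespaces_of(str(pid)) for a containerized process should show different values for net, pid, mnt and so on.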

cgroups: how the kernel caps what a container uses

Six controllers, enforced by the kernel.

Namespaces hide; cgroups limit. A namespace isolates the view; a cgroup bounds the resource consumption — CPU shares, memory ceilings, block I/O. Together they make a container a citizen the host can predict.

cgroup v2 vs v1. The original cgroup v1 (kernel 2.6.24, 2008) used a separate hierarchy per controller: you could put a process in different cgroups for CPU and memory, which made it expressive but operationally fragile. cgroup v2 (kernel 4.5, 2016, default in modern distros since 2021) unifies them: one tree, one cgroup per process across all controllers. Kubernetes 1.25 promoted cgroup v2 support to stable; Docker 20.10+ supports both. The memory.high attribute in v2 (a soft limit that throttles and reclaims but doesn't kill) is a major operational improvement over v1, whose soft limit was only a best-effort reclaim hint.

Real numbers. A typical Kubernetes pod spec sets resources.limits.cpu: 1000m (one full core) and resources.limits.memory: 2Gi. The kubelet translates this to cpu.max=100000 100000 (CFS quota: 100ms of CPU per 100ms period) and memory.max=2147483648 in the cgroup. When the container exceeds memory.max it gets OOM-killed by the kernel; when it exceeds CPU it gets throttled (run at slower wall-clock pace, no kill). The choice between throttle and kill is exactly why memory limits sting more than CPU ones.
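
The arithmetic behind those two values can be sketched in a few lines. This is an illustration of the conversion described above, not the kubelet's actual code; the function names are hypothetical, and only integer quantities with the common suffixes are handled.

```python
CFS_PERIOD_US = 100_000  # default CFS period: 100 ms, expressed in microseconds

_SUFFIXES = {
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4,   # binary
    "k": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4,       # decimal
}


def cpu_limit_to_cpu_max(limit: str, period_us: int = CFS_PERIOD_US) -> str:
    """Turn a CPU limit such as '1000m' or '2' into a cgroup v2 cpu.max value."""
    if limit.endswith("m"):
        millicores = int(limit[:-1])
    else:
        millicores = round(float(limit) * 1000)
    quota_us = millicores * period_us // 1000
    return f"{quota_us} {period_us}"


def memory_limit_to_bytes(limit: str) -> int:
    """Turn a memory limit such as '2Gi' into the byte count for memory.max."""
    for suffix, factor in _SUFFIXES.items():
        if limit.endswith(suffix):
            return int(limit[: -len(suffix)]) * factor
    return int(limit)  # a bare number is already in bytes


print(cpu_limit_to_cpu_max("1000m"))   # 100000 100000
print(memory_limit_to_bytes("2Gi"))    # 2147483648
```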

The CFS quota throttling problem. A famous container gotcha: a multi-threaded process that bursts briefly and then idles runs fine on a bare host but gets throttled inside a 1-CPU cgroup, because the CFS quota counts CPU time summed across all threads in the cgroup, not wall-clock time. A burst of 8 threads at 100% for 12.5ms uses up the whole 100ms quota, and every thread then stalls for the remaining 87.5ms of the period. Production fix: either shorten the period and scale the quota with it (for example --cpu-period=10000 --cpu-quota=10000, both in microseconds, so a stall lasts at most 10ms) or grant more CPU than the average load suggests. Many production teams have a folklore "CPU limits are evil" rule born from this.
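
A toy model makes the throttling cost concrete. The sketch below assumes every thread has its own core and runs flat out until the work is done or the quota is spent; it models only the quota accounting, not the real scheduler, and wall_time_ms is a name invented for this example.

```python
def wall_time_ms(cpu_work_ms: float, threads: int,
                 quota_ms: float = 100.0, period_ms: float = 100.0) -> float:
    """Wall-clock time to finish cpu_work_ms of CPU time under a CFS quota.

    Each period the cgroup may burn at most quota_ms of CPU time, summed
    over all threads. Once the budget is gone, every thread waits for the
    next period to begin.
    """
    remaining = cpu_work_ms
    elapsed = 0.0
    while True:
        # The budget is capped by the quota, or by what the threads can
        # physically consume in one period, whichever is smaller.
        budget = min(quota_ms, threads * period_ms)
        if remaining <= budget:
            return elapsed + remaining / threads
        remaining -= budget
        elapsed += period_ms  # throttled until the period rolls over


# 200 ms of CPU work on 8 threads:
print(wall_time_ms(200, 8))                 # 112.5 with a 1-CPU limit
print(wall_time_ms(200, 8, quota_ms=800))   # 25.0 with an 8-CPU limit
```

Under these assumptions the 1-CPU limit turns a 25 ms burst into 112.5 ms of wall-clock time, even though average CPU use over the second period is far below one core.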

memory.max — "OOM kill yourself at 512 MiB"

Container filesystems: layered and copy-on-write

Layered filesystem, copy-on-write.

A container image is a stack of read-only filesystem layers, cached by content hash on every host that has pulled them; a writable top layer is added only when a container starts. Each filesystem-changing Dockerfile instruction (RUN apt install, COPY src) creates a new layer; the layers below are shared with every other image that uses them. That's why images built FROM ubuntu share most of their disk space.

At runtime, an overlay filesystem stacks the layers; writes go to the top (writable) layer using copy-on-write. Two containers from the same image share their entire read-only stack until one writes, at which point the changed file is copied into that container's top layer.
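
The lookup and copy-on-write rules can be modelled with plain dictionaries. This is a conceptual sketch only: the real overlay filesystem works on directories and inodes inside the kernel, and the OverlayView class here is invented for illustration.

```python
from typing import Optional

WHITEOUT = object()  # marker meaning "this path was deleted in the upper layer"


class OverlayView:
    """One container's merged view: shared read-only layers plus a private top."""

    def __init__(self, lower_layers: list[dict[str, bytes]]):
        # Ordered bottom to top; shared between containers and never mutated.
        self.lower_layers = lower_layers
        self.upper: dict[str, object] = {}

    def read(self, path: str) -> Optional[bytes]:
        if path in self.upper:                      # the top layer wins
            entry = self.upper[path]
            return None if entry is WHITEOUT else entry
        for layer in reversed(self.lower_layers):   # then search top-down
            if path in layer:
                return layer[path]
        return None

    def append(self, path: str, data: bytes) -> None:
        # Copy-up: bring the current content into the upper layer, then modify it.
        current = self.read(path) or b""
        self.upper[path] = current + data

    def delete(self, path: str) -> None:
        self.upper[path] = WHITEOUT                 # hide it, leave lowers intact


base = {"/etc/os-release": b"ubuntu"}
app = {"/app/main.py": b"print(1)"}
first, second = OverlayView([base, app]), OverlayView([base, app])

first.append("/etc/os-release", b"!")
print(first.read("/etc/os-release"))    # b'ubuntu!'
print(second.read("/etc/os-release"))   # b'ubuntu'
```

The second container never sees the first one's change, and the shared base layer itself is untouched.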

OCI: the standard that makes images portable

A spec, not a runtime. Three artifacts.

The Open Container Initiative defines three specs that decouple "what runs the container" from "what builds the image": the image format, the runtime spec (what bundle a runtime accepts and how it should configure namespaces / cgroups / mounts), and a distribution API (how registries push and pull). The OCI image format grew out of Docker's image manifest format (version 2, schema 2). Podman, containerd, and CRI-O all use OCI runtimes (typically runc or crun).
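
To give a feel for the runtime spec, here is a sketch that builds a small subset of a bundle's config.json. The field names follow the OCI runtime spec, but this is deliberately not a complete configuration (a real one also lists mounts, capabilities, and more), and minimal_runtime_config is a hypothetical helper.

```python
import json


def minimal_runtime_config(args, memory_bytes, cpu_quota_us,
                           cpu_period_us=100_000):
    """Return a partial OCI runtime config as a plain dict."""
    return {
        "ociVersion": "1.0.2",
        "process": {
            "args": list(args),
            "cwd": "/",
            "user": {"uid": 0, "gid": 0},
        },
        # Path is relative to the bundle directory that holds config.json.
        "root": {"path": "rootfs", "readonly": True},
        "linux": {
            # Each entry asks the runtime to create a fresh namespace of that type.
            "namespaces": [
                {"type": kind}
                for kind in ("pid", "mount", "network", "uts", "ipc")
            ],
            "resources": {
                "memory": {"limit": memory_bytes},
                "cpu": {"quota": cpu_quota_us, "period": cpu_period_us},
            },
        },
    }


config = minimal_runtime_config(["nginx", "-g", "daemon off;"],
                                memory_bytes=2 * 1024**3,
                                cpu_quota_us=100_000)
print(json.dumps(config, indent=2))
```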

This is what lets Kubernetes treat container runtimes as pluggable behind a single interface. Replace runc with Kata (VM-isolated), gVisor (user-space kernel), or youki (a runtime written in Rust): same image, same config, different sandbox properties.

Containers vs VMs: same kernel, different boundary

Same kernel, different boundary.

A virtual machine boots a full operating system on virtual hardware — its own kernel, drivers, init system. Heavyweight, but the security boundary is the hypervisor, which is small and well-audited.

A container shares the host kernel. Lightweight (start in milliseconds, MB of overhead vs GB), but the security boundary is the kernel itself, a much larger surface. Container escapes have happened (the runc /proc/self/exe overwrite, CVE-2019-5736; the runc leaky-fd escape, CVE-2024-21626). For untrusted code, stronger isolation (Kata or Firecracker microVMs, or gVisor's user-space kernel) layered under the container API gets you much of both: close to container speed, with a boundary closer to a VM's.

What containers do not solve

It is still a process on a shared OS.

A container is a process. It still suffers from noisy-neighbour effects on the host (CPU steal, I/O contention) unless cgroups are tuned aggressively; autoscaling works around the symptoms but doesn't fix the cause. Persistent state needs explicit volume mounts; the writable layer is gone once the container is removed. Logging and metrics need pipes to host-side collectors. Secrets belong outside the image, not baked in.

If you've seen "it works on my machine" become "it works in my container" become "it stops working in production," that gap is usually one of the four above.

Container security and the escape problem

The kernel is a big surface.

The container security boundary is the Linux kernel, which every container shares. Compared to a VM (whose boundary is the hypervisor; a minimal VMM such as Firecracker is ~50,000 lines of code), the kernel is ~30 million lines and a much larger attack surface. Notable public container-escape CVEs:

CVE-2019-5736 · runc: A malicious container running as root could overwrite the host's runc binary through /proc/self/exe and gain root on the host. Patched in days; affected almost every container runtime.
CVE-2022-0185 · Linux kernel: Heap overflow in fs/fs_context.c. A process with CAP_SYS_ADMIN, including one that gained it inside an unprivileged user namespace, could trigger it to escape to the host. Fixed in 5.16.2.
CVE-2024-21626 · runc "Leaky Vessels": File descriptor leak let a container's WORKDIR reference escape into the host filesystem. Affected Docker, containerd, Podman.
CVE-2022-23648 · containerd: Symlink-following bug let a container read arbitrary files from the host via volume mounts.

Defense in depth. Don't rely on the kernel boundary alone. Layer: seccomp profiles (whitelist syscalls — Docker's default blocks ~50 of them); AppArmor or SELinux (mandatory-access-control); read-only root filesystem (--read-only); drop capabilities (--cap-drop=ALL); non-root user in the image (USER 1000); no privileged mode ever. CIS Docker Benchmark and Kubernetes Pod Security Standards encode these.
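
The flags named above compose into a single docker run invocation. The sketch below only builds the argument list and does not launch anything; hardened_run_argv is a hypothetical helper, and the seccomp profile path is whatever file you supply.

```python
import shlex


def hardened_run_argv(image, uid=1000, seccomp_profile=None):
    """Build a `docker run` argv that applies the hardening options above."""
    argv = [
        "docker", "run",
        "--read-only",       # read-only root filesystem
        "--cap-drop=ALL",    # start with no Linux capabilities
        f"--user={uid}",     # run as a non-root user
    ]
    if seccomp_profile:
        # Replace Docker's default seccomp profile with a custom one.
        argv.append(f"--security-opt=seccomp={seccomp_profile}")
    argv.append(image)
    return argv


print(shlex.join(hardened_run_argv("nginx")))
```

Many images need a writable /tmp or a specific capability to start, so in practice a baseline like this is usually loosened one option at a time rather than applied blindly.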

For untrusted code, use sandboxed runtimes. Kata Containers wraps each container (each pod, under Kubernetes) in a lightweight VM (~10MB overhead, ~200ms boot). gVisor intercepts syscalls in user space (the "Sentry"), trading some performance for kernel-attack-surface reduction. Firecracker is what AWS Lambda and Fargate use under the hood: microVMs that boot in ~125ms with ~5MB memory overhead each. Kata and gVisor ship OCI-compatible runtimes, and Firecracker can serve as the VMM underneath Kata, so Kubernetes can select any of them via runtimeClassName.
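
Selecting a sandboxed runtime in Kubernetes takes two objects: a RuntimeClass that names a handler, and a pod that refers to it. The sketch below builds both as plain dicts, which can be serialized to JSON manifests. The names "sandboxed", "runsc", and "untrusted-job" are assumptions for illustration; the handler must match whatever the node's container runtime has actually been configured with.

```python
import json


def runtime_class(name, handler):
    """A RuntimeClass maps a cluster-wide name to a node-level runtime handler."""
    return {
        "apiVersion": "node.k8s.io/v1",
        "kind": "RuntimeClass",
        "metadata": {"name": name},
        "handler": handler,
    }


def sandboxed_pod(pod_name, image, runtime_class_name):
    """A pod opts in to the sandbox through spec.runtimeClassName."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": pod_name},
        "spec": {
            "runtimeClassName": runtime_class_name,
            "containers": [{"name": "main", "image": image}],
        },
    }


manifests = [
    runtime_class("sandboxed", handler="runsc"),       # hypothetical handler name
    sandboxed_pod("untrusted-job", "nginx", "sandboxed"),
]
print(json.dumps(manifests, indent=2))
```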

The runtimes that run containers in production

Different tools, same OCI spec.

The runtime stack has two layers. The high-level runtime (also called the "container engine") manages images, networking, volumes — Docker, containerd, CRI-O, Podman live here. The low-level runtime actually creates the namespaces and cgroups — runc is the dominant one; crun and youki are alternatives.

Docker Engine: The original. Includes a CLI, a daemon, image build, BuildKit. Kubernetes 1.24 (May 2022) removed the dockershim adapter, deprecated since 1.20, so the kubelet no longer talks to Docker Engine directly; Docker Desktop and standalone Docker are still huge for local dev.
containerd: Born from Docker, donated to CNCF in 2017, graduated 2019. The default runtime in EKS, GKE, AKS, kind, and most managed Kubernetes today. Lighter than Docker (no build, no networking magic), exposes the CRI directly to kubelet.
CRI-O: Red Hat's CRI-only runtime, deliberately the narrowest in scope of the group. Default in OpenShift. Tighter security defaults than containerd; smaller surface.
Podman: Daemonless (no background process), drop-in CLI compatibility with Docker. Strong on rootless containers (a regular user can run Podman without sudo). Default on RHEL 8+ and Fedora.

Real-world picks. Build images on a developer laptop: Docker Desktop or Podman. Run containers in a Kubernetes cluster: containerd or CRI-O via the kubelet. Run untrusted user code (CI runners, sandboxed function execution, multi-tenant SaaS): Kata or gVisor under containerd, or Firecracker if your hosts expose KVM. The OCI standard makes the choice mostly operational, not architectural. For the cluster-side view, where the kubelet drives the CRI from pod spec to running container, see Kubernetes internals.

What actually happens on docker run nginx

Every piece above, in order, in about 300 milliseconds.

You type docker run nginx. The CLI doesn't run anything — it POSTs to the Docker daemon over a Unix socket, and the daemon delegates to containerd. containerd checks its local content store for the nginx image. Not there, so it talks to the registry: fetch the manifest for the latest tag, read the list of layer digests, then download each layer blob it doesn't already have, verifying every one against its sha256. Each layer unpacks into its own snapshot directory under /var/lib/containerd.
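
The "download only what is missing, verify everything" step comes down to two small checks. This sketch assumes a manifest already parsed into a dict with a layers list of digest entries, as in the OCI image manifest; the function names and the sample blobs are illustrative.

```python
import hashlib


def verify_blob(blob: bytes, digest: str) -> bool:
    """Check a downloaded blob against a digest of the form 'sha256:<hex>'."""
    algorithm, _, expected_hex = digest.partition(":")
    if algorithm != "sha256":
        raise ValueError(f"unsupported digest algorithm: {algorithm!r}")
    return hashlib.sha256(blob).hexdigest() == expected_hex


def layers_to_fetch(manifest: dict, local_digests: set[str]) -> list[str]:
    """Digests listed in the manifest that the local store does not hold yet."""
    return [
        layer["digest"]
        for layer in manifest["layers"]
        if layer["digest"] not in local_digests
    ]


# Two made-up layer blobs standing in for real tarballs.
blob_a, blob_b = b"base layer", b"nginx layer"
digest_a = "sha256:" + hashlib.sha256(blob_a).hexdigest()
digest_b = "sha256:" + hashlib.sha256(blob_b).hexdigest()
manifest = {"layers": [{"digest": digest_a}, {"digest": digest_b}]}

missing = layers_to_fetch(manifest, local_digests={digest_a})
print(missing == [digest_b])           # True: only the second layer is pulled
print(verify_blob(blob_b, digest_b))   # True
```

Because a layer's name is its hash, a corrupted or tampered download fails verification, and an already-cached layer is never fetched twice.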

Next, the rootfs. containerd mounts an overlay filesystem: the read-only layers become lowerdir entries, a fresh empty directory becomes upperdir, and the merged view appears at a single mount point. That merged directory is the container's root filesystem. Nothing was copied — a second nginx container would reuse every lower layer and get only its own empty upperdir.
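
The overlay mount itself is driven by one option string. The sketch below assembles it from a list of layer directories; the paths are placeholders, and the function only builds the arguments rather than mounting anything. Two details are easy to get wrong: lowerdir lists the topmost layer first, and overlayfs also needs a workdir on the same filesystem as upperdir.

```python
def overlay_mount_options(layers_bottom_to_top, upperdir, workdir):
    """Build the -o option string for an overlay mount."""
    # overlayfs expects the topmost lower layer on the left.
    lowerdir = ":".join(reversed(layers_bottom_to_top))
    return f"lowerdir={lowerdir},upperdir={upperdir},workdir={workdir}"


def overlay_mount_argv(layers_bottom_to_top, upperdir, workdir, merged):
    """argv for the equivalent mount(8) call (needs root to actually run)."""
    options = overlay_mount_options(layers_bottom_to_top, upperdir, workdir)
    return ["mount", "-t", "overlay", "overlay", "-o", options, merged]


# Placeholder paths, not containerd's real snapshot layout.
print(overlay_mount_options(["/layers/base", "/layers/nginx"],
                            "/ctr/upper", "/ctr/work"))
```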

containerd writes an OCI bundle — a config.json next to that rootfs — and hands it to a shim process, which invokes runc. runc calls clone() with CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET | CLONE_NEWUTS | CLONE_NEWIPC. The child is born into fresh namespaces: it sees an empty process table and a network stack with nothing but loopback. The runtime writes the child's PID into a new cgroup directory and sets memory.max and cpu.max from whatever --memory and --cpus flags you passed. From that write onward the kernel enforces the limits; the runtime's job there is done.
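
The cgroup step is nothing more than writing a few small files. This sketch takes the cgroup directory as a parameter so it can be tried against a temporary directory; pointing it at a real path under /sys/fs/cgroup would need root and the cpu and memory controllers enabled in the parent. apply_limits is a hypothetical helper, not runc's code.

```python
import os
import tempfile


def apply_limits(cgroup_dir, pid, memory_bytes, cpus, period_us=100_000):
    """Write memory.max and cpu.max, then move the process into the cgroup."""
    os.makedirs(cgroup_dir, exist_ok=True)
    quota_us = round(cpus * period_us)  # --cpus=1.5 -> 150000 us per 100000 us
    settings = {
        "memory.max": str(memory_bytes),
        "cpu.max": f"{quota_us} {period_us}",
        # Written last, so the limits are in place before the process joins.
        "cgroup.procs": str(pid),
    }
    for name, value in settings.items():
        with open(os.path.join(cgroup_dir, name), "w") as handle:
            handle.write(value)
    return settings


if __name__ == "__main__":
    # Dry run against a scratch directory instead of the real cgroup tree.
    with tempfile.TemporaryDirectory() as scratch:
        print(apply_limits(os.path.join(scratch, "demo"), os.getpid(),
                           memory_bytes=512 * 1024**2, cpus=1.5))
```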

Still inside the child, before anything from the image runs: mount /proc and /dev, then pivot_root onto the overlay's merged directory. pivot_root swaps the root mount, and the runtime then unmounts the old root, so the host's filesystem isn't just hidden; it's unreachable. Then the runtime drops capabilities, applies the seccomp profile, and calls execve() on the image's entrypoint: nginx -g 'daemon off;'. The exec replaces the setup code in place, so nginx inherits the namespaces, the cgroup membership, and a root it can't see out of. Inside its PID namespace it is PID 1, with PID 1's duties: reap zombies, handle signals. On the host it's just process 48-thousand-something.
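
The "reap zombies" duty is worth seeing in code, because a PID 1 that never waits on its children leaves dead entries in the process table. The sketch below is a simplified illustration for a POSIX system: a real init reaps in response to SIGCHLD while doing other work, whereas this version simply blocks until every child is gone. Both function names are invented for the example.

```python
import os


def spawn(exit_code: int) -> int:
    """Fork a child that exits immediately with the given code."""
    pid = os.fork()
    if pid == 0:
        os._exit(exit_code)  # child: leave without running parent cleanup
    return pid


def reap_all_children() -> dict[int, int]:
    """Wait on every child so none is left as a zombie; return pid -> exit code."""
    exit_codes = {}
    while True:
        try:
            pid, status = os.wait()
        except ChildProcessError:
            return exit_codes  # no children left to wait for
        if os.WIFEXITED(status):
            exit_codes[pid] = os.WEXITSTATUS(status)
        else:
            exit_codes[pid] = -os.WTERMSIG(status)  # killed by a signal


children = [spawn(0), spawn(3)]
codes = reap_all_children()
print([codes[pid] for pid in children])   # [0, 3]
```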

Meanwhile the daemon wired up networking: a veth pair with one end inside the container's net namespace as eth0 and the other plugged into the docker0 bridge, plus a NAT rule if you passed -p. Total elapsed: maybe 300 ms warm; when cold, the image pull dominates. Nothing booted. No kernel loaded. One process was created with unusual flags, confined, and replaced itself with nginx.

Containers at scale: three case studies

Where the abstraction earns its complexity.

Netflix (Titus, ~3 million container starts per day). Built their own scheduler on top of Mesos + their own container runtime, then migrated to Kubernetes-on-Titus in 2024. Run a mix of containerd (most workloads) and Firecracker (for untrusted creator-uploaded code in their VFX pipeline). Memory limits are a hard rule — every container has a quantitative SLO and OOMs are pages.

AWS Lambda (trillions of invocations a month). Each function execution environment runs in a Firecracker microVM: ~125 ms microVM boot, ~5 MB memory overhead per VM, thousands of microVMs per host. The choice was deliberate: the multi-tenant boundary at AWS scale must survive arbitrary kernel exploits, and Firecracker's tiny attack surface (~50,000 LOC) meets that bar where a Linux container alone would not.

GitHub Actions runners. GitHub-hosted runners spin up a fresh VM per job (Azure VM, ~20-second boot). Self-hosted runners often run inside containers, for example on Kubernetes via Actions Runner Controller. The trade-off is exactly the security boundary: untrusted PR code on a hosted runner gets a VM; trusted internal code in a container is fine.

A closing note

Containers are not a virtualization technology. They are a packaging and isolation convention built from existing Linux primitives. Eight namespaces, a handful of cgroups, an overlay filesystem. Once you internalise that, every "container does X" question reduces to "which kernel feature is being used."