Memory management

Every process boots into a private, mostly empty address space and then has to fill it: code to run, data to keep, a stack to call through, a heap to grow on demand. Memory management is the set of moving parts that hand out that space and take it back. This page walks the path an allocation takes, from a malloc call down through the allocator's free lists to the kernel syscalls that actually enlarge the process, and back up to the bugs that live in the gap.

What "memory" a process actually sees

The first thing to get straight is that a process never touches physical RAM directly. It sees a virtual address space: a flat range of addresses, all its own, that the hardware and kernel translate to physical pages behind its back. On 64-bit x86 that range is huge (128 TiB of usable user space on Linux with the default four-level page tables) and almost all of it is empty. The translation from virtual to physical, the page tables that hold it, and what happens on a page fault are the subject of the virtual memory page; here we care about how that space is carved up and filled.

The space is divided into regions, each with a purpose. At the low end sits the text segment: the program's machine code, mapped read-only and shared between every process running the same binary. Above it, data holds initialised globals (a static int n = 42; lives here), and bss holds uninitialised or zero globals, which take no space in the file and get zeroed on first touch. Above those is the heap, the region that grows upward as the program asks for more dynamic memory. Near the top of the space is the stack, which grows downward, one frame per function call. And in the wide gap between heap and stack lives the mmap region, where shared libraries, memory-mapped files, and large dynamic allocations land.
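To make the layout concrete, the sketch below collects one representative address from each region. The names (sample_regions, struct region_sample) are invented for illustration, and the exact values shift from run to run under address-space layout randomisation, so treat the output as a rough map, not a fixed layout.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

static int initialised_global = 42;   /* has a non-zero initial value: data */
static int zeroed_global;             /* no initialiser: bss */

/* One address from each region, stored as plain integers for printing. */
struct region_sample {
    uintptr_t text, data, bss, heap, stack;
};

struct region_sample sample_regions(void) {
    int local = 0;                    /* lives in this call's stack frame */
    void *heap_block = malloc(16);    /* a small request, served from the heap */
    assert(heap_block != NULL);

    struct region_sample s = {
        .text  = (uintptr_t)&sample_regions,      /* machine code */
        .data  = (uintptr_t)&initialised_global,
        .bss   = (uintptr_t)&zeroed_global,
        .heap  = (uintptr_t)heap_block,
        .stack = (uintptr_t)&local,
    };
    free(heap_block);
    return s;
}
```

Printing the five fields in hexadecimal on a typical Linux x86-64 build tends to show the stack address far above the others, with text, data, bss, and heap clustered lower down, though nothing in the C language guarantees that ordering.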

Two of these regions are elastic. The stack grows automatically: when a function call writes below the current stack pointer into an unmapped page, the kernel sees the fault, recognises it as stack growth, and maps another page — up to the stack rlimit, after which you get a segmentation fault. The heap grows on request, and that request is where the allocator comes in. Everything that follows is, in effect, the story of how the heap and the mmap region get bigger and smaller as a program runs.

"Allocated" is not "used." Asking for memory only reserves a range of virtual addresses; the kernel does not back it with physical pages until you write to each one. A 1 GB allocation that you never touch costs almost nothing in RAM. This lazy backing, called demand paging, is why a process's virtual size (VSZ) is usually far larger than its resident size (RSS).
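Demand paging can be observed directly on Linux. The sketch below maps a region, asks the kernel through mincore how many of its pages are resident, writes one byte to every page, and asks again. The function and struct names are hypothetical, and it assumes a Linux-like system where mincore and anonymous mmap are available.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* exposes mincore and MAP_ANONYMOUS on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Count the pages of [addr, addr + len) that are resident in RAM.
   Returns 0 if mincore itself fails. */
static size_t resident_pages(void *addr, size_t len, size_t page) {
    size_t n = (len + page - 1) / page;
    unsigned char *vec = malloc(n);
    assert(vec != NULL);
    size_t count = 0;
    if (mincore(addr, len, vec) == 0)
        for (size_t i = 0; i < n; i++)
            count += vec[i] & 1;      /* low bit set means the page is resident */
    free(vec);
    return count;
}

struct paging_result { size_t pages, before, after; };

struct paging_result demand_paging_demo(size_t pages) {
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = pages * page;
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    assert(p != MAP_FAILED);

    struct paging_result r = { pages, 0, 0 };
    r.before = resident_pages(p, len, page);   /* reserved, not yet touched */
    for (size_t i = 0; i < pages; i++)
        p[i * page] = 1;                       /* one write per page faults it in */
    r.after = resident_pages(p, len, page);

    munmap(p, len);
    return r;
}
```

On an ordinary Linux kernel the before count is expected to be zero or close to it and the after count to match the number of pages touched, which is the VSZ versus RSS gap in miniature.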

How the heap grows: brk and mmap

The heap has a single boundary the kernel tracks, the program break: the address just past the end of the heap. Moving the break up makes the heap bigger; moving it down makes it smaller. Two syscalls do this. brk(addr) sets the break to an absolute address. sbrk(increment) moves it by a relative amount and returns the old value, which makes "give me a bit more heap" a one-liner. Historically every malloc was built on sbrk: the allocator pushed the break up to get a slab of raw heap, then handed pieces of it out.
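A minimal sketch of the sbrk one-liner follows. The wrapper names grow_heap and current_break are made up here, and the code assumes a Linux system where sbrk is still available; new programs should leave the break to the allocator.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* exposes sbrk on glibc */
#endif
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

/* Current program break: sbrk(0) moves nothing and reports the boundary. */
void *current_break(void) {
    return sbrk(0);
}

/* Move the break by `bytes` (negative shrinks the heap).
   Returns the break as it was before the move, which for a positive
   increment is the start of the newly added space, or NULL on failure. */
void *grow_heap(intptr_t bytes) {
    void *old_break = sbrk(bytes);
    return old_break == (void *)-1 ? NULL : old_break;
}
```

Calling grow_heap(4096) and then current_break() should show the boundary 4096 bytes higher; calling grow_heap(-4096) puts it back, provided nothing else moved the break in between.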

The break has a structural weakness, though: it is a single number. You can only release heap memory back to the kernel by lowering the break, and you can only lower it past memory that is contiguous and free. If you allocate A, then B, then free A but keep B, the break cannot move down past B, so A's pages stay charged to your process even though nothing uses them. For long-lived programs that allocate in waves, this turns the heap into a high-water mark that never recedes.

That is why modern allocators reach for mmap instead, especially for large requests. mmap asks the kernel for a fresh, independent region of the address space — anywhere in that wide gap, not tied to the break. Each mapping can be unmapped on its own with munmap, which returns its pages to the kernel immediately regardless of what else is allocated. glibc's allocator uses a threshold (default 128 KB, tunable with M_MMAP_THRESHOLD): requests below it come from the brk heap, requests above it get their own mmap so they can be returned cleanly when freed. The trade-off is that mmap is a heavier syscall and each mapping is rounded up to a whole page, so it would be wasteful for the thousands of tiny allocations a typical program makes.
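The mmap path for a large request can be sketched in a few lines. The helpers below (round_up_to_page, region_alloc, region_free are illustrative names, not glibc internals) show the two properties the paragraph relies on: the size is rounded up to whole pages, and the region can be handed back on its own.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Round a byte count up to a whole number of pages. */
size_t round_up_to_page(size_t bytes) {
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    return (bytes + page - 1) / page * page;
}

/* Ask the kernel for a private, zero-filled region independent of the break. */
void *region_alloc(size_t bytes) {
    void *p = mmap(NULL, round_up_to_page(bytes), PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Return the whole region to the kernel, whatever else is still allocated. */
int region_free(void *p, size_t bytes) {
    return munmap(p, round_up_to_page(bytes));
}
```

With 4 KB pages, a one-byte request through this path would still occupy a full page of address space, which is the waste that keeps small allocations on the shared heap.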

Small allocations are cut from the shared brk heap and leave holes when freed; large ones get a private mmap that can be returned in full.

Inside malloc and free

A syscall per allocation would be far too slow, so the allocator does the real work in userspace and only touches the kernel when it needs more raw memory. The allocator's job is to take the big slabs it gets from brk and mmap and slice them into the exact-sized blocks your code asks for, then recycle those blocks as they come back. The data structure at the centre of this is the free list: a linked list of blocks that are available to hand out. Crucially, the links live inside the free blocks themselves — a free block's first bytes hold the pointer to the next free block — so the bookkeeping costs no extra memory while a block is idle.

Each block carries a small header just before the pointer it hands you, recording the block's size and whether it is in use. That header is why free(ptr) needs only the pointer and not the size: it reads the size from the bytes immediately before ptr. It is also why writing before the start of an allocation, or freeing a pointer that is not exactly what malloc returned, corrupts the heap — you are scribbling on or misreading that header. When you call free, the allocator flips the in-use bit, pushes the block back onto a free list, and — importantly — checks whether the physically adjacent blocks are also free. If they are, it coalesces them into one larger free block, so that a later large request can be satisfied from the merged space rather than failing amid a pile of small holes.
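The header and coalescing ideas fit in a toy allocator over a fixed 4 KB buffer. This is a sketch under simplifying assumptions, not glibc's design: it finds free blocks by walking the headers instead of keeping an explicit free list, it merges neighbours with a full pass on every free, and all names (toy_malloc, toy_free, header_t) are invented.

```c
#include <assert.h>
#include <stddef.h>

#define ARENA_BYTES 4096
#define ALIGN 16
#define IN_USE ((size_t)1)

/* Header stored immediately before every payload. `size` is the whole
   block (header + payload); its low bit doubles as the in-use flag. */
typedef struct { _Alignas(ALIGN) size_t size; } header_t;

static header_t arena[ARENA_BYTES / sizeof(header_t)];
static int ready;

static size_t blk_size(const header_t *h) { return h->size & ~IN_USE; }
static header_t *arena_end(void) { return arena + ARENA_BYTES / sizeof(header_t); }
static header_t *next_blk(header_t *h) {
    return (header_t *)((unsigned char *)h + blk_size(h));
}

void *toy_malloc(size_t n) {
    if (!ready) { arena[0].size = ARENA_BYTES; ready = 1; }  /* one big free block */
    if (n == 0 || n > ARENA_BYTES) return NULL;
    size_t need = (sizeof(header_t) + n + ALIGN - 1) / ALIGN * ALIGN;
    for (header_t *h = arena; h < arena_end(); h = next_blk(h)) {
        if ((h->size & IN_USE) || blk_size(h) < need) continue;
        size_t rest = blk_size(h) - need;
        if (rest >= sizeof(header_t) + ALIGN) {   /* split: the tail stays free */
            h->size = need;
            next_blk(h)->size = rest;
        }
        h->size |= IN_USE;
        return h + 1;                             /* payload starts after the header */
    }
    return NULL;
}

void toy_free(void *p) {
    if (p == NULL) return;
    header_t *h = (header_t *)p - 1;   /* the size lives just before the pointer */
    assert(h->size & IN_USE);          /* catches some, not all, double frees */
    h->size &= ~IN_USE;
    /* Coalesce: merge every free block with the free blocks that follow it. */
    for (header_t *b = arena; b < arena_end(); b = next_blk(b)) {
        if (b->size & IN_USE) continue;
        while (next_blk(b) < arena_end() && !(next_blk(b)->size & IN_USE))
            b->size += blk_size(next_blk(b));
    }
}
```

Allocate three 100-byte blocks, free the first and third, and a 3900-byte request fails because the middle block splits the free space. Free the middle one too and the same request succeeds, served from the single merged block.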

The free list threads through idle blocks using their own bytes. Adjacent free blocks merge so big requests can still be served.

Searching one big free list for a block of the right size would get slow, so real allocators bucket free blocks by size. glibc keeps several kinds of bins: fastbins for the smallest sizes, kept as simple last-in-first-out stacks that skip coalescing for speed; small bins with one exact size each; large bins that each hold a range of sizes, kept sorted; and the unsorted bin, a holding area that recently freed blocks pass through before being filed. There is also a per-thread tcache, a small cache of recently freed blocks of each size that a thread can allocate from and free into without taking any lock at all, which makes it the fast path for the overwhelming majority of allocations. When a request comes in, the allocator looks in the tcache and the bin for that size class first; only if they come up empty does it fall back to splitting a larger block or asking the kernel for more.
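The bin idea can be sketched as an array of last-in-first-out stacks, one per size class. The class spacing (16-byte steps up to 128 bytes), the names, and the sized bin_free call are assumptions made to keep the example short; glibc reads the size from the block header and uses different class boundaries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define CLASS_STEP 16
#define NUM_BINS 8            /* classes of 16, 32, ..., 128 bytes */

typedef struct node { struct node *next; } node_t;
static node_t *bins[NUM_BINS];   /* one LIFO stack of free blocks per class */

/* Map a request size to its bin index, or -1 if it is too big for any bin. */
int bin_index(size_t n) {
    if (n == 0) n = 1;
    size_t idx = (n - 1) / CLASS_STEP;
    return idx < NUM_BINS ? (int)idx : -1;
}

void *bin_alloc(size_t n) {
    int i = bin_index(n);
    if (i < 0) return malloc(n);          /* large request: bypass the bins */
    if (bins[i] != NULL) {                /* fast path: pop the newest free block */
        node_t *blk = bins[i];
        bins[i] = blk->next;
        return blk;
    }
    /* Bin empty: fall back to the layer below, asking for the full class size. */
    return malloc((size_t)(i + 1) * CLASS_STEP);
}

void bin_free(void *p, size_t n) {
    if (p == NULL) return;
    int i = bin_index(n);
    if (i < 0) { free(p); return; }
    node_t *blk = p;                      /* the link lives in the block's own bytes */
    blk->next = bins[i];
    bins[i] = blk;
}
```

Because each bin is a stack, the block freed most recently is the first one handed out again, which tends to keep hot memory in cache.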

The userspace view of all this — the API surface, who calls what, and how managed runtimes layer their own collectors on top — is covered in how memory allocation works.

Arenas and the cost of threads

A single global heap protected by one lock would serialise every allocating thread, which is fatal for a busy server. glibc's answer is arenas: independent heaps, each with its own bins and its own lock. The main thread uses the main arena, which sits on the brk heap. Additional threads get their own arenas, carved out of mmap regions, up to a cap that on 64-bit systems defaults to eight times the number of CPUs. A thread that finds its arena locked tries another rather than blocking. This is why a multithreaded program's heap is not one heap but several, and why two threads can be allocating at full speed at the same time.

Arenas cost memory, though. Each one keeps its own pool of free blocks, and a block freed in one arena is not available to another. A program with many threads can end up with a noticeably larger resident size than the same workload single-threaded, simply because free memory is scattered across arenas that cannot share it. Tuning MALLOC_ARENA_MAX down trades a little allocation throughput for a tighter memory footprint, which is a common knob on memory-constrained containers. Alternative allocators make different bets here: jemalloc and tcmalloc lean even harder on per-thread caches to cut contention, often at the cost of holding more memory in reserve.
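The arena cap can also be set from inside the program. The sketch below wraps glibc's mallopt with the M_ARENA_MAX parameter, the in-process counterpart of the MALLOC_ARENA_MAX environment variable. The wrapper name cap_arenas is invented, and on a C library without this parameter it simply reports failure.

```c
#include <assert.h>
#include <malloc.h>   /* glibc: mallopt and M_ARENA_MAX */

/* Limit the number of malloc arenas. Returns 0 on success, -1 if the
   C library does not support the setting or rejects it. Best called
   early, before worker threads start allocating. */
int cap_arenas(int max_arenas) {
    assert(max_arenas > 0);
#if defined(__GLIBC__) && defined(M_ARENA_MAX)
    return mallopt(M_ARENA_MAX, max_arenas) == 1 ? 0 : -1;
#else
    (void)max_arenas;
    return -1;
#endif
}
```

Whether a lower cap actually shrinks resident size depends on the workload, so it is worth measuring RSS before and after, not assuming the result.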

Fragmentation: internal and external

Fragmentation is wasted space, and it comes in two flavours that are worth keeping apart. Internal fragmentation is space wasted inside a block: you asked for 50 bytes, the allocator rounded up to a 64-byte size class plus a header, and the leftover 14-plus bytes are yours by name but unusable. It is the price of having a manageable number of size classes instead of a custom size for every request, and it is bounded and predictable.
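The arithmetic is easy to check. The sketch below assumes power-of-two size classes starting at 16 bytes, which is a hypothetical scheme chosen for simplicity and not glibc's actual class table, and it ignores the header.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest power-of-two class, starting at 16 bytes, that fits the request. */
size_t size_class(size_t request) {
    assert(request <= ((size_t)1 << 30));   /* keeps the doubling from overflowing */
    size_t cls = 16;
    while (cls < request)
        cls *= 2;
    return cls;
}

/* Bytes handed out but never usable by the caller. */
size_t internal_waste(size_t request) {
    return size_class(request) - request;
}
```

Under this scheme a 50-byte request lands in the 64-byte class and wastes 14 bytes, while the worst case is a request one byte over a class boundary: 65 bytes takes a 128-byte block and wastes 63.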

External fragmentation is space wasted between blocks: you have plenty of total free memory, but it is split into holes too small for the request in front of you. Allocate a thousand small objects, free every other one, and you now have a checkerboard — half your heap is free, but a request for a contiguous medium-sized block fails. Coalescing fights this by merging adjacent free holes, and size-class bins reduce it by keeping like sizes together, but no allocator eliminates it for adversarial allocation patterns. External fragmentation is the quiet reason a long-running server's memory creeps up even when its live data set is flat: the heap is full of holes it cannot reuse and cannot return.
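The checkerboard is simple to simulate. The sketch below models the heap as a thousand equal slots, which is an abstraction for illustration and not how any real allocator stores its state, and compares total free space with the largest contiguous run.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SLOTS 1000
static bool used[SLOTS];   /* true means the slot holds a live object */

void fill_all(void) {
    for (size_t i = 0; i < SLOTS; i++) used[i] = true;
}

void free_every_other(void) {
    for (size_t i = 0; i < SLOTS; i += 2) used[i] = false;
}

size_t total_free(void) {
    size_t n = 0;
    for (size_t i = 0; i < SLOTS; i++) n += !used[i];
    return n;
}

/* Longest run of adjacent free slots: the biggest request that can succeed. */
size_t largest_free_run(void) {
    size_t best = 0, run = 0;
    for (size_t i = 0; i < SLOTS; i++) {
        run = used[i] ? 0 : run + 1;
        if (run > best) best = run;
    }
    return best;
}
```

After fill_all and free_every_other, 500 slots are free but the largest run is a single slot, so a request for two contiguous slots cannot be served even though half the heap is empty.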

This is what most "memory leaks" in C actually look like. True leaks — allocating and losing the pointer — are one cause of growth. But a great deal of "the service uses more RAM every day" is fragmentation: memory that is freed but stranded in holes, or charged to the heap above a live block the break cannot pass. The fix is often an allocator swap or an arena-count tweak, not a missing free.

The kernel's own allocator: slab

The kernel allocates memory too, and it has the same problem in reverse: it constantly creates and destroys small, fixed-shape objects — a task_struct for every process, an inode for every open file, network buffers by the thousand. Running a general-purpose allocator for those would waste time on size lookups and waste space on rounding. So the kernel uses slab allocation, introduced by Bonwick in Solaris and adopted by Linux.

The idea is a cache per object type. A slab cache for task_struct grabs whole pages from the lower-level page allocator and pre-divides each page into task_struct-sized slots. Allocating is then O(1): pop a free slot off the cache and return it. Freeing is O(1): push it back. Because every slot in a cache is the same size and shape, there is no internal rounding beyond the object's own size, and a freed object can be reused immediately by the next request of the same type without any merging. The cache can even keep freed objects partly constructed, so common fields do not have to be re-initialised on the next allocation. Linux ships this as SLUB, the modern, simpler reimplementation of the original SLAB design, and you can watch the caches live in /proc/slabinfo.
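A userspace sketch of the same idea follows. It is not kernel code: the names are invented, the cache owns a single 4 KB slab obtained from malloc instead of the page allocator, and it skips constructors, per-CPU lists, and growing beyond one slab.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SLAB_BYTES 4096   /* stands in for one page */

typedef struct slot { struct slot *next; } slot_t;

typedef struct {
    size_t obj_size;   /* every object in this cache has this size */
    slot_t *free;      /* LIFO list threaded through the free slots */
    void *slab;        /* backing memory, pre-divided into slots */
} cache_t;

int cache_init(cache_t *c, size_t obj_size) {
    size_t a = sizeof(slot_t);
    c->obj_size = (obj_size + a - 1) / a * a;   /* room for a link, pointer-aligned */
    if (c->obj_size == 0) c->obj_size = a;
    c->free = NULL;
    c->slab = malloc(SLAB_BYTES);
    if (c->slab == NULL) return -1;
    /* Push slots in reverse so that slot 0 is the first one handed out. */
    for (size_t i = SLAB_BYTES / c->obj_size; i-- > 0; ) {
        slot_t *s = (slot_t *)((unsigned char *)c->slab + i * c->obj_size);
        s->next = c->free;
        c->free = s;
    }
    return 0;
}

/* O(1): pop a free slot, or NULL when the slab is exhausted. */
void *cache_alloc(cache_t *c) {
    slot_t *s = c->free;
    if (s != NULL) c->free = s->next;
    return s;
}

/* O(1): push the slot back; no size lookup and no merging needed. */
void cache_free(cache_t *c, void *obj) {
    assert(obj != NULL);
    slot_t *s = obj;
    s->next = c->free;
    c->free = s;
}

void cache_destroy(cache_t *c) {
    free(c->slab);
    c->slab = NULL;
    c->free = NULL;
}
```

A freed slot goes straight to the head of the list, so the next allocation of that type reuses it immediately.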

Underneath slab sits the buddy allocator, which hands out raw physical pages in power-of-two groups and which slab calls when a cache runs dry. That page-level machinery, along with zones, NUMA placement, and reclaim, belongs to the lower half of the kernel's memory stack; the point here is that slab gives kernel code the same "fast, fixed-size, no fragmentation" deal that bins and the tcache give userspace, specialised for the kernel's known set of objects.

The page is the unit

Step back and one fact runs through everything above: the kernel does not deal in bytes, it deals in pages, fixed-size chunks of 4 KB on most systems. Heap growth through the break is backed a page at a time, mmap hands out whole pages, slab caches are built from pages, and physical RAM is tracked one page frame at a time. Your malloc(40) is a fiction the allocator maintains on top of pages it already owns; the kernel only ever sees page-granularity requests.

This matters because the page is also the unit the hardware translates and protects. Each page has its own mapping from virtual to physical and its own permissions, and accessing a virtual page that has no physical backing triggers a page fault, which the kernel handles by mapping a page (or killing the process if the access was illegal). That mechanism is how demand paging, copy-on-write after fork, and memory-mapped files all work, and it is the bridge between the address-space picture on this page and the page-table machinery on the virtual memory page. Allocators care about pages because every byte they hand out is, somewhere underneath, a page that the kernel had to fault in and will eventually want back.

When it runs out: the OOM killer

Linux lets processes allocate more memory than physically exists, on the bet that most of it will never be touched at once. This is overcommit, and it is why a 1 GB allocation can succeed on a box with 512 MB free. The bet usually pays off, but when too many processes cash in their reservations at the same time and the kernel cannot reclaim enough — no clean file-cache pages to drop, no swap left to push to — it has run out of options. Rather than fail an allocation that has already been promised, the kernel invokes the OOM killer.

The OOM killer picks a process and sends it SIGKILL to free its memory at once. The choice is driven by an OOM score, which roughly tracks how much memory the process is using, biased by the administrator hint oom_score_adj (from -1000, never kill, to +1000, kill first). The kernel logs the victim and the scores in dmesg, which is the first place to look when a process vanishes for no reason it logged itself. In practice you steer this: push oom_score_adj up on disposable workers and down on the one process you cannot lose, and on containers let the cgroup memory limit trigger a scoped OOM kill before the whole machine is in trouble. If you would rather have allocations fail honestly than be killed later, vm.overcommit_memory=2 turns overcommit off and makes malloc return null when the commit limit is reached — at the cost of refusing allocations that would have been fine.
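A process can inspect and adjust its own hint through /proc. The sketch below assumes a Linux system with procfs mounted, and the function names are invented. Raising your own value is generally permitted, while lowering it normally needs extra privilege, so the write can fail.

```c
#include <assert.h>
#include <stdio.h>

#define OOM_ADJ_PATH "/proc/self/oom_score_adj"

/* Read this process's oom_score_adj. Returns 0 on success, -1 otherwise. */
int read_oom_score_adj(int *out) {
    assert(out != NULL);
    FILE *f = fopen(OOM_ADJ_PATH, "r");
    if (f == NULL) return -1;
    int ok = fscanf(f, "%d", out) == 1;
    fclose(f);
    return ok ? 0 : -1;
}

/* Set this process's oom_score_adj. Returns 0 on success, -1 if the value
   is out of range or the kernel refuses the write. */
int write_oom_score_adj(int value) {
    if (value < -1000 || value > 1000) return -1;
    FILE *f = fopen(OOM_ADJ_PATH, "w");
    if (f == NULL) return -1;
    int ok = fprintf(f, "%d\n", value) > 0;
    if (fclose(f) != 0) ok = 0;   /* a rejected write surfaces when the buffer is flushed */
    return ok ? 0 : -1;
}
```

A disposable worker might call write_oom_score_adj(500) at startup to volunteer itself as an early victim.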

The bugs that live here

Manual memory management is a small set of rules, and almost every memory bug is a violation of one of them. A leak is failing to free memory you no longer need: the block stays marked in-use forever, the heap grows without bound, and eventually the OOM killer or an allocation failure ends the program. Leaks are the most benign of the family because the memory is at least not corrupted; tools like Valgrind and LeakSanitizer (bundled with AddressSanitizer) find them by tracking every allocation against its frees.

Use-after-free is the dangerous one: you free a block, the allocator hands those same bytes to the next request, and then your stale pointer reads or writes them — now you are reading someone else's data, or scribbling on it. Because the allocator threads its free list through freed blocks, a use-after-free write can also clobber the allocator's own bookkeeping, turning a quiet bug into heap corruption that surfaces somewhere completely unrelated. This is the workhorse of memory-safety exploits: an attacker who controls what gets allocated into a freed block can steer a stale pointer into attacker-chosen data.

Double free is freeing the same pointer twice. The second free pushes a block onto a free list that already contains it, so the list now has a cycle or a duplicate, and a later allocation may hand the same block out to two callers at once. That corruption, again, shows up far from the actual bug. glibc detects some double frees and aborts with "double free or corruption," but it cannot catch all of them. Round out the list with buffer overflows, which write past the end of a block into the next block's header, corrupt its size and flag bits, and break the next coalesce, and you have the bulk of the CVE backlog in C and C++ software.
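One common defensive habit in C is to null a pointer in the same step that frees it. The macro below is a minimal sketch of that habit, with invented names. It only protects the one variable it is applied to, so any other copy of the pointer is still stale.

```c
#include <assert.h>
#include <stdlib.h>

/* Free the block and clear the variable in one step. A later free of the
   same variable becomes free(NULL), which the C standard defines as a no-op,
   and a later dereference faults at once instead of reading recycled memory. */
#define FREE_AND_NULL(p) do { free(p); (p) = NULL; } while (0)

/* Returns 1 if the pointer ended up NULL after being freed twice, -1 if the
   allocation failed. */
int double_free_guard_demo(void) {
    int *p = malloc(sizeof *p);
    if (p == NULL) return -1;
    *p = 7;
    FREE_AND_NULL(p);
    FREE_AND_NULL(p);    /* harmless: this is free(NULL) */
    assert(p == NULL);
    return p == NULL;
}
```

This turns two of the quieter failure modes into loud ones for that variable, but it is no substitute for the sanitizers or for a language that rules the bug out.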

The structural escape from this whole class of bug is to stop doing it by hand. Garbage-collected runtimes track liveness automatically so use-after-free and double-free cannot happen; the trade is pauses and overhead, and the mechanics of one such collector are in Go's garbage collection. Rust takes a different route, proving ownership and lifetimes at compile time so the same guarantees cost nothing at runtime. Both are reactions to exactly the failure modes above: the rules are simple, but humans break them constantly, and the bugs are expensive.

Practical takeaways

A few things follow directly from the picture. Allocated memory is not used memory until you write to it, so watch RSS, not VSZ, when you care about real pressure. Steady growth in a long-lived service is as likely to be fragmentation as a true leak, so before hunting for a missing free, try a different allocator or a lower arena count and see if the curve flattens. Large allocations get their own mmap and are returned cleanly, while small ones leave holes in the shared heap, so the size mix of your allocations changes how well memory is reclaimed. And when a process disappears without a trace, read dmesg first — the OOM killer leaves a receipt. The deeper you go, the more these allocators look like the same idea repeated at every level: hand out fixed-size pieces from a pool, recycle them fast, and only talk to the layer below when you have to.
