10 min read · Guide · Systems
How it works · Systems

How a program gets memory and gives it back.

Stacks grow down. Heaps grow up. Allocators sit in between, juggling fragmentation, locality, and contention. What actually happens when you call malloc.

Parts01–08 InteractiveHeap simulator PrereqPointers · processes

What is memory allocation?

A process is a map of regions.

Memory allocation is how programs request and release memory at runtime. The stack grows and shrinks with function calls (LIFO, fast); the heap handles dynamic allocations through malloc/free or a garbage collector. Production allocators — jemalloc, tcmalloc, mimalloc — manage size classes, free lists, and per-thread arenas to avoid fragmentation and contention.

When the OS launches a program, it hands it a virtual address space — typically 256 TB on 64-bit Linux. That space is partitioned: text (the executable code), data (initialized globals), bss (uninitialized globals), heap (dynamic, grows up), thread stacks (one per thread, grow down), shared libraries, and at the top, the kernel reserves space for itself.

None of these regions are physically contiguous. The page table maps virtual to physical pages on demand — see how virtual memory and the TLB work for the four-level walk every load actually performs. Reading /proc/self/maps shows the layout: every distinct entry is one VMA (virtual memory area), tracked by the kernel.

# /proc/self/maps for a small process
55c8df400000-55c8df403000 r--p 00000000 text (code)
55c8df403000-55c8df40c000 r-xp 00003000 text (more code)
55c8df621000-55c8df623000 rw-p 00021000 data
7f8d10000000-7f8d10100000 rw-p 00000000 heap
7f8d2c8a8000-7f8d2c8c0000 r-xp 00000000 /lib64/ld-linux ...
7ffd9f8b8000-7ffd9f8d9000 rw-p 00000000 [stack]
7ffd9f9d4000-7ffd9f9d8000 r--p 00000000 [vvar]
ffffffff80000000-ffffffffffffffff [kernel]

Stack vs heap: why one is fast and one is flexible

LIFO is fast. Random is not.

The stack is a LIFO region tied to a thread. Each call frame is pushed on entry and popped on return — bumping a register. Allocation is one instruction; deallocation is automatic. The trade is shape: stack memory is sized at compile time, and the lifetime is exactly the call. Recursion or huge locals will blow the stack.

The heap is a free pool with arbitrary lifetimes. You ask for memory, you get a pointer, you free it whenever. Allocation is bookkeeping — find a chunk that fits, split, return. Deallocation is more bookkeeping — coalesce neighbors. Threads contend on the same heap, so allocators add per-thread caches to dodge locks — the same idea behind a thread pool's per-worker queues.

Stack

Free for the price of a return.

One pointer bump. Cache-warm. Bounded — typical 8 MB. Lifetime tied to scope. No contention. Cannot escape across threads. C++ RAII; Go inlined values; almost everything in Rust without <Box>.

Heap

Lifetime, anywhere.

Allocator picks a slot, splits, returns. Thread-shared. Fragments over time. Subject to GC, refcounting, or manual free. Slow path can be 100s of cycles; fast path under thread cache is closer to 20.


How a heap fragments as you allocate and free

Allocate, free, watch the holes form.

Below: a 24-slot heap. Each row of "Alloc" finds a free run of contiguous slots and claims it. Free a block in the middle and now you have a hole — small allocations may use it, but a big request may have to skip it. That mismatch is fragmentation.

Allocated slots
0/ 24
Free runs
1
Active blocks
0
— all free —

How an allocator organises free memory by size

Separate free lists for small, medium, and large blocks.

Modern allocators (glibc's ptmalloc, jemalloc, tcmalloc, mimalloc) all do roughly the same thing differently. They keep separate free lists for different size classes — small (under ~16 KB), medium, and large. Small allocations come from per-thread arenas (no lock needed). Medium come from per-process pools with size-class buckets. Large go straight to mmap and bypass the heap entirely.

On free, the allocator tries to coalesce — if the neighbor is also free, merge them. This fights fragmentation but costs CPU. tcmalloc and jemalloc lean on size classes to skip coalescing in the hot path; they accept some internal fragmentation in exchange for speed.


Internal vs external fragmentation: two kinds of waste

Rounding waste inside a block versus unusable gaps between them.

External fragmentation is what the simulator above shows: enough total free memory exists, but no single contiguous run is large enough. Mitigations: power-of-two size classes, bump allocators, occasional compaction (in GC'd languages), or just plain mmap'ing a fresh region.

Internal fragmentation is rounding waste — you ask for 17 bytes, you get a 32-byte slot, and 15 bytes are unused. Smaller size classes reduce it but increase metadata cost. Most allocators target ~10% waste in steady state.


Arena and slab allocators: specialising when sizes are fixed

When you know the size, specialize.

If you allocate millions of one fixed size — kernel objects, network buffers, pooled structs — a general-purpose allocator wastes both space and time. The slab allocator (Bonwick, 1994; in Linux as the SLAB/SLUB subsystem) keeps caches of pre-initialized fixed-size objects per size class. Allocation: pop from a freelist. Free: push back. Constant time, no metadata, no fragmentation within the slab — close kin to a ring buffer for fixed-shape records.

Userspace equivalent: object pools. Game engines, network servers, and JIT compilers maintain pools of "things I will need many of" — connections (pooled the same way a Redis client multiplexes them), render commands, AST nodes. The cost saved is not just the allocator call but cache warmth and metadata.

jemalloc

Per-thread arenas cut contention

Defaults to many arenas; each thread sticky-sorts to one to dodge lock contention. The Facebook/Meta house allocator — fierce on long-running servers.

tcmalloc

Thread-local caches for small objects

Each thread keeps a free-list cache; spillover goes to a central heap. Google's house. Fast small allocs, big in C++ shops.

mimalloc

Page-based free lists

Microsoft Research. Very fast on the common path; segregated free lists per page; competitive with the others while being simpler to read.


Virtual vs physical memory: reserving is cheap, backing it is not

Virtual is cheap. Physical is not.

An mmap(64 GB) succeeds on a 4 GB machine. It just reserves virtual addresses; no physical pages are bound. The first time you write to a page, the CPU traps, the OS allocates a 4 KB physical page, updates the page table, and resumes you. This is a minor page fault — typically a few microseconds.

Major page faults are different: the OS has to read from disk (anonymous swap, or a backing file). They are 10,000× slower. RSS (resident set size) shows physical pages currently bound; VSZ shows virtual reservation. They differ by orders of magnitude in normal programs. On multi-socket machines, where physical pages get allocated also matters — see NUMA and memory bandwidth for why first-touch placement is one of the most common production-perf bugs.


The three classic memory bugs

Use-after-free, double-free, and leaks.

Use-after-free: free() returns memory to the allocator. If you keep the pointer and read it, you read whoever got the slot next. The allocator may even rewrite the bytes immediately (debug builds explicitly do — 0xdeadbeef, 0xfeedface).

Double free: passing the same pointer twice corrupts the free list. Modern allocators detect this with magic numbers and abort; older ones just silently mis-track.

Leak: alloc without matching free. Not always fatal — short-running programs can leak forever — but in long-running services, RSS climbs until OOM. AddressSanitizer (ASan) and Valgrind catch all three at the cost of 2–5× runtime.

Production allocators: jemalloc, tcmalloc, mimalloc, ptmalloc

Why every big shop ships a custom allocator.

ptmalloc / glibc malloc
The default on Linux. Has per-thread arenas, bins by size class, fastbins for small allocs. Decent baseline; loses to specialised allocators on multi-threaded workloads.
jemalloc · 2005, FreeBSD/Facebook
Per-thread caches with arena assignment. Battle-tested at Meta scale; default in Rust until 2018. ~10-30% faster than glibc on multi-threaded server workloads, lower fragmentation.
tcmalloc · Google
Thread-Caching malloc. Per-thread free lists with size classes; central heap when caches drain. The allocator behind every Google service. Open-sourced separately as part of Abseil.
mimalloc · Microsoft 2019
The latest allocator from Microsoft Research; benchmarks show ~10% faster than jemalloc/tcmalloc on common workloads with simpler internals (~3,500 LOC). Used in Azure components.
scudo · LLVM
Hardened allocator with built-in mitigations against use-after-free and buffer overflows. Default on recent Android. Trade ~10-15% performance for memory-safety bug detection.

Switching is easy. Most Linux apps can switch allocator just by linking libjemalloc.so via LD_PRELOAD. Many production teams do this for free 10-20% throughput gains on memory-heavy services. Redis, MariaDB, ClickHouse, and Cassandra all ship with jemalloc by default.


Memory bugs in production: leaks, fragmentation, double-free

Three failure modes, three diagnostic approaches.

Memory leaks. The classic: allocate, forget to free. Symptoms: process RSS grows monotonically until OOM kill. Diagnosis: heaptrack (Linux) tracks every allocation and shows what's still alive. Valgrind massif for offline analysis. jemalloc's built-in profiler with the MALLOC_CONF=prof:true env var.

Fragmentation. RSS doesn't shrink even though heap usage is low — the allocator can't return free memory to the OS because of holes. More common in long-running services. Diagnosis: jemalloc-stats shows active vs resident; a large gap signals fragmentation. Fix: switch to a compacting allocator (mimalloc), or accept periodic restarts.

Use-after-free, double-free, buffer overflow. The dangerous trio in C/C++. Modern compilers detect many at runtime via AddressSanitizer (ASan) — link with -fsanitize=address and the runtime catches each invalid access. ASan adds ~2× CPU overhead; useful in CI and staging, not production. Hardware MTE (Memory Tagging Extension, ARMv8.5+) does the same in production with ~5-10% overhead.

The Rust answer. Rust's ownership system catches all three at compile time. The cost is the borrow-checker learning curve; the win is C/C++-class performance with memory-safety guarantees. Why the kernel, browsers, and security-critical systems are slowly migrating.



A closing note

Memory is the cheap resource that most programs treat as free until it isn't. Locality, fragmentation, and lifetime are the three knobs that decide whether a workload is fast or sad. Pick the right allocator, and most of those knobs turn for you.

Found this useful?