Memory Allocator Simulator: malloc, free, and what fragments.

A 64-cell heap. Malloc with first-fit, best-fit, or buddy. Free in any order. Watch external fragmentation appear within a few cycles even when half the heap is "free." Switch to the buddy allocator and watch power-of-two splits coalesce on free. Below: a slab allocator for fixed-size kernel objects.

used
0
free
64
frag
0%

algo: size: 8
Heap (64 cells) · click a block to free
blocks: 1 (1 free) · largest free run: 64
Slab allocator — fixed-size object caches (the kernel side)
task_struct size 8
— no slabs allocated yet —
dentry size 4
— no slabs allocated yet —
kmalloc-16 size 16
— no slabs allocated yet —
Log
Heap initialised. One free block of 64 cells.

What you're looking at

The wide bar is a 64-cell heap. Coloured blocks are live allocations labelled with an id and size; pale blocks are free runs. Click a live block to free it. The size slider and malloc button carve a new block out using the chosen policy, and under buddy the request rounds up to the next power of two. The stats line tracks block count, free-block count, and the largest contiguous free run, while the meters up top show used, free, and a fragmentation percentage. The lower panel is a separate slab allocator handing fixed-size objects out of page-sized slabs.

Click "preset: fragmentation," then drag the slider to 16 and malloc. It fails even though the free meter still reads 32 cells. That gap between "free total" and "largest free run" is external fragmentation, and it is the moment worth sitting with. Then switch to buddy and free paired blocks: they coalesce back upward, but only when both halves of a pair are free, which is why buddy trades wasted bytes inside each block for clean merging.


Different workloads, different rules

The right allocator depends on what you're allocating and how often.

A program that allocates one buffer and never frees it has no allocator problem. A program that allocates a million tiny objects a second has the same problem the kernel has, and both end up at the same answer: maintain caches of frequently-used sizes, return memory to them on free, and only fall back to the underlying heap or buddy allocator when the cache is exhausted. This is why glibc's malloc, jemalloc, tcmalloc, and the kernel's slab subsystem all look structurally similar at the top — per-thread or per-CPU caches in front of a global free list.

First-fit and best-fit are the textbook free-list algorithms. First-fit (scan free list, take the first block that fits) is fast and fragments the head of the heap. Best-fit (scan all free blocks, pick the smallest that fits) wastes less space per allocation but is slower per call and tends to leave many tiny unusable fragments. Both are mostly historical; modern malloc implementations use size-class segregated free lists that combine the speed of first-fit with the locality of size matching.

Buddy rounds every request up to the nearest power of two and tracks free blocks by size class. Two free blocks at the same size whose addresses differ only in the size's bit ("buddies") can always be coalesced into a block of the next class up. This makes coalescing O(1) and bounds external fragmentation, at the cost of up to 50% internal fragmentation per allocation. The Linux kernel page allocator is a buddy allocator with orders 0 through 10 (4 KiB to 4 MiB), and slab caches sit on top of it to amortise the internal-fragmentation cost for fixed-size objects.


The kernel's task_struct never touches the buddy allocator

Or rather, it does once, then never again.

The kernel allocates a struct task_struct every time a process or thread is created — that's potentially thousands of times a second on a busy server. Each task_struct is the same size, around 12 KiB on x86_64. If the kernel went to the buddy allocator for each one, it would pay the buddy walk on every fork, get a 16 KiB block (next power of two), waste 4 KiB to internal fragmentation, and put pressure on the page-level free list.

The slab allocator (Bonwick 1994, then SLUB in modern Linux) sits on top of the buddy allocator and caches fixed-size objects. The first kmem_cache_alloc(task_cache) triggers a buddy allocation of, say, one 4 KiB page (a "slab"), carved into N slots of task_struct-sized objects. Subsequent allocations hand out slots from the slab — no buddy call, near-zero overhead. Free returns the slot. Only when a slab is entirely empty does it (sometimes, on memory pressure) get returned to the buddy pool.

Every common kernel object has a cache: task_struct, vm_area_struct, dentry, inode, plus generic kmalloc-NN caches at power-of-two sizes for arbitrary allocations. slabtop shows them. When people report "the kernel is using 8 GB of memory but ps shows only 2 GB of process RSS" the explanation is almost always slab caches — usually dentry or inode for a workload with lots of file access.


The user-space race to beat glibc

And why every database ships with its own malloc.

Glibc's malloc (ptmalloc, derived from Doug Lea's dlmalloc) is fine for most programs and bad for some. Its arenas — one per thread, falling back to a shared pool — protect against contention but can fragment badly under bursty allocation patterns. Long-running servers tend to grow their virtual address space indefinitely, even when their actual working set is small, because returning memory to the OS requires whole, fully-free pages and glibc's coalescing rarely produces them.

jemalloc (Evans, 2006) won the database race — Postgres uses it optionally, Cassandra uses it, Redis used it for years. It segregates allocations into size classes per-thread, returns memory to the OS aggressively with madvise(DONTNEED), and reports detailed stats. tcmalloc (Google) takes a similar approach with even more aggressive per-thread caching. mimalloc (Microsoft Research, 2019) goes further — segregated free lists, heartbeat-based coalescing, lock-free fast paths, all small enough to embed.

The pattern: every modern allocator is layered the same way — per-thread cache for hot small allocations, per-arena medium pool, large allocations go straight to the OS via mmap. The differences are in the constants and the eviction heuristics. Switching glibc malloc for jemalloc by setting LD_PRELOAD can reduce a long-running server's RSS by 20-40% without any code change. It's not free magic — the workload must be allocation-heavy for the swap to matter — but for the workloads it helps, it helps a lot.

Found this useful?