Why is a ring buffer faster than a linked-list queue?

Ring buffers store all elements contiguously in one array — sequential CPU cache access, no allocation per push, no pointer chasing. Linked lists allocate per node, scatter nodes across memory, and pay a cache miss per traversal step. For SPSC workloads, well-tuned ring buffers can do hundreds of millions of operations per second per core; linked-list queues struggle past tens of millions.

7 min read · Guide · Concurrency

How it works · Concurrency · Data structures

Ring buffers, a queue that never allocates.

A queue that doesn't allocate. A producer's pointer chases a consumer's pointer around a fixed array. Wait-free in the SPSC case. Used everywhere fast.

Parts01–07 InteractiveLive ring PrereqArrays / atomics

What is a ring buffer?

Two pointers, one array.

A ring buffer (circular buffer) is a fixed-size array used as a queue, with separate read and write indices that wrap around modulo the array length. Their key property: bounded memory and lock-free single-producer/single-consumer operation. LMAX's Disruptor (2010) brought ring buffers to mainstream attention; the Linux kernel uses them everywhere (network ring, io_uring, perf events).

A ring buffer is a fixed-size array with two indices: head (where to write next) and tail (where to read next). Producers advance head, consumers advance tail, both modulo the array size. When they meet, the buffer is empty (or full — there's a small bookkeeping detail to distinguish). It's a bounded queue.

That's the whole data structure. Allocate once, never touch the heap again. Constant per-operation cost. The catch: it's bounded. If producers run faster than consumers for too long, you have to either wait or drop. Production ring buffers always have a policy for that case.

How a ring buffer works: produce, consume, wrap around

Produce, consume, wrap around.

Below: a 16-slot ring laid out as a circle. Press Produce to advance head; Consume to advance tail. Watch the pointers chase each other; observe the wrap when head reaches the end and starts over at zero. (Compare with log-structured storage for the same shape on disk.)

0 / 16 slots used · head 0 tail 0

head — producer's next write tail — consumer's next read filled slot

Why ring buffers are always a power of two in size

A power-of-two size lets you mask instead of divide.

Computing i % N requires division — slow on most hardware. If N is a power of two, you can replace it with a bitwise AND i & (N - 1): instant, no division. Every fast ring buffer in the wild uses this trick, which is why ring buffer sizes are almost always 64, 1024, 65536 — never 100 or 1000.

The other reason for power-of-two sizes: pointer arithmetic stays cheap when wrapping. head + 1 on a 32-bit counter wraps naturally; mask with N-1 and you get the slot index. No conditional branches — and the same idea shows up in any decent hash table implementation.

Why single-producer, single-consumer needs no locks

One writer per pointer means no synchronisation at all.

With exactly one producer and one consumer, the ring buffer is wait-free without any synchronisation primitives. The producer only writes head; the consumer only writes tail. Each only reads the other's pointer — an atomic load on x86, free on most architectures.

This is why SPSC ring buffers underpin the fastest message passing in the world: kernel⇄userspace I/O queues (io_uring), audio callback ring buffers, GPU command queues, the LMAX Disruptor. One producer, one consumer, one fixed array. No locks, no allocations, no contention.

False sharing, and why the cache line is the enemy

Head and tail on the same cache line kills performance.

If the head and tail pointers sit in the same CPU cache line, the producer's write to head invalidates the consumer's cache line containing tail — even though they're different variables. The consumer's read suddenly costs a memory round trip. False sharing.

Production ring buffers pad the head and tail variables to separate cache lines (typically 64 bytes apart). Java uses @Contended; C/C++ uses alignas(64) or explicit padding. Skip this and the SPSC promise of "no contention" is a lie.

Multiple producers need CAS and sequence numbers

Each producer claims a slot with an atomic compare-and-swap.

Multiple producers writing to one ring need a way to claim a slot atomically. Compare-and-swap on a sequence counter does it: each producer atomically increments the next-write counter, gets back the slot index, writes there. Consumers do similar dance.

The LMAX Disruptor formalised this pattern in 2010. Its sequencer barrier protocol allows multi-producer-multi-consumer ring buffers to hit hundreds of millions of ops per second on a single box, much like a high-throughput Kafka partition. The pattern is dense; the code is short. Worth reading.

Where ring buffers show up in real systems

Anywhere low latency matters, you find a ring buffer.

io_uring

Kernel I/O.

Linux's modern async I/O: two ring buffers shared between userspace and kernel — submission queue and completion queue. Zero syscalls in the hot path.

audio

Real-time audio.

Audio callbacks have hard deadlines (typically 5–10 ms). Lock-free SPSC ring buffers feed samples from the application thread to the audio thread without ever pausing.

log4j2

Async logging.

A log call can't block on disk. Async appenders push log events into a ring; a single I/O thread drains the ring and writes to disk. The Disruptor pattern, applied to logging.

The LMAX Disruptor: a ring buffer doing 6 million orders a second

The case study every low-latency engineer reads.

In 2010, the LMAX Exchange published a paper describing a ring-buffer-based design that processed 6 million orders per second on a single thread with sub-millisecond latency. The "Disruptor" was their alternative to the standard Java BlockingQueue, which they had measured at ~5 million ops/sec and 10× the latency.

Three insights that made it fast:

Sequence numbers, not pointers: Every slot has a 64-bit sequence number. Producers and consumers track their position by integer; modulo (cheap) replaces the head/tail pointer arithmetic that requires synchronisation.
Mechanical sympathy: The ring buffer's slots are aligned and padded to one cache line each (64 bytes on x86). Producer and consumer cursors live in separate cache lines so they don't false-share. Memory layout drives microsecond-level performance.
One writer per cursor: A single producer writes (CAS-free), or multi-producers claim slots via CAS but only one writer per slot. This avoids the lock-free queue's typical retry-loop on contention.

The Disruptor pattern shipped in Apache log4j2's AsyncAppender (default since 2014), in algorithmic-trading systems, and in real-time-bidding ad servers. The "single-writer principle" — only one thread writes to any given memory location — became a widely-cited design heuristic for low-latency code.

io_uring: Linux async I/O is just two ring buffers

Zero syscalls in the hot path.

io_uring (Jens Axboe, Linux 5.1, 2019) is the modern async I/O interface that replaces epoll + AIO. It uses two ring buffers shared between the kernel and userspace: SQ (submission queue) and CQ (completion queue).

How the rings remove syscalls. An app writes a request into the next free SQ slot, increments the SQ tail (one atomic write). When the kernel's worker thread gets to it (no syscall — the kernel polls the SQ), it executes the I/O and writes the result into the CQ. The app reads from the CQ when convenient. Zero syscalls per I/O; the only cost is the atomic counter and the cache line bounce.

Real numbers from io_uring's introductory paper: on a 1M-IOPS NVMe device, epoll + AIO hit ~430k IOPS at ~30% CPU. io_uring hit ~1.7M IOPS at ~12% CPU. 4× the throughput at 40% the CPU — purely from removing syscall overhead via the ring-buffer ABI.

Where io_uring ships. cassandra 5.0, mysqld 8.0+, postgres 17 (io_method=io_uring as a 2024 experimental option), tigerbeetle (entirely built on it), and any modern Rust async runtime via the tokio-uring crate. The C library liburing is the standard interface; it hides the SQ/CQ pointer arithmetic behind ergonomic helpers.

A closing note

Ring buffers are the data structure that powers most low-latency software you have ever used. Pre-allocated, lock-free, cache-friendly, bounded. Once you see them, you see them everywhere — kernel APIs, audio drivers, exchange order books, network drivers. The whole design is in two pointers chasing each other around an array.