What is the difference between a queue and a stack?

Both are linear collections that add and remove from one or both ends. A queue is FIFO (first-in, first-out): items leave in the order they entered. A stack is LIFO (last-in, first-out): items leave in reverse order. Queues model fair processing (request queues, print spoolers); stacks model nested context (function calls, undo history, parser state).

Queue, Stack, Deque, Priority Queue Simulator: four orderings, one operation each.

These four structures take the same pushes and hand items back in different orders: a queue is FIFO, a stack is LIFO, a deque opens both ends, a priority queue serves the smallest key first. Four disciplines you'll see again and again, with one canonical use each.

Size

Discipline

Front → back

FRONT

— empty —

BACK

Recent ops

— quiet —

What you're looking at

The four buttons at the top pick a discipline; the control row underneath changes to match, so a queue gets Enqueue/Dequeue while a deque gets four buttons for both ends. The strip in the middle is the collection itself, drawn front-to-back with the endcaps telling you which end is which. Each item carries the number it was inserted with, and in priority mode also a random p. The log records every operation in order.

Insert five items in Queue mode, then dequeue: they come out 1, 2, 3, the order they arrived. Switch to Stack and pop instead — same insertions, but now they leave 5, 4, 3, in reverse. The data structure is identical at the cell level; only the exit rule changed. The one that should surprise you is Priority queue: insert a handful and watch extract-min pull the lowest p first, ignoring arrival order entirely, so a number you added last can leave first. That is the whole reason schedulers and Dijkstra reach for a heap instead of a plain queue.

Queues, stacks, deques, and priority queues — what's the difference?

Producer fast, consumer slow.

Queues, stacks, deques, and priority queues are four orderings of a sequence — FIFO, LIFO, double-ended, and ordered-by-priority — each defining what "the next item" means. The CPU call stack, the OS run queue, the network packet queue, the React work queue, the Kafka topic, and the Java ExecutorService all instantiate one of these four. Pick the right ordering for the problem and most concurrency code becomes obvious.

Imagine a tiny system: one component generates work, another component does the work. The producer fires off events at five thousand per second; the consumer chews through them at one thousand per second. Without anywhere to put the surplus, the producer must wait, the consumer is always working flat-out, and the moment the producer or any of its callers needs to be responsive — say, an HTTP server returning a 200 OK before the work is done — the system grinds to its slowest component's pace. What's missing is a place to set the work down.

That place is a queue. A queue is the simplest possible buffer between producer and consumer: items go in at one end, come out at the other, in the order they arrived. Producers don't wait for consumers; they just enqueue. Consumers don't ask producers for the next item; they just dequeue. The two run at their own speeds, and the queue's depth absorbs the difference. As long as the long-run average production rate doesn't exceed the consumption rate, the queue stays bounded; if it does, the queue grows without limit and something else has to give — usually backpressure, where the producer gets a signal to slow down or fail, or the queue drops items deliberately to stay healthy.

The mirror image of the queue is the stack: items come out in reverse order. Stacks are the right shape when the natural order of work is "do this thing, but pause first to handle that thing, and don't forget to come back". Every function call you've ever written uses a stack — the call stack, where each invocation pushes a frame and each return pops one. Recursive algorithms (depth-first search, expression parsing, undo histories) all work because the stack remembers exactly what to come back to. Sometimes you want LIFO; sometimes you want FIFO; the simulator above lets you push items to one and pop from the other to feel the difference.

Two anchoring numbers. A modern lock-free queue (Java's ConcurrentLinkedQueue, Disruptor's ring buffer, the LMAX architecture) handles tens of millions of enqueues per second per core; the dominant cost is a single compare-and-swap on the tail pointer plus the cache-line bounce when producer and consumer touch the same slot. A naive locked queue handles maybe one million per second on the same hardware, and contention pegs at the lock. The data structure looks identical at the abstract level; the implementation makes a 10–50× difference. The rest of this article walks through the formal definitions, the call-stack story behind every running program, the variants (deque, priority queue, double-ended), and the named queue systems (Kafka, RabbitMQ, SQS, Redis Streams, Disruptor) that hold most of the world's asynchronous work.

Origins — cellars, queues, and the abstract data type

Cellars, queues, and the abstract data type.

The stack as a named data structure entered computer science through Klaus Samelson and Friedrich Bauer at the Technische Hochschule München, who in 1957 patented what they called the Kellerprinzip (the “cellar principle”) for evaluating arithmetic expressions in the ALGOL compiler they were building. Their patent — granted in 1957 in Germany and published in 1959 — described a last-in-first-out store with two operations, conventionally rendered today as push and pop. The stack later acquired its English name from Alan Turing's earlier 1945 ACE proposal, which used the word “burying” for what was effectively the same operation, but the Samelson–Bauer construction is the one production compilers eventually inherited.

Bauer's contemporaries Edsger Dijkstra and Heinz Rutishauser independently arrived at similar ideas, and the LIFO discipline became canonical when Dijkstra gave his 1961 paper Algol 60 Translation at IFIP, formalising the role of an operator stack and an operand stack in expression evaluation — the algorithm later popularised as the shunting-yard. Reverse Polish notation, the postfix arithmetic that drops out of stack evaluation naturally, had been described decades earlier by the Polish logician Jan Łukasiewicz in 1924 in his work on propositional calculus. RPN reached mass-market hardware through Hewlett-Packard's 1968 HP 9100A and the legendary 1972 HP 35 calculator, which evaluated postfix using a four-deep operand stack visible to the user as registers X, Y, Z, T.

The queue entered computing from a different lineage entirely — operations research and telephony. Agner Krarup Erlang, working for the Copenhagen Telephone Company, published The Theory of Probabilities and Telephone Conversations in 1909, modelling waiting calls as a Poisson arrival process and the switchboard as a finite server. The 1917 follow-up gave the formulas every queueing-theory textbook still teaches: Erlang B for blocked-calls-cleared, Erlang C for blocked-calls-waiting. By the time stored-program computers needed to share their CPUs across multiple processes, the formal apparatus was already there. The unit name erlang, used today to measure offered traffic intensity, is named for him.

Both shapes — LIFO stack and FIFO queue — are abstract data types: defined by their operations, not their representation. The 1974 textbook by Aho, Hopcroft, and Ullman (The Design and Analysis of Computer Algorithms) made the ADT the unit of algorithmic discussion, and from that point onward you can read any operating-systems paper, any compiler text, any concurrency library and find both shapes appearing exactly where the access pattern demands.

Stacks — every function call is a push

Every function call is a stack push.

Open a debugger, hit a breakpoint, look at the call stack: every active function call is a stack frame — a contiguous chunk of memory holding the function's parameters, local variables, the return address that says where to resume in the caller, and saved registers. Calls push frames; returns pop them. The stack pointer (rsp on x86-64, sp on ARM) names the address of the topmost frame and decrements as the stack grows downward. The hardware doesn't enforce LIFO — you can add rsp, 16 and effectively pop two slots at once — but the discipline is honoured by every calling convention.

Frame size is not free. On Linux the default per-thread stack is 8 MB (controlled by ulimit -s and the RLIMIT_STACK rlimit). The main thread inherits this; pthreads default to 8 MB on glibc but only 2 MB on musl. Go goroutines start with an 8 KB stack that grows on demand, copying the live data into a larger segment when needed — the cost is paid only by goroutines that actually recurse deeply. CPython's recursion limit is 1000 frames by default (sys.getrecursionlimit()), set deliberately low because each Python frame is hundreds of bytes and the C stack runs out long before the heap does. V8's typical limit is around 10000–15000 calls. JVM threads default to a 512 KB stack on most platforms.

Recursion is the most direct consumer of stack space. A naive recursive descent over a million-element tree pushes a million frames; a single-threaded program with an 8 MB stack will SIGSEGV well before that. Tail-call elimination — rewriting return f(x) as a jump rather than a call — would solve this, and Scheme has required it since R5RS in 1998, but mainstream production runtimes (V8, CPython, OpenJDK) do not implement it. The workaround is to convert the recursion to iteration with an explicit stack on the heap: a std::vector or a Python list used as a LIFO. The shape of the algorithm is identical; only the storage moves from the C stack to the malloc heap, where you control the size.

The dual relationship is just as direct: depth-first search uses a stack (explicit or via recursion), breadth-first search uses a queue. Replace one with the other and the traversal order flips. Dijkstra's shortest-path algorithm uses a priority queue; A* uses a priority queue keyed on f-score; Prim's minimum-spanning-tree uses a priority queue. The choice of ADT is the algorithm.

# CPython interpreter loop, simplified
# Every CALL_FUNCTION opcode pushes a new PyFrameObject:
struct _frame {
 PyObject_VAR_HEAD
 struct _frame *f_back; // previous frame (i.e. caller)
 PyCodeObject *f_code; // bytecode of this function
 PyObject **f_valuestack; // value stack (operand stack)
 // ...
};
# Default recursion limit: sys.getrecursionlimit() == 1000.
# Each frame is ~520 bytes on CPython 3.11; the C stack
# fills long before that limit is reached.

Four orderings — FIFO, LIFO, priority, deque

Four orderings, and how to pay for them.

The textbook stack is an array with a single index. push writes at index n and increments; pop decrements and reads. Both are O(1) worst-case if you have room; if the array fills you double its capacity, copying the contents (O(n) work amortised across the doublings to O(1) per push). C++'s std::stack is a thin adapter around std::deque; Java's java.util.ArrayDeque is the modern replacement for the deprecated java.util.Stack (which extended Vector and synchronised every operation pointlessly).

The textbook queue is harder to make O(1). A linked list with head and tail pointers does it: enqueue at tail, dequeue at head, both O(1) per call but with a malloc per enqueue and a cache miss per dequeue. The ring buffer (or circular buffer) is the production answer: a fixed-capacity array with two indices, head and tail, that wrap modulo capacity. Linux's kernel kfifo, FreeBSD's mbufq, and the network-card receive rings on every NIC since the 1980s use exactly this. Power-of-two capacity lets the modulo be a bitwise AND; the entire fast path is one ALU op plus one memory access. The two-stacks-make-a-queue construction (push on stack A; when popping, drain A into B and pop from B) is amortised O(1) per operation but has worst-case O(n) for individual dequeues — a perfectly valid queue for batch workloads, dangerous for low-latency.

The deque (double-ended queue, pronounced “deck”) supports push/pop at both ends. C++'s std::deque is implemented as a vector of fixed-size blocks; Python's collections.deque uses a doubly-linked list of 64-element blocks. Both give O(1) at either end and O(n) random access. The work-stealing schedulers in the Cilk runtime, Intel TBB, Rust's Rayon, and Java's ForkJoinPool all use deques: the owner pushes and pops at one end (LIFO, for cache locality) while thieves steal from the other end (FIFO, for fairness). Chase–Lev's 2005 lock-free deque (Dynamic Circular Work-Stealing Deque, SPAA) is the canonical implementation.

The priority queue drops the FIFO/LIFO discipline entirely; the next element out is the one with the highest priority, regardless of insertion order. The standard implementation is a binary heap, a complete binary tree laid out in an array such that the parent of index i is (i-1)/2 and the children are 2i+1 and 2i+2. Insert and extract-min are both O(log n). Python's heapq, C++'s std::priority_queue, Java's java.util.PriorityQueue all wrap a binary heap. The Fibonacci heap of Fredman and Tarjan (1987) gives O(1) decrease-key amortised — relevant for Dijkstra at scale — but the constant factors are bad enough that production graph engines almost always pick a binary heap or a pairing heap instead.

Discipline	Order	Push / pop	Peek	Backing
Stack	LIFO	O(1) amortised	top, O(1)	array, doubling
Queue	FIFO	O(1) amortised	front, O(1)	linked list / ring
Deque	both ends	O(1) per end	front + back, O(1)	block list / ring
Priority queue	by key	O(log n)	min/max, O(1)	binary heap
Ring buffer	FIFO, bounded	O(1) worst-case	head, O(1)	fixed array, mod cap

# Power-of-two ring buffer, the kfifo pattern
struct kfifo {
 unsigned char *buffer;
 unsigned int size; // must be a power of two
 unsigned int in;  // next write index
 unsigned int out; // next read index
};
# enqueue:
buf[in &amp; (size - 1)] = byte;
in++;
# dequeue:
byte = buf[out &amp; (size - 1)];
out++;
# Lock-free for single producer, single consumer
# (memory barriers around in/out updates).

Where queues and stacks live — schedulers, brokers, runtimes

Schedulers, brokers, and the runtimes you ship on.

The Linux Completely Fair Scheduler — the kernel's process scheduler from 2.6.23 (October 2007) until it was replaced by EEVDF in 6.6 (October 2023) — kept its runqueue as a red-black tree keyed on virtual runtime, with O(log n) insert and pick-leftmost. Earlier Linux schedulers had used circular doubly-linked queues per priority. The runqueue is the single most performance-critical data structure in any general-purpose kernel; its operations happen on every context switch, of which there are millions per second on a busy server.

Application-level message brokers are queues at the architectural level. Apache Kafka (LinkedIn, 2011) is fundamentally a partitioned, replicated, append-only log indexed by offset; consumers pull and track their own offsets. RabbitMQ (released 2007, originally an Erlang implementation of AMQP 0-9-1) uses Erlang's mailbox model, with each queue an independent process owning its messages. Amazon SQS launched in 2006 as the first production AWS service, billed at less than a millicent per request; standard queues guarantee at-least-once delivery and best-effort ordering, while FIFO queues guarantee exactly-once processing within a 5-minute deduplication window at 300 transactions per second per message group (3000 with batching).

Inside the application runtime, queues are everywhere. The Go runtime's scheduler keeps a per-P (processor) local runqueue of 256 goroutines plus a global runqueue; when a P empties its local queue it steals half the goroutines from a random peer. Tokio's runtime::scheduler::multi_thread uses bounded MPMC channels for each worker, plus a global injector queue. Java's java.util.concurrent.LinkedBlockingQueue — the default backing store for ThreadPoolExecutor — is a singly-linked list with separate head and tail locks (Doug Lea, JSR-166), achieving high concurrency by allowing one producer and one consumer to make progress simultaneously.

Network stacks are queues all the way down. A TCP socket has a send buffer (a queue of bytes the kernel will transmit) and a receive buffer (bytes the kernel has accepted but the application hasn't read). NICs have multi-queue receive rings — modern Mellanox ConnectX-7 cards expose up to 256 RX queues per port, each independently steered to a CPU core via RSS hashing. The DPDK and Linux io_uring abstractions both surface these rings as user-space data structures, eliminating the syscall-per-packet cost. Every layer of the OSI model has a queue at its boundary; the queues are the system.

Bounded by construction

Production rule: pick a bounded queue. Java's LinkedBlockingQueue() default constructor is unbounded — a footgun masquerading as a default. ArrayBlockingQueue(cap) and the kernel's kfifo are bounded by construction. Backpressure is then a policy choice (drop, block, fail-fast), not a fire to fight at 3 a.m.

Concurrent queues — CAS, sequence numbers, cache lines

CAS, sequence numbers, and the cost of a cache line.

A single-threaded queue is trivial. A queue shared across cores is a small research field. The hardware primitive every production lock-free queue is built on is the compare-and-swap (CAS) instruction — cmpxchg on x86-64, ldxr/stxr on ARM. CAS atomically reads a memory location, compares the value to an expected value, and writes a new value only if the comparison succeeds. With CAS in hand, the textbook lock-free FIFO is the Michael–Scott queue (Maged Michael & Michael Scott, PODC 1996), which uses a sentinel head node and CAS on both head and tail pointers. Java's ConcurrentLinkedQueue and .NET's ConcurrentQueue are implementations of this paper, three decades after publication.

The Treiber stack (R. Kent Treiber, IBM 1986) is the analogous lock-free LIFO: push allocates a node and CAS-installs it as the new head; pop reads head, computes head.next, and CAS-replaces head with next. Both algorithms suffer from the ABA problem: a pointer might be freed and recycled to the same address between a thread reading it and the thread completing its CAS, making the CAS succeed when it shouldn't. The standard fix is a tagged pointer (32-bit pointer plus 32-bit version counter packed into a 64-bit word, or 128-bit CAS via cmpxchg16b) or hazard pointers (Maged Michael, 2004). Java sidesteps the issue with garbage collection — an object can't be recycled while a reference exists.

Dmitry Vyukov's MPMC bounded queue (2011) and SPSC queue are the canonical practical designs, used in Boost, Rust's crossbeam, and many trading-system codebases. The MPMC variant uses a sequence number per slot, eliminating the ABA problem and avoiding pointer chasing. Cliff Click's lock-free hash map (Azul Systems, 2007) introduced the related technique of versioned slots for hash-table updates.

The LMAX Disruptor, built by Martin Thompson, Mike Barker, Dave Farley and Trisha Gee at LMAX between 2009 and 2011 and open-sourced as Apache 2.0 in 2011, takes a different approach: a single ring buffer with monotonic sequence numbers, padded so that producer cursor and consumer cursor live on different cache lines, eliminating false sharing. On a single core LMAX measured 25 million ops per second on 2010-era hardware — roughly an order of magnitude faster than ArrayBlockingQueue at the time. The Disruptor pattern has since shown up in Aeron, parts of Apache Storm, and the Solace messaging fabric. Click's Mechanical Sympathy talks (2010–2012) made the lessons mainstream: data-structure performance is dominated by what the cache coherence protocol forces between cores, not by what the algorithm computes.

// Vyukov MPMC bounded queue (sketch)
struct Slot { atomic&lt;size_t&gt; seq; T data; };
Slot ring[CAPACITY]; // CAPACITY a power of two
atomic&lt;size_t&gt; enq, deq;

bool try_enq(T x) {
 size_t pos = enq.load();
 for (;;) {
 Slot &amp;s = ring[pos &amp; (CAPACITY-1)];
 size_t seq = s.seq.load();
 intptr_t diff = (intptr_t)seq - (intptr_t)pos;
 if (diff == 0 &amp;&amp; enq.compare_exchange_weak(pos, pos+1)) {
 s.data = x;
 s.seq.store(pos + 1); // mark slot as full
 return true;
 }
 if (diff &lt; 0) return false; // queue full
 pos = enq.load();
 }
}

Unbounded growth — the most common queue bug

Unbounded growth is the most common bug.

Most queue-related production incidents reduce to one of three problems: the queue grew without bound, the consumer fell behind without anyone noticing, or two threads contended for the same slot more than the design assumed. The first is fixed by always preferring bounded queues over unbounded ones. Java's LinkedBlockingQueue() default constructor builds an unbounded queue — a footgun. Use the LinkedBlockingQueue(int capacity) overload, or ArrayBlockingQueue, which is bounded by construction.

The second is the backpressure question, and the answer depends on what should happen when the queue is full: drop new messages (suitable for telemetry), drop old messages (suitable for live dashboards), block the producer (suitable for batch jobs that should slow down rather than fail), or fail-fast with an error (suitable for synchronous request handlers where the client can retry). Reactive Streams (the JVM-side specification authored by Lightbend, Netflix, Central, RedHat, Twitter and others in 2015) standardised demand-driven backpressure: the consumer signals how many items it's prepared to receive, and the producer never pushes more. Akka Streams, RxJava, Project Reactor, and Java's java.util.concurrent.Flow all implement this contract.

The third — contention — manifests as throughput plateauing or even regressing as you add cores. The diagnostic is to look at L1 cache miss rates and at the perf c2c output, which tells you which cache lines are bouncing between cores. Mitigations include sharding (one queue per producer–consumer pair), padding (placing different fields on different cache lines so unrelated writes don't invalidate each other), and replacing CAS loops with batched operations (the LMAX claim-then-publish pattern reserves a range of slots in one atomic increment).

A specific category worth naming: the head-of-line blocking hazard. A FIFO queue with one slow consumer makes everyone wait. HTTP/1.1 had this problem at the browser level — one slow image blocking subsequent requests on the same connection — which HTTP/2 (RFC 7540, May 2015) tried to fix with multiplexed streams, only to expose it again at the TCP layer when a single dropped packet stalled all streams sharing the connection. HTTP/3 (RFC 9114, June 2022) finally eliminates head-of-line blocking by moving the transport to QUIC's per-stream reliability. The lesson generalises: a single shared queue between producers of unequal speeds is fragile; per-flow queues with fair scheduling (weighted fair queueing, deficit round robin) are the production answer.

Immutable stacks — and the cost of equality

Immutable stacks and the cost of equality.

The Lisp cons cell — a pair of (head, tail) where the tail is itself a list — is the original purely functional stack. Every prepend creates a new cell pointing at the existing list; the existing list is unchanged. Because nothing is mutated, multiple stacks can share their tails — the structure is persistent in the sense of Driscoll, Sarnak, Sleator, and Tarjan's 1989 paper Making Data Structures Persistent. Push is O(1); pop is O(1); peek is O(1). And because the tail is shared, stacks that branch — useful for backtracking searches, undo histories, and version control — cost only one cell per branch point rather than copying the entire history.

Persistent queues are harder. A naïve immutable queue using two cons-list stacks (one for incoming, one for outgoing, drained on demand) gives amortised O(1) per operation but breaks under persistence: if you keep two references to the same queue and dequeue from each, you pay the drain cost twice. Chris Okasaki's 1996 PhD thesis (and 1998 Cambridge University Press book Purely Functional Data Structures) solved this with scheduled queues that incrementalise the drain so worst-case per operation is O(1), not just amortised. The same book gives O(log n) persistent priority queues and persistent random-access lists. Clojure's persistent data structures (Rich Hickey, 2007) use Phil Bagwell's hash array-mapped tries for O(log₃₂ n) effective constant time on vectors and maps; Scala, F#, and Elm all ship variations on the same theme.

Persistence is not free. The pointer chasing inherent in cons-list stacks costs roughly 100× what an array stack costs in cache misses, and a generational garbage collector has to track which cells survive. But for problem domains where structural sharing matters more than constant-factor speed — compiler intermediate representations, transactional memory snapshots, time-travel debuggers, distributed system state machines — persistent stacks and queues are the right primitive. The Git object database is, viewed sideways, a persistent purely-functional tree of trees; every commit shares all its unchanged subtrees with its parent.

The choice between mutable and persistent ADTs maps almost exactly onto the choice between imperative and functional style. Mutable for the hot path of a network stack, where every byte matters. Persistent for the slow path of a transactional system, where correctness across snapshots matters more.