What is the fastest sorting algorithm?

There's no single answer. For comparison-based sorting on average input: pdqsort (Rust default), introsort (C++ std::sort), Timsort (Python, Java) — all hybrids that pick a strategy per partition. For nearly-sorted input: insertion sort or Timsort wins. For known small ranges: counting sort or radix sort can beat n log n. The right answer depends on input characteristics, language, and whether stability matters.

Why does Python use Timsort?

Timsort (Tim Peters, 2002) is a hybrid merge/insertion sort that exploits naturally occurring runs in real-world data. It's stable, adaptive (O(n) on already-sorted input), and benchmarks well across diverse workloads. Python adopted it in 2.3; Java in 7; many languages have followed. The name is sticky enough that the algorithm is now usually identified by its inventor.

Sorting Algorithm Visualizer: four sorts, same array.

A sorting visualizer animates how an algorithm rearranges an array, counting every comparison and swap so the cost shows up as motion. Run bubble, quick, merge, and heap on the same data and the Big-O gap turns physical. Bubble crawls; quick races; merge moves in waves.

Compares

Swaps

Algorithm

Size

Array

What you're looking at

Each bar is one array element, its height the value, left to right in current array order. The two highlighted bars mark the pair the algorithm is comparing or moving right now, so the animation is a live trace of the chosen sort working through the data. The meta row counts comparisons and swaps as they happen — the concrete cost behind the big-O label on each algorithm button.

Run Bubble at size 96, then run Quick on the same size and watch the comparison counter. Bubble grinds out thousands of comparisons while the order barely changes near the start; Quick finishes in a fraction of that. The shapes differ too: bubble and insertion creep one position at a time, quick flings elements across the array as pivots partition, and merge resolves in clean blocks from small runs upward. What should surprise you is how violently the n-squared and n-log-n counts diverge once the array gets large — the same visible bars, wildly different work.

What is a sorting algorithm?

Why a million names can't be sorted by hand.

A sorting algorithm arranges items in a defined order. The classical comparison-based algorithms — bubble, insertion, merge, quick, heap — share an Ω(n log n) lower bound; non-comparison sorts (counting, radix, bucket) can beat it for restricted inputs. Donald Knuth's The Art of Computer Programming, Volume 3 (1973) is still the canonical reference. Today's standard libraries ship hybrids: Timsort (Python, Java), pdqsort (Rust), introsort (C++).

Imagine you have a stack of a thousand index cards, each with a name. Your job is to put them in alphabetical order. The natural way to do it: pick the smallest, put it down; pick the smallest of the rest, put it down on top; repeat. With a thousand cards you make a thousand passes and look at every remaining card on every pass. That is roughly half a million comparisons. Annoying, but doable in an afternoon.

Now scale up. A million names. The same algorithm performs five hundred billion comparisons. At a billion comparisons per second on a modern laptop that would take eight minutes; on the constraints of an old machine, hours. A billion names? Five years. The strategy "pick the smallest, put it down" is mathematically correct but it does not scale. The number of operations grows with the square of the input size — what computer scientists write as O(n²) — and at any input size larger than a few thousand it stops being usable.

This is the central problem in sorting: not whether you can sort the data but how the cost grows as the data does. The sorting algorithms in this simulator are different answers to that question. Bubble and insertion sort are the natural-instinct answers — pick adjacent items, swap them if they are out of order — and they are O(n²). Merge sort and quicksort use a divide-and-conquer trick: split the array in half, sort each half, combine. The trick costs O(n log n), which means a million names take twenty million operations instead of five hundred billion — a twenty-five-thousand-fold speedup. On a laptop, milliseconds instead of minutes.

The simulator above lets you watch all four algorithms on the same array. Bubble crawls — its bars shuffle one position at a time and progress is visible across the whole row. Insertion is similar but a little smarter; an already-sorted prefix grows on the left. Quick races — pivots fly, large chunks rearrange in single moves. Merge moves in waves — sub-arrays merge from the leaves of a recursion tree upward. The shape of each algorithm is its own.

The deeper question — and the one that produces the lower bound on every comparison-based sort — is whether you can do better than O(n log n). The answer is no, except by escaping the comparison model entirely. The next parts work through both halves of that story.

Origins — punched-card mergers to Knuth's third volume

From punched-card mergers to Knuth's third volume.

Sorting predates the digital computer. Hollerith's tabulating machines for the 1890 US census used a sorting box that read punched cards and routed them into bins by hand-set criteria, and the punched-card sort-merge was a working algorithm before the term "algorithm" had its modern meaning. By the time the first stored-program machines arrived in the late 1940s, sort-merge was the dominant batch-processing primitive; large business workloads were dominated by sorting-the-input and merging-the-output. The motivating problem was tape-based: data lived on magnetic tape and could only be read sequentially, so any sort had to be expressible as a sequence of forward passes plus merges of intermediate sorted runs.

The internal algorithms appeared one by one through the 1950s and 1960s. Insertion sort and selection sort were folklore — the obvious things to do — and Friend's 1956 paper formalised them. Merge sort was attributed to von Neumann around 1945 in unpublished notes, with the first formal write-up in Goldstine and von Neumann's report on sorting on the EDVAC. Quicksort was invented by Tony Hoare in 1959 while at the University of Moscow; he was building a Russian-English translation system that needed to look up dictionary entries quickly, and he wanted a sort that ran fast on small machines. Heapsort arrived in 1964 in Williams's paper introducing the binary heap.

Knuth's The Art of Computer Programming, Volume 3: Sorting and Searching, published in 1973, became the canonical reference. The volume catalogued every then-known sort with detailed analyses of comparison counts, swap counts, memory usage, and behaviour on tape and disk-based storage. Knuth's contribution was less the algorithms — almost all of which existed before he began writing — than the framework for comparing them and the careful empirical analyses on the realistic memory hierarchies of the day. The framework still applies; the constant factors have changed by orders of magnitude, but the asymptotic landscape Knuth mapped is essentially the modern one.

The next generation of work focused on engineering rather than asymptotic. Sedgewick's 1977 PhD thesis on Quicksort produced a tuned implementation with median-of-three pivot selection, a small-array cutoff to insertion sort, and careful attention to the constant factors that made the difference between a teaching example and a library implementation. Bentley and McIlroy's 1993 paper Engineering a Sort Function documented the design of the C standard library's qsort and remains the best short essay on the difference between a textbook algorithm and a deployable one. Both pieces are still required reading for anyone working on a serious sort.

Comparison-sort lower bound — n log n is a hard floor

No comparison sort can be faster than n log n.

The decision-tree argument is one of the cleanest results in algorithm theory. Treat any comparison-based sorting algorithm as a binary tree: each internal node is a single comparison a < b, the left and right children correspond to the two possible answers, and each leaf records the permutation the algorithm has decided the input was. Because the algorithm has to be correct on every input, every one of the n! permutations must appear as some leaf. A binary tree with n! leaves has height at least log₂(n!), which by Stirling's approximation is roughly n log n. Therefore any comparison sort, on its worst-case input, performs at least that many comparisons.

The bound applies whether the algorithm is randomised or deterministic. Randomised sorts can avoid the specific bad inputs that hurt deterministic ones — quicksort with a random pivot has expected O(n log n) behaviour over the random choices regardless of the input — but the bound holds in expectation as well. The bound is also tight: mergesort and heapsort achieve O(n log n) in the worst case, so the lower bound and the upper bound meet, and there is no further asymptotic progress to be had on comparison sorts. Half a century of research has produced better constants and better cache behaviour, but not a faster comparison sort.

The escape route is to stop comparing. If the keys have structure — if they are integers in a known range, or strings of bounded length, or any other shape that gives the algorithm more information than just a comparison oracle — sublinear-comparison sorts become possible. Counting sort runs in O(n + k) for keys drawn from a range of size k; the algorithm allocates one bucket per possible key, walks the input incrementing buckets, and walks the buckets emitting keys. Radix sort generalises this to multi-digit keys, sorting by least-significant digit first using a stable sub-sort and iterating through the digits. Bucket sort assumes the input is uniformly distributed and partitions into buckets that are sorted independently, achieving expected linear time on the right inputs.

Linear-time sorts are not free. They require additional memory proportional to the key range, they only beat n log n when the per-key digit count is small relative to log n, and they often have terrible cache behaviour because the bucket array is large and randomly accessed. The standard library defaults to a comparison sort precisely because the comparison sort works for arbitrary key types; the linear-time sorts are domain-specific weapons that show up where the keys cooperate. GPU radix sort in CUDA Thrust, Intel's TBB parallel sort, and the Intel one-API library's distribution sort all use radix or bucket variants because the workloads they target — fixed-width numeric keys at very high throughput — fit the assumptions.

Stability and adaptivity — two properties that change which sort to pick

Two properties that change which sort to pick.

A sort is stable if elements with equal keys appear in the output in the same relative order they had in the input. Stability is invisible when keys are unique but matters whenever the secondary order is meaningful. Sorting a list of orders by date, then by customer, requires a stable sort on the second pass to preserve the date ordering within each customer group; an unstable sort will reshuffle the ties and lose the date information. The classical examples of stable sorts are insertion sort, merge sort, and bubble sort; quicksort and heapsort are unstable in their textbook forms.

The cost of stability is real. A stable in-place sort is harder to implement than an unstable in-place sort, because the obvious in-place partitioning swap reorders equal-key elements arbitrarily. Several stable in-place algorithms exist — Pratt's algorithm, block sort, the merging routine in Wikisort — but they have significantly worse constants than the unstable in-place sorts they compete with. Most production stable sorts therefore use auxiliary memory; merge sort with a temporary buffer is the canonical example, and Timsort is its most refined descendant.

A sort is in-place if it uses only constant additional memory beyond the input array (or, looser, O(log n) for the recursion stack). In-place matters when the input is large enough that a second copy is unaffordable, when memory bandwidth is the bottleneck, or when the data lives in a memory region that cannot easily be allocated alongside. Quicksort is in-place in this looser sense; mergesort is not, requiring O(n) auxiliary memory; heapsort is strictly in-place, with O(1) additional memory. The choice between mergesort and quicksort in a standard library often comes down to whether the language wants a stable sort (mergesort wins) or an in-place sort (quicksort wins).

A third axis is whether the algorithm is adaptive — whether it runs faster on partially-ordered input than on random input. Insertion sort is adaptive: O(n) on already-sorted data, O(n²) on reverse-sorted data. Mergesort is not adaptive in its textbook form, but Timsort exploits "natural runs" of monotonic sub-sequences to give adaptivity at the merge level. Real workloads are often partially sorted — log records by time, version numbers, file paths — so adaptivity translates directly into measurable wins. The price is a slightly more complex algorithm; the benefit is significant for the workloads that hit it.

Algorithm	Avg / worst	Memory	Stable	Adaptive
Bubble	n² / n²	O(1)	yes	yes
Insertion	n² / n²	O(1)	yes	yes — O(n) on sorted
Merge	n log n / n log n	O(n)	yes	no (textbook)
Quick	n log n / n²	O(log n) stack	no	no
Heap	n log n / n log n	O(1)	no	no
Timsort	n log n / n log n	O(n)	yes	yes — natural runs
Radix	d·n / d·n	O(n + k)	yes	no — non-comparison

Hybrid sorts in production — Timsort, pdqsort, introsort

Timsort, pdqsort, introsort.

Timsort was designed by Tim Peters in 2002 for the CPython runtime. The algorithm is a hybrid of mergesort and insertion sort that explicitly looks for "natural runs" — already-sorted ascending or descending sub-sequences — in the input. Each run is identified with a single linear scan; very short runs are extended with insertion sort to a configurable minimum size; the runs are then merged using a stack-based merging policy that maintains an invariant on run lengths to bound the merge depth. Timsort is the default sort in Python's list.sort and sorted, in Java's Arrays.sort on object arrays since Java 7, and in V8 (the JavaScript engine in Chrome and Node.js) since 2018. The original design notes in listsort.txt in the CPython source remain the most readable description.

Pdqsort — pattern-defeating quicksort — was introduced by Orson Peters in 2014 and is a tuned quicksort with several enhancements. It uses median-of-three or ninther pivot selection to avoid the worst-case O(n²) behaviour on near-sorted inputs; it falls back to heapsort if the recursion depth grows beyond a threshold (an introsort-style fallback); and it uses block partitioning to keep the partition step cache-friendly and branch-prediction-friendly. Pdqsort is the default unstable sort in Rust's slice::sort_unstable, in Go's sort package since Go 1.19, and in several C++ libraries.

Introsort, by David Musser in 1997, is the algorithmic ancestor of pdqsort and remains the default in C++'s std::sort. Introsort begins as quicksort, monitors the recursion depth, and switches to heapsort if the depth exceeds a logarithmic bound. The switch guarantees worst-case O(n log n) behaviour while preserving quicksort's constant factor on the common case. Most modern std::sort implementations layer additional optimisations on top — small-array cutoff to insertion sort, branchless partition routines — but the introsort backbone remains.

Beyond these three, several specialised sorts show up in production. Spreadsort hybridises radix and quicksort for integer and string keys. Ska_sort, by Malte Skarupke, uses American flag sort and beats std::sort by a substantial factor on random integer input. The parallel sort in Intel's TBB and oneAPI uses parallel mergesort with work-stealing; the GPU radix sort in CUDA Thrust hits hundreds of gigabytes per second on the right hardware. The choice depends on the data shape, the memory budget, the parallel hardware available, and the time the team has to integrate something more specialised than the language default.

# CPython listobject.c — minrun computation (Tim Peters)
def merge_compute_minrun(n):
    """
    Pick a minrun in [32, 64] such that n / minrun is just below
    a power of 2. This balances the merge tree depth.
    """
    r = 0                # becomes 1 if any 1 bits are shifted off
    while n >= 64:
        r |= n & 1
        n >>= 1
    return n + r
# Short runs are extended to minrun via binary insertion sort,
# then the merge stack invariant maintains: |X| > |Y| + |Z|, |Y| > |Z|.

External sorting — when data doesn't fit in memory

Sorting more data than fits in memory.

Internal sorts assume the input fits in memory. When it does not, the algorithm has to be reformulated to work in fixed-size chunks streamed from disk. The classical approach is external mergesort: read a memory-sized chunk, sort it internally, write it back to disk as a sorted run; repeat until every byte of input has been processed; then merge the runs in a tournament tree, reading a small buffer from each run and writing the merged output sequentially. The number of passes is the depth of the merge tree, which is logarithmic in the run count.

The classical analysis turns on minimising disk I/O. Every read and every write is a fixed cost set by the storage medium — milliseconds on rotating disk, microseconds on NVMe SSD — and the algorithm spends almost all of its time waiting for I/O rather than computing. This is why a sort that does k merge passes on a dataset of n bytes performs roughly k × n / B sequential I/O operations where B is the page size, and why the engineering effort goes into making k as small as possible. Modern external sorts use a wide-fanout merge to keep the tree shallow, often combining many input runs in a single pass at the cost of a larger memory buffer per run.

Parallel sorting on multi-core machines uses the same divide-and-conquer skeleton but with parallel sort of the chunks and parallel merge of the partial results. The parallel mergesort implementations in Intel TBB and oneAPI partition the input into one chunk per worker, sort each chunk locally, and merge with a parallel-merge primitive that itself parallelises the merge. The overhead is the synchronisation between phases; on shared-memory machines the speedup is sub-linear in the core count because the merge step is harder to parallelise than the local-sort step. Parallel quicksort exists too, but the load-balancing problem is harder because the recursive partition produces unequal sub-arrays.

GPU sorting goes further. Modern radix sort implementations on a high-end GPU achieve sustained throughputs that dwarf any CPU sort by several orders of magnitude on integer keys. The CUDA Thrust library, the modern CUB primitives, and the work coming out of NVIDIA's research labs continually push the constant factor down. The price is that the keys must be GPU-resident and of a shape the GPU sort understands; arbitrary comparator-based sorts on heterogeneous objects do not fit. For the workloads that do fit — scientific computing, ranking systems, sort-then-aggregate database engines — the GPU sort is the right primitive.

Sort once, query many

External sort's I/O cost dominates everything else. The trick is to make the sort once, persist the result, and answer many queries against the sorted file. This is why log-structured merge trees and column-store segments are sorted on insert: the cost is amortised over thousands of subsequent point and range lookups.

The engineering behind a fast quicksort

The engineering behind a fast quicksort.

Quicksort's worst case is famously bad: a pivot that is always the minimum or maximum of the partition turns the algorithm into O(n²). The textbook fix is randomisation, picking the pivot uniformly at random from the partition; this gives expected O(n log n) over the algorithm's coin flips regardless of the input. The practical fix is median-of-three: pick the median of the first, middle, and last elements as the pivot. This handles the common case of already-sorted or reverse-sorted input cheaply and avoids the cost of generating random numbers in the inner loop. Sedgewick's tuned quicksort used median-of-three; Bentley-McIlroy's qsort used median-of-nine for larger partitions.

A second cluster of optimisations targets the partition itself. The classical Hoare partition uses two indices that scan inward from the ends, swapping out-of-place pairs; the Lomuto partition uses a single index that walks left-to-right, swapping smaller-than-pivot elements forward. Hoare partitions are slightly more efficient on average but harder to implement correctly. Pdqsort and the modern fast quicksorts use block partitioning: process a fixed-size block at a time, computing the comparison results into a small bitmap and then doing the swaps as a separate phase. The reason this is faster is that it eliminates branch mispredictions inside the partition loop; the comparison and the swap are decoupled, the bitmap is computed branchlessly, and the modern CPU pipelines through the result. Edelkamp and Weiss's BlockQuicksort paper laid the groundwork.

A third cluster targets the recursion. Tail-call optimisation on the larger partition keeps the recursion depth to O(log n) in the worst case; introsort's switch to heapsort at deep recursion depths bounds the worst case at O(n log n); small-partition cutoff to insertion sort handles the leaves of the recursion much faster than recursive partitioning would. Most production quicksorts cut over to insertion sort for partitions smaller than around twenty elements, because insertion sort's per-element constants are dramatically lower at that size despite the worse asymptotic.

The cumulative effect of these optimisations is a quicksort that is two to four times faster than a textbook implementation on random input and significantly faster than a textbook mergesort on the same data, while being competitive with or better than a stable sort on partially-sorted input. Pdqsort, the modern flagship of this lineage, runs at a few nanoseconds per element on modern CPUs for small fixed-width keys, with throughput limited primarily by memory bandwidth rather than by computation. The textbook quicksort and the production pdqsort have the same algorithmic skeleton but very different performance profiles.

Selection, partial sort, and structures that beat full sorting

Selection, partial sort, and structures that beat full sorting.

Many problems that look like sorting problems have cheaper specialised algorithms. Selection — finding the k-th smallest element — can be done in expected O(n) time with quickselect, and in worst-case O(n) with the median-of-medians algorithm. If you only need the k smallest elements rather than the full sorted order, partial sort or heap-based selection beats a full sort by a substantial factor. C++'s std::partial_sort and Python's heapq.nsmallest are the standard library expressions of this idea.

Order statistics from streaming data avoid the need to materialise a sorted dataset at all. T-digests, KLL sketches, and the older Greenwald-Khanna algorithm compute approximate quantiles online with bounded error in sublinear memory; the trade is that the answer is approximate, but for monitoring and alerting workloads the approximation is often accurate enough that paying the full sort cost is wasteful. Postgres' percentile_disc and percentile_cont functions sort, but several extensions add streaming-quantile types that do not.

For workloads where the data is conceptually ordered but is being maintained dynamically — a leaderboard, an order book, a priority queue — a sorted data structure beats repeated sorting. A balanced binary search tree (red-black, AVL), a skip list, a B-tree, or a heap each maintains order on insertions and deletions in O(log n) per operation, far cheaper than re-sorting the dataset on every change. The choice between them depends on the access pattern: heaps are fastest for "give me the top few" workloads, BSTs and B-trees are best for range queries and ordered traversal, skip lists win for concurrent write workloads where the simpler algorithms reduce lock contention.

Finally, sometimes the right answer is not to sort. A hash join in a database is faster than a sort-merge join when the inner relation fits in memory; a hash aggregation beats a sort-and-group when group counts are small relative to the input. Modern columnar query engines pick between sort-based and hash-based plans dynamically based on cardinality estimates. The point is that sorting is a means to an end — usually because the downstream consumer benefits from sorted data — and the overall pipeline cost matters more than the cost of any individual step. A careful query planner asks whether the downstream actually needs sorted output before paying for the sort.