Query planner
Selinger and the System R team, 1979. Every cost-based query optimizer since has followed the same skeleton: parse SQL to an AST, rewrite into relational algebra, list the physical plans, score each with a cost model, pick the cheapest. The hard part is the estimates, and they are wrong more often than anyone admits.
From SQL to a tree.
A query starts as a string. The parser turns it into an abstract syntax tree: table refs, column refs, predicates, joins, projections, aggregates. Semantic analysis resolves names against the catalog and checks types. The planner then rewrites the AST into a logical plan, a relational-algebra tree of selections (σ), projections (π), joins (⋈), aggregates (γ).
From the logical plan, the optimizer lists out physical plans. Each operator in the logical tree has several physical forms. A join is nested-loop, hash, or merge; a scan is sequential or index. Each combination is a candidate plan. The cost model scores them. The cheapest survives and goes to the executor.
Selinger's 1979 System R paper laid out this pipeline. Modern engines (Postgres, MySQL 8, SQL Server, Oracle, Spark Catalyst) all use a recognisable descendant.
Logical rewrites: cheap wins before costing.
Before costing anything, the planner applies algebraic rewrites that almost always pay off. They shrink the search space and cut row counts before the expensive operators run.
- Predicate pushdown. Push
WHEREconditions as close to the scans as possible. Filtering a 100M-row table to 50k before joining beats joining first. - Projection pushdown. Stop reading columns nobody downstream needs. Matters most for columnar stores — see C-Store.
- Subquery flattening. Rewrite correlated
EXISTS/INinto semi-joins. Postgres flattens most uncorrelated subqueries; correlated ones are still a planning hazard. - View inlining. Substitute the view's definition so the planner sees the whole query.
- Common subexpression elimination, constant folding, CASE-WHEN simplification. Plain algebra. It removes work the executor would otherwise repeat per row.
Statistics: the input the cost model trusts.
Every cost-based planner depends on row-count estimates per relation and per predicate.
Postgres's ANALYZE samples roughly
300 × default_statistics_target rows per column. At the default target of
100, that's 30,000 rows; at a target of 1000 it climbs to 300,000. The sample feeds two
structures: an equi-depth histogram (100 buckets by default) and a list of
most-common-values with their frequencies.
From these the planner answers questions like "how many rows will pass
WHERE created_at > '2026-01-01'?" by counting the buckets above the
boundary. Range queries on a single well-distributed column land within a few percent.
Equality on a most-common-value is near-exact. Everything else gets worse.
ANALYZE after a configurable fraction of the table changes
(autovacuum_analyze_scale_factor = 0.1 by default). After a one-shot
COPY of millions of rows, run it by hand before the next deploy.Cardinality estimation: the hard part.
Single-column estimates are manageable. Multi-column correlated predicates are where every
optimizer bleeds. The classic example: WHERE city = 'SF' AND state = 'CA'.
The naive estimate multiplies selectivities (P(city=SF) × P(state=CA)) and
under-counts by orders of magnitude, because the columns are perfectly correlated. Every
SF row is in CA.
Engines have layered fixes:
- Postgres
CREATE STATISTICS. Extended stats for functional dependencies, most-common-values pairs, and n-distinct combinations across column sets. - SQL Server filtered statistics. Histograms scoped by predicate, so a column with skew per tenant doesn't blur into one global distribution.
- Oracle dynamic sampling. At plan time the optimizer runs a small query to sample the real selectivity. Costs latency, buys accuracy.
-- Postgres: tell the planner that city implies state
CREATE STATISTICS city_state_stats (dependencies, mcv)
ON city, state FROM addresses;
ANALYZE addresses;
-- Now the estimate for city='SF' AND state='CA'
-- uses the joint distribution, not the product.
EXPLAIN SELECT * FROM addresses
WHERE city = 'SF' AND state = 'CA';The Leis 2015 paper "How Good Are Query Optimizers, Really?" measured estimation error on the JOB benchmark across major systems. Errors of 10× or worse were common, 1000× not unusual on 4-way+ joins. The cost model runs on noisy inputs.
The cost model.
Once cardinalities are estimated, the planner attaches a cost to each operator. Postgres's model is a good example:
| GUC | Default | What it represents |
|---|---|---|
seq_page_cost | 1.0 | Cost of a sequential 8KB page read |
random_page_cost | 4.0 | Cost of a random 8KB page read |
cpu_tuple_cost | 0.01 | Per-row CPU work |
cpu_index_tuple_cost | 0.005 | Per-row CPU for index entries |
cpu_operator_cost | 0.0025 | Per-call cost of a comparison/function |
effective_cache_size | 4 GB | Hint for cache-aware costing |
Each operator combines these. A sequential scan costs
seq_page_cost × pages + cpu_tuple_cost × rows. A hash join adds the cost
of building the hash table from the smaller input and probing it with the larger. A
nested loop scales with the product of inputs unless the inner side is indexed.
On SSDs the default random_page_cost = 4.0 overstates the random-vs-sequential
ratio. Most Postgres tunings drop it to 1.1 or 1.5, which nudges the planner toward
index scans. The same edit in reverse, raising it, pushes toward seq scans, which can
be the right call on heavily-cached small tables.
Join ordering: NP-hard, in practice.
Given n tables, the number of left-deep join trees is n!; bushy trees are worse. Even Selinger's dynamic programming, which prunes by keeping only the best plan per subset, is O(3n) in time and O(2n) in memory. At n = 12 that's half a million subsets, still fine. At n = 20, eight billion.
Engine strategies:
- Postgres. DP up to
geqo_threshold(default 12 joins), then falls back to a genetic algorithm (GEQO) that samples join orders and evolves the best ones. Non-deterministic; the same query can pick different plans across runs. - SQL Server. Cascades framework, transformation rules over a memo structure, timeout-bounded search.
- Oracle. Cost-based with multiple "search strategies" (exhaustive,
adaptive). Hints (
/*+ LEADING(t1 t2) */) override. - Spark Catalyst. Cost-based join reorder up to
spark.sql.cbo.joinReorder.dp.threshold(default 12) tables.
Join algorithms: three shapes.
| Algorithm | Best when | Cost shape | Memory |
|---|---|---|---|
| Nested loop | Outer is small, inner has matching index | O(|outer| × log|inner|) with index | O(1) |
| Hash join | Both sides large, equality predicate, fits work_mem | O(|build| + |probe|) | O(|build|) |
| Merge join | Both sides already sorted on join key | O(|left| + |right|) | O(1) if pre-sorted |
Hash join is the default for big analytic queries because it parallelises cleanly across
cores and handles any join order. Nested loop wins on OLTP point lookups,
"find the 12 line items for this order" against an indexed
order_id. Merge join is rarer because it needs sorted inputs,
but it's the natural choice when sorting was free (sort-merge for joins on the clustered
index in InnoDB, for example).
Index selection: the 10% rule of thumb.
An index scan reads index pages, then chases each pointer to a heap page (random I/O). A sequential scan reads heap pages in order. The crossover depends on how many rows you need versus how big the table is. Postgres's rule of thumb: an index scan starts to beat a seq scan once you're touching less than ~10% of the table. The exact number varies with row width, fill factor, and how much of the heap is cached.
Two cases break the rule:
- Index-only scans. If every needed column is in the index, the heap isn't touched at all. The crossover shifts much higher; index scans win even at 50%.
- Correlated heap. When the index order matches the physical heap order
(
clustered in Postgres, the primary key in InnoDB), the "random" I/O is actually sequential. The planner accounts for this via thecorrelationstatistic per index.
See the B-tree deep dive for how the index itself is structured.
EXPLAIN ANALYZE: what the planner actually did.
EXPLAIN shows the estimated plan; EXPLAIN ANALYZE runs the
query and reports actual row counts and timings beside the estimates. The gap between
estimated and actual rows is where every performance investigation starts.
EXPLAIN (ANALYZE, BUFFERS) SELECT o.id, c.name
FROM orders o JOIN customers c ON c.id = o.customer_id
WHERE o.created_at > now() - interval '1 day';
Hash Join (cost=312.50..18742.10 rows=4200 width=24)
(actual time=4.812..91.337 rows=58231 loops=1)
Hash Cond: (o.customer_id = c.id)
Buffers: shared hit=2104 read=812
-> Index Scan using orders_created_at_idx on orders o
(cost=0.43..18120.00 rows=4200 width=12)
(actual time=0.084..68.420 rows=58231 loops=1)
Index Cond: (created_at > (now() - '1 day'::interval))
-> Hash (cost=210.00..210.00 rows=8200 width=20)
(actual time=4.612..4.613 rows=8200 loops=1)
-> Seq Scan on customers c
(cost=0.00..210.00 rows=8200 width=20)
(actual time=0.008..2.114 rows=8200 loops=1)
Planning Time: 0.412 ms
Execution Time: 94.118 msTwo things to read first. The rows ratio on every node: here
4200 estimated vs 58231 actual, a 14× under-estimate that
means stats on created_at are stale. And Buffers, which
separates cache hits from cold reads. A query whose plan is good but
read dominates is I/O-bound, not planner-bound.
EXPLAIN alone shows the planner's hopes. ANALYZE shows the
truth. A "fast" plan that estimates 100 rows and returns 10M is not optimised. It just
hasn't been measured.When planners go wrong.
- Stale statistics. Bulk load, no
ANALYZE, next query thinks the table has 0 rows and picks a nested loop over what is now 50M rows. Pager goes off at 3am. - Parameter sniffing. SQL Server and Oracle cache a plan based on the
first parameter value they see. If the first call had a selective value (10 rows) and
later calls have a broad one (10M rows), the cached nested-loop plan is a disaster.
The fix:
OPTION (RECOMPILE), plan guides, oroptimize_for_unknown. - Correlated predicates.
city='SF' AND state='CA'multiplies selectivities and under-estimates 50×, leading the planner to pick a nested loop on the assumption that the join input is tiny. - Join order disasters. A 14-way join hits GEQO, finds a local minimum, and ships a 4-hour plan when a hinted order runs in 40 seconds.
- Function calls in predicates.
WHERE lower(email) = $1disables the index onemail. The planner sees an opaque expression and falls back to seq scan. The fix: an expression index, or rewrite the predicate.
The 2017 GitLab outage included a query plan flip after a statistics refresh. The new plan was 100× slower and saturated the primary database. The 2010 Knight Capital incident wasn't a planner bug but the same shape: a behaviour change in production that nobody had reproduced in staging.
Adaptive query execution: replanning mid-flight.
If estimates are always going to be wrong, react at runtime. Modern engines ship adaptive execution:
- Spark AQE. Between stages, Spark looks at actual shuffle sizes and can switch a sort-merge join to a broadcast hash join, coalesce small partitions, or re-split skewed ones. Off by default before 3.0, on by default since.
- SQL Server adaptive joins. The plan ships both nested-loop and hash operators with a threshold; the actual row count from the build side decides at runtime which to use.
- Postgres JIT (LLVM). Not strictly adaptive replanning, but
runtime-compiled expressions for tuple deforming and predicate evaluation. Triggers
once estimated cost exceeds
jit_above_cost(default 100,000). - Oracle Adaptive Plans. Operators with statistics collectors switch strategies (hash vs nested loop, distribution method for parallel joins) after seeing real row counts.
Adaptive execution doesn't fix bad statistics, it bounds the damage. The right plan still depends on the right estimates.
Planner traits across engines.
| Engine | Search strategy | Stats | Adaptive |
|---|---|---|---|
| Postgres | DP ≤ 12 joins, GEQO above | Histograms + MCV + extended stats | JIT only |
| MySQL 8 | Greedy + exhaustive on small queries | Histograms (since 8.0), no MCV | None |
| SQL Server | Cascades / memo, timeout-bounded | Histograms + filtered stats | Adaptive joins, batch mode |
| Oracle | CBO, exhaustive + adaptive | Histograms + dynamic sampling | Adaptive plans, cardinality feedback |
| Spark (Catalyst) | Rule-based + CBO, DP ≤ 12 | Table + column stats via ANALYZE | AQE (re-plans between stages) |
Further reading.
- Selinger et al. (1979) — Access Path Selection in a Relational DBMS The System R paper. Every cost-based optimizer descends from here.
- Leis et al. (2015) — How Good Are Query Optimizers, Really? The JOB benchmark. Quantifies how badly estimates degrade past 4-way joins.
- Postgres docs — Planner / Optimizer What each GUC means and how the planner uses it.
- Postgres CREATE STATISTICS Extended stats for correlated columns.
- CMU 15-721 — Advanced Database Systems (Pavlo) The query optimization lectures are the best modern survey.
- Semicolony — C-Store paper notes Projection pushdown matters most when the storage layer is columnar.
- Semicolony — B-trees deep dive The index structure the planner is choosing whether to use.
- CMU 15-445 — Andy Pavlo's full lecture series The query planning and execution lectures are essential viewing.