06 / 14

Internals / 06

Query planner

Selinger and the System R team, 1979. Every cost-based query optimizer since has followed the same skeleton: parse SQL to an AST, rewrite into relational algebra, list the physical plans, score each with a cost model, pick the cheapest. The hard part is the estimates, and they are wrong more often than anyone admits.

From SQL to a tree.

A query starts as a string. The parser turns it into an abstract syntax tree: table refs, column refs, predicates, joins, projections, aggregates. Semantic analysis resolves names against the catalog and checks types. The planner then rewrites the AST into a logical plan, a relational-algebra tree of selections (σ), projections (π), joins (⋈), aggregates (γ).

From the logical plan, the optimizer lists out physical plans. Each operator in the logical tree has several physical forms. A join is nested-loop, hash, or merge; a scan is sequential or index. Each combination is a candidate plan. The cost model scores them. The cheapest survives and goes to the executor.

Selinger's 1979 System R paper laid out this pipeline. Modern engines (Postgres, MySQL 8, SQL Server, Oracle, Spark Catalyst) all use a recognisable descendant.

Logical rewrites: cheap wins before costing.

Before costing anything, the planner applies algebraic rewrites that almost always pay off. They shrink the search space and cut row counts before the expensive operators run.

Predicate pushdown. Push WHERE conditions as close to the scans as possible. Filtering a 100M-row table to 50k before joining beats joining first.
Projection pushdown. Stop reading columns nobody downstream needs. Matters most for columnar stores — see C-Store.
Subquery flattening. Rewrite correlated EXISTS / IN into semi-joins. Postgres flattens most uncorrelated subqueries; correlated ones are still a planning hazard.
View inlining. Substitute the view's definition so the planner sees the whole query.
Common subexpression elimination, constant folding, CASE-WHEN simplification. Plain algebra. It removes work the executor would otherwise repeat per row.

Statistics: the input the cost model trusts.

Every cost-based planner depends on row-count estimates per relation and per predicate. Postgres's ANALYZE samples roughly 300 × default_statistics_target rows per column. At the default target of 100, that's 30,000 rows; at a target of 1000 it climbs to 300,000. The sample feeds two structures: an equi-depth histogram (100 buckets by default) and a list of most-common-values with their frequencies.

From these the planner answers questions like "how many rows will pass WHERE created_at > '2026-01-01'?" by counting the buckets above the boundary. Range queries on a single well-distributed column land within a few percent. Equality on a most-common-value is near-exact. Everything else gets worse.

Run ANALYZE after a bulk load. The planner's worst plans almost always come from stats that haven't been refreshed. Postgres autovacuum triggers ANALYZE after a configurable fraction of the table changes (autovacuum_analyze_scale_factor = 0.1 by default). After a one-shot COPY of millions of rows, run it by hand before the next deploy.

Cardinality estimation: the hard part.

Single-column estimates are manageable. Multi-column correlated predicates are where every optimizer bleeds. The classic example: WHERE city = 'SF' AND state = 'CA'. The naive estimate multiplies selectivities (P(city=SF) × P(state=CA)) and under-counts by orders of magnitude, because the columns are perfectly correlated. Every SF row is in CA.

Engines have layered fixes:

Postgres CREATE STATISTICS. Extended stats for functional dependencies, most-common-values pairs, and n-distinct combinations across column sets.
SQL Server filtered statistics. Histograms scoped by predicate, so a column with skew per tenant doesn't blur into one global distribution.
Oracle dynamic sampling. At plan time the optimizer runs a small query to sample the real selectivity. Costs latency, buys accuracy.

-- Postgres: tell the planner that city implies state
CREATE STATISTICS city_state_stats (dependencies, mcv)
  ON city, state FROM addresses;

ANALYZE addresses;

-- Now the estimate for city='SF' AND state='CA'
-- uses the joint distribution, not the product.
EXPLAIN SELECT * FROM addresses
 WHERE city = 'SF' AND state = 'CA';

The Leis 2015 paper "How Good Are Query Optimizers, Really?" measured estimation error on the JOB benchmark across major systems. Errors of 10× or worse were common, 1000× not unusual on 4-way+ joins. The cost model runs on noisy inputs.

The cost model.

Once cardinalities are estimated, the planner attaches a cost to each operator. Postgres's model is a good example:

GUC	Default	What it represents
`seq_page_cost`	1.0	Cost of a sequential 8KB page read
`random_page_cost`	4.0	Cost of a random 8KB page read
`cpu_tuple_cost`	0.01	Per-row CPU work
`cpu_index_tuple_cost`	0.005	Per-row CPU for index entries
`cpu_operator_cost`	0.0025	Per-call cost of a comparison/function
`effective_cache_size`	4 GB	Hint for cache-aware costing

Each operator combines these. A sequential scan costs seq_page_cost × pages + cpu_tuple_cost × rows. A hash join adds the cost of building the hash table from the smaller input and probing it with the larger. A nested loop scales with the product of inputs unless the inner side is indexed.

On SSDs the default random_page_cost = 4.0 overstates the random-vs-sequential ratio. Most Postgres tunings drop it to 1.1 or 1.5, which nudges the planner toward index scans. The same edit in reverse, raising it, pushes toward seq scans, which can be the right call on heavily-cached small tables.

Join ordering: NP-hard, in practice.

Given n tables, the number of left-deep join trees is n!; bushy trees are worse. Even Selinger's dynamic programming, which prunes by keeping only the best plan per subset, is O(3ⁿ) in time and O(2ⁿ) in memory. At n = 12 that's half a million subsets, still fine. At n = 20, eight billion.

Engine strategies:

Postgres. DP up to geqo_threshold (default 12 joins), then falls back to a genetic algorithm (GEQO) that samples join orders and evolves the best ones. Non-deterministic; the same query can pick different plans across runs.
SQL Server. Cascades framework, transformation rules over a memo structure, timeout-bounded search.
Oracle. Cost-based with multiple "search strategies" (exhaustive, adaptive). Hints (/*+ LEADING(t1 t2) */) override.
Spark Catalyst. Cost-based join reorder up to spark.sql.cbo.joinReorder.dp.threshold (default 12) tables.

If you join 20 tables, you are not getting an optimal plan. You are getting whatever the heuristic happened to find. Materialise intermediates, split into views the planner can cost separately, or hint the join order explicitly.

Join algorithms: three shapes.

Algorithm	Best when	Cost shape	Memory
Nested loop	Outer is small, inner has matching index	O(\|outer\| × log\|inner\|) with index	O(1)
Hash join	Both sides large, equality predicate, fits work_mem	O(\|build\| + \|probe\|)	O(\|build\|)
Merge join	Both sides already sorted on join key	O(\|left\| + \|right\|)	O(1) if pre-sorted

Hash join is the default for big analytic queries because it parallelises cleanly across cores and handles any join order. Nested loop wins on OLTP point lookups, "find the 12 line items for this order" against an indexed order_id. Merge join is rarer because it needs sorted inputs, but it's the natural choice when sorting was free (sort-merge for joins on the clustered index in InnoDB, for example).

Hash join is the default for big OLAP. When you see nested loops over millions of rows in an analytic plan, the planner has under-estimated the inner cardinality. The fix is almost always better statistics, not a hint.

Index selection: the 10% rule of thumb.

An index scan reads index pages, then chases each pointer to a heap page (random I/O). A sequential scan reads heap pages in order. The crossover depends on how many rows you need versus how big the table is. Postgres's rule of thumb: an index scan starts to beat a seq scan once you're touching less than ~10% of the table. The exact number varies with row width, fill factor, and how much of the heap is cached.

Two cases break the rule:

Index-only scans. If every needed column is in the index, the heap isn't touched at all. The crossover shifts much higher; index scans win even at 50%.
Correlated heap. When the index order matches the physical heap order (clustered in Postgres, the primary key in InnoDB), the "random" I/O is actually sequential. The planner accounts for this via the correlation statistic per index.

See the B-tree deep dive for how the index itself is structured.

EXPLAIN ANALYZE: what the planner actually did.

EXPLAIN shows the estimated plan; EXPLAIN ANALYZE runs the query and reports actual row counts and timings beside the estimates. The gap between estimated and actual rows is where every performance investigation starts.

EXPLAIN (ANALYZE, BUFFERS) SELECT o.id, c.name
  FROM orders o JOIN customers c ON c.id = o.customer_id
 WHERE o.created_at > now() - interval '1 day';

 Hash Join  (cost=312.50..18742.10 rows=4200 width=24)
            (actual time=4.812..91.337 rows=58231 loops=1)
   Hash Cond: (o.customer_id = c.id)
   Buffers: shared hit=2104 read=812
   ->  Index Scan using orders_created_at_idx on orders o
           (cost=0.43..18120.00 rows=4200 width=12)
           (actual time=0.084..68.420 rows=58231 loops=1)
         Index Cond: (created_at > (now() - '1 day'::interval))
   ->  Hash  (cost=210.00..210.00 rows=8200 width=20)
             (actual time=4.612..4.613 rows=8200 loops=1)
         ->  Seq Scan on customers c
                (cost=0.00..210.00 rows=8200 width=20)
                (actual time=0.008..2.114 rows=8200 loops=1)
 Planning Time: 0.412 ms
 Execution Time: 94.118 ms

Two things to read first. The rows ratio on every node: here 4200 estimated vs 58231 actual, a 14× under-estimate that means stats on created_at are stale. And Buffers, which separates cache hits from cold reads. A query whose plan is good but read dominates is I/O-bound, not planner-bound.

Always EXPLAIN ANALYZE before declaring a slow query optimised. EXPLAIN alone shows the planner's hopes. ANALYZE shows the truth. A "fast" plan that estimates 100 rows and returns 10M is not optimised. It just hasn't been measured.

When planners go wrong.

Stale statistics. Bulk load, no ANALYZE, next query thinks the table has 0 rows and picks a nested loop over what is now 50M rows. Pager goes off at 3am.
Parameter sniffing. SQL Server and Oracle cache a plan based on the first parameter value they see. If the first call had a selective value (10 rows) and later calls have a broad one (10M rows), the cached nested-loop plan is a disaster. The fix: OPTION (RECOMPILE), plan guides, or optimize_for_unknown.
Correlated predicates. city='SF' AND state='CA' multiplies selectivities and under-estimates 50×, leading the planner to pick a nested loop on the assumption that the join input is tiny.
Join order disasters. A 14-way join hits GEQO, finds a local minimum, and ships a 4-hour plan when a hinted order runs in 40 seconds.
Function calls in predicates. WHERE lower(email) = $1 disables the index on email. The planner sees an opaque expression and falls back to seq scan. The fix: an expression index, or rewrite the predicate.

The 2017 GitLab outage included a query plan flip after a statistics refresh. The new plan was 100× slower and saturated the primary database. The 2010 Knight Capital incident wasn't a planner bug but the same shape: a behaviour change in production that nobody had reproduced in staging.

Adaptive query execution: replanning mid-flight.

If estimates are always going to be wrong, react at runtime. Modern engines ship adaptive execution:

Spark AQE. Between stages, Spark looks at actual shuffle sizes and can switch a sort-merge join to a broadcast hash join, coalesce small partitions, or re-split skewed ones. Off by default before 3.0, on by default since.
SQL Server adaptive joins. The plan ships both nested-loop and hash operators with a threshold; the actual row count from the build side decides at runtime which to use.
Postgres JIT (LLVM). Not strictly adaptive replanning, but runtime-compiled expressions for tuple deforming and predicate evaluation. Triggers once estimated cost exceeds jit_above_cost (default 100,000).
Oracle Adaptive Plans. Operators with statistics collectors switch strategies (hash vs nested loop, distribution method for parallel joins) after seeing real row counts.

Adaptive execution doesn't fix bad statistics, it bounds the damage. The right plan still depends on the right estimates.

Planner traits across engines.

Engine	Search strategy	Stats	Adaptive
Postgres	DP ≤ 12 joins, GEQO above	Histograms + MCV + extended stats	JIT only
MySQL 8	Greedy + exhaustive on small queries	Histograms (since 8.0), no MCV	None
SQL Server	Cascades / memo, timeout-bounded	Histograms + filtered stats	Adaptive joins, batch mode
Oracle	CBO, exhaustive + adaptive	Histograms + dynamic sampling	Adaptive plans, cardinality feedback
Spark (Catalyst)	Rule-based + CBO, DP ≤ 12	Table + column stats via ANALYZE	AQE (re-plans between stages)

Query planner

From SQL to a tree.

Logical rewrites: cheap wins before costing.

Statistics: the input the cost model trusts.

Cardinality estimation: the hard part.

The cost model.

Join ordering: NP-hard, in practice.

Join algorithms: three shapes.

Index selection: the 10% rule of thumb.

EXPLAIN ANALYZE: what the planner actually did.

When planners go wrong.

Adaptive query execution: replanning mid-flight.

Planner traits across engines.

Further reading.

07 — Replication