SQL Query Execution Simulator: EXPLAIN ANALYZE, animated.
Pick a query. Watch the planner build a tree, then watch tuples flow through each operator. Toggle indexes and table size, and watch the plan flip between Nested Loop and Hash Join: the move that makes or breaks a query at scale.
SELECT o.id, u.email FROM orders o JOIN users u ON u.id = o.user_id WHERE o.total > 500;
You tell SQL what you want; the planner figures out how
One SELECT statement, potentially hundreds of execution plans.
A SQL query says what you want, not how to get it. The database's planner reads the parsed SQL, looks at the schema, looks at the table statistics, and produces an execution plan: a tree of operators. That tree is what actually runs. Two queries that look identical can run a thousand times apart because of how the planner shaped the tree.
The simulator above lets you toggle the conditions (index present, table size, stats freshness) and watch the plan change. The same SELECT can become a one-page index lookup or a multi-million-row sequential scan depending on those toggles. Real production databases have hundreds of such toggles: index existence, statistics distribution, planner configuration parameters, available memory.
The skill the simulator builds is the ability to look at EXPLAIN ANALYZE output and understand what each line means, why the planner chose it, and what would have to change for it to choose something else.
The vocabulary of plan trees
Seq Scan
Read every page of a table from disk, return all tuples (optionally filtering after read). O(N) where N is table size. The right choice when most of the table matches (or when there's no index to help).
Index Scan
Walk a B-tree index to find matching keys, then fetch the corresponding heap tuples. O(log N + matches). Better when the predicate is selective (matches a small fraction of the table).
Index-Only Scan
Same as Index Scan but the query needs only columns that are in the index, so no heap fetch is required. The fastest scan when applicable.
Bitmap Scan
Two-step: first the index is walked to build a bitmap of matching pages, then those pages are scanned in physical order (more cache-friendly than random access). Good for medium-selectivity predicates.
Nested Loop
For each tuple from the outer table, scan the inner table. O(N × M) in the worst case. Fast when the outer is small or the inner has a useful index for the join key.
Hash Join
Build a hash table from one side (usually the smaller), then probe with each tuple from the other side. O(N + M). Standard for equi-joins on medium-to-large tables.
Merge Join (Sort-Merge)
Both inputs sorted by the join key, then walked in lockstep. O(N + M) after the sort. Best when inputs come pre-sorted (e.g., from index scans on the join key), worse if a sort is required first.
HashAggregate
Build a hash table keyed by GROUP BY columns, accumulate per-group aggregates. O(N). Used when GROUP BY values fit in memory.
GroupAggregate
Used when input is already sorted by the GROUP BY key: single pass, no hash table. Lower memory than HashAggregate.
Sort
ORDER BY when no useful index. O(N log N), in-memory if possible, falls back to external merge sort on disk for large inputs.
Limit
Cap on rows returned. Sometimes the planner can stop work early; sometimes it has to materialise everything first (e.g., before a Sort) before Limit applies.
Materialize
Cache the output of a subplan in memory so it can be re-scanned by a parent operator (typically for the inner side of a Nested Loop where re-execution is too expensive).
Append
Concatenate output from multiple child plans. Used for UNION ALL and for partitioned tables.
The most-misread output in databases
How to extract the right information at a glance.
EXPLAIN ANALYZE
SELECT o.id, u.email
FROM orders o
JOIN users u ON u.id = o.user_id
WHERE o.total > 500;
QUERY PLAN
---------------------------------------------------------------------------------
Hash Join (cost=12.50..1834.20 rows=2000 width=44)
(actual time=0.412..89.234 rows=1847 loops=1)
Hash Cond: (o.user_id = u.id)
-> Seq Scan on orders o (cost=0.00..1820.00 rows=2000 width=12)
(actual time=0.024..82.103 rows=1847 loops=1)
Filter: (total > 500)
Rows Removed by Filter: 98153
-> Hash (cost=10.00..10.00 rows=2000 width=36)
(actual time=0.378..0.378 rows=2000 loops=1)
Buckets: 4096 Batches: 1 Memory Usage: 152kB
-> Seq Scan on users u (cost=0.00..10.00 rows=2000 width=36)
(actual time=0.012..0.198 rows=2000 loops=1)
Planning Time: 0.342 ms
Execution Time: 89.890 ms Read it bottom-up to see what runs first, top-down to see what wraps everything. Each operator line has two cost numbers: estimated cost (units arbitrary, lower=cheaper) and estimated rows. EXPLAIN ANALYZE adds actual time and actual rows after running the query.
The single most important comparison: estimated rows vs actual rows. If the planner thought 10 rows and got 10 million, every choice it made downstream was wrong: wrong join algorithm, wrong sort method, wrong memory allocation. The fix is almost always running ANALYZE to refresh statistics.
Other things to scan for: Rows Removed by Filter (work done that produced nothing), Memory Usage (does the operator fit in work_mem or spill to disk?), Buckets and Batches (hash table efficiency), Workers Planned vs Launched (parallel query effectiveness).
The same query, vastly different plans
Try toggling the table size in the simulator on query #3 (JOIN two tables). With small or medium tables and an index on users, the planner picks Nested Loop. Crank to large and it switches to Hash Join. The planner has decided that the per-row cost of probing the index N million times exceeds the cost of building a hash table once.
This is the cost-based optimiser at work. For each candidate plan, it computes estimated cost using the table statistics. Then it picks the cheapest. The set of candidates can be huge, so the planner uses dynamic programming or, beyond a threshold, a genetic algorithm (Postgres) or heuristics to prune.
The same flip happens for filtering: with a selective predicate and an index, you get Index Scan + Filter. With a non-selective predicate, the planner switches to Seq Scan because reading every page sequentially is faster than the random I/O of jumping around following the index.
And it happens for aggregation: HashAggregate when the group count fits memory; GroupAggregate (after Sort) when it doesn't, or when input is already sorted; the work_mem setting controls the threshold.
Which indexes actually help
The discipline that most teams don't take seriously enough.
Single-column B-tree
The default and most useful. Works for equality (= , IN), range (<, >, BETWEEN), and prefix LIKE ('foo%'). Used by almost every query against the column. Cost: roughly the size of the table's column plus pointer overhead. Maintenance cost on every UPDATE/INSERT/DELETE.
Composite (multi-column) B-tree
Index on (a, b, c). Useful for queries that filter on a, on (a, b), or on (a, b, c): the leftmost-prefix rule. Not useful for queries that filter only on b or c. Order matters; put the most selective and most frequently filtered column first.
Partial index
Index that includes only rows matching a predicate: CREATE INDEX ON orders(user_id) WHERE status = 'paid'. Smaller, faster, but only usable when the query has the same predicate. Great for hot subsets of large tables.
Expression index
Index on an expression like LOWER(email) or date_trunc('day', created_at). Lets queries that use the same expression hit the index. Without it, the expression has to be evaluated per row and indexes are useless.
Covering index (INCLUDE)
Index that includes non-key columns so the heap doesn't need to be visited (Index-Only Scan). Pay storage for index size, save random I/O on read.
Hash index
Equality only, no ordering. Postgres made hash indexes WAL-logged in 10, so they're usable in production. Rarely better than B-tree because B-tree handles equality nearly as well.
GIN
Generalized inverted index. For full-text search, JSONB queries (@>, ?), array contains. Slower to build/update, fast to read, large.
GiST
Generalized search tree. For range types, geometric, full-text with custom configurations, trigram similarity. Slower than B-tree but handles types B-tree cannot.
BRIN
Block range index. Stores summary information per block range (min/max for the block range). Tiny: gigabytes of data for a few MB of index. Useful for huge, naturally-ordered tables (time-series, append-only logs).
When NOT to add an index
Small tables (fewer than ~1000 rows): Seq Scan is faster than indirect lookup. Columns with very low cardinality (boolean, gender): Seq Scan plus filter is usually better. Tables with very high write volume where the index maintenance cost exceeds the read savings. Predicates that are never selective.
Three families, three trade-offs
| Algorithm | Time | Memory | Best when |
|---|---|---|---|
| Nested Loop | O(N × M) — O(N log M) with index on inner | O(1) | Small outer, indexed inner. Smallest tables. LATERAL joins. |
| Hash Join | O(N + M) | O(min(N, M)) | Medium-to-large tables, equi-join, both sides fit (or spill manageable). |
| Sort-Merge Join | O(N log N + M log M) — O(N + M) if pre-sorted | O(N + M) if sorting | Both inputs pre-sorted by join key. Range joins (using merge join modifications). |
The planner picks based on cost. For small joins, Nested Loop wins on memory and avoids the build phase. For larger joins, Hash Join's linear time wins. For pre-sorted inputs (often the case after Index Scan on the join key), Sort-Merge avoids both rebuilding hashes and N×M probes.
Outer joins
LEFT/RIGHT/FULL OUTER add the requirement to emit NULL-extended tuples for non-matches. Implementations exist for all three algorithms; cost is similar to inner join plus a small overhead. The planner has to be careful about predicate placement. WHERE on the outer column eliminates rows the OUTER would have preserved.
Semi/anti joins
EXISTS / NOT EXISTS / IN / NOT IN. Hash semi-join builds a hash on the inner, probes once per outer (no need to enumerate matches). Often the right rewrite of a correlated subquery.
Cross product
FROM a, b without an ON clause. Forced O(N × M). Usually a bug (forgot the join condition). Postgres warns; some DBs require explicit CROSS JOIN.
The numbers behind the plan
The cost-based optimiser needs statistics: how many rows each table has, how values are distributed (histograms), correlation between columns, frequency of common values (MCV list), distinct count per column (n_distinct). Without these, the planner is guessing. Its guesses lean toward Seq Scan because that's the safe default.
When stats go stale
UPDATEs and INSERTs don't update statistics in real time; they're refreshed by ANALYZE (manual or auto). The auto-vacuum daemon runs ANALYZE periodically on tables with sufficient churn. If your table doubled in size in the last hour and auto-vacuum hasn't caught up, the planner is working from the old row count, and may pick Nested Loop because it thinks the table is small.
Default sample size
Postgres samples 300 × default_statistics_target rows (default 100, so 30,000). For huge tables this can be too small for representative histograms. Bump default_statistics_target (or use ALTER TABLE col SET STATISTICS) for tables where the planner's row estimates are off.
Correlated columns
Single-column statistics don't capture correlation between columns. If country and language are perfectly correlated but the planner thinks they're independent, multi-column predicates produce wrong row estimates. CREATE STATISTICS gives the planner correlation info; pg_stats stores the result.
Cost model parameters
random_page_cost vs seq_page_cost is the most-tuned ratio. Default 4.0 vs 1.0 (random reads cost 4x sequential), but on SSDs the ratio is closer to 1.0:1.1. Lowering random_page_cost makes the planner favour Index Scans. cpu_tuple_cost, cpu_index_tuple_cost, cpu_operator_cost are the per-row CPU costs. Tune for your workload after measurement; don't trust intuition.
work_mem
The amount of memory each operator gets for sorts and hashes. Too low: operators spill to disk; too high: many concurrent queries OOM. Tune per query if needed (SET LOCAL work_mem = '256MB'; then run the query).
Queries that don't use the index you expected
Functions on indexed columns
WHERE LOWER(email) = 'x@y.com' won't use an index on email. The fix is either an expression index (CREATE INDEX ON users (LOWER(email))) or storing email in a normalized form. Same for date_trunc, ::text casts, substring, etc.
Implicit type casts
WHERE bigint_col = '42' (string literal against bigint column). Postgres may need to cast either side, which can prevent index use depending on cast direction. Use the right literal type or explicit cast.
OR conditions
WHERE a = 1 OR b = 2 with separate indexes on a and b can sometimes use both with BitmapOr, but often the planner gives up and Seq Scans. Rewrite as UNION ALL of two queries, or build a composite index, or accept the scan.
Negation
WHERE col != 5 or NOT IN: usually not selective enough for index use. The planner picks Seq Scan because most rows don't match the value, meaning most rows match the negation.
Leading wildcard
WHERE name LIKE '%foo' can't use a B-tree (which sorts left-to-right). Reverse the column and index it (or use trigram indexes via pg_trgm for fuzzy matching).
OFFSET on large pages
SELECT ... OFFSET 100000 LIMIT 10 makes the database compute the first 100,000 rows anyway. Use keyset pagination (WHERE id > last_seen ORDER BY id LIMIT 10) for cursoring.
SELECT *
Pulls all columns, including large ones (TEXT, JSONB) that may TOAST. Network/serialization cost. Specify columns; let Index-Only Scans work.
N+1 in app code
The plan looks fine for each individual query. The problem is firing 1000 of them in a loop instead of one query with IN(...). EXPLAIN won't catch this — only app-level query counting will.
EXPLAIN on Postgres, MySQL, SQLite, and beyond
The operator vocabulary differs by engine but the shape is universal: tree of operators, cost-based or rule-based planner, statistics-driven estimates, plan visible via EXPLAIN.
| Engine | Command | Notable |
|---|---|---|
| Postgres | EXPLAIN ANALYZE BUFFERS | Most detailed; tree of operators with cost, actual time, buffer hits, memory. |
| MySQL | EXPLAIN FORMAT=TREE / FORMAT=JSON | Tabular default; FORMAT=TREE since 8.0.18 shows tree structure. |
| MariaDB | EXPLAIN / EXPLAIN ANALYZE | Similar to MySQL. |
| SQLite | EXPLAIN QUERY PLAN | Compact textual plan, less detailed. |
| SQL Server | SET SHOWPLAN_XML / Live Query Stats | Graphical plan viewer in Management Studio; rich. |
| Oracle | EXPLAIN PLAN FOR / DBMS_XPLAN.DISPLAY | Cost-based optimiser with hints (/*+ INDEX(t idx) */). |
| BigQuery | EXPLAIN / dry-run | Distributed plan with shuffle stages. |
| Snowflake | Query Profile in UI | Same tree shape with per-stage stats. |
| Spark SQL | EXPLAIN EXTENDED / FORMATTED | Logical plan, optimised plan, physical plan all shown. |
Most operator names map across engines: Hash Join is Hash Join everywhere; Nested Loop sometimes appears as "BNL" (Batched Nested Loop) or just "join"; Sort, Filter, Limit, Aggregate are universal. Learning to read one EXPLAIN well transfers to the others quickly.
When the planner spawns workers
Modern databases parallelise within a single query. Postgres can use parallel Seq Scan, parallel Hash Join, parallel aggregate, parallel index scan. The planner decides whether to parallelise based on table size, cost, and the parallel-related GUCs (max_parallel_workers_per_gather, parallel_setup_cost, parallel_tuple_cost).
In EXPLAIN you'll see "Workers Planned: 4" and "Workers Launched: 4." If launched is less than planned, the system was out of workers; you may need to bump max_worker_processes. Each worker scans its share of the table; results are gathered at the top of the plan tree by a Gather or Gather Merge node.
Parallel query helps mostly on large analytical scans. It rarely helps OLTP queries because the setup cost (forking worker processes) outweighs the per-query savings. Tune for your workload.
Distributed query
On distributed databases (CockroachDB, Spanner, YugabyteDB, BigQuery), the plan also describes shuffles between nodes. Reading the plan is the same skill but with extra rows for "Distributed Shuffle" or "Broadcast" stages between operators. These are often the most expensive parts and worth focusing on first when optimising.
Order-of-magnitude operator cost
| Operator | Rows | Wall time (NVMe SSD, modern CPU) |
|---|---|---|
| Index Scan (PK lookup, cached) | 1 | ~50 µs (single page hit) |
| Index Scan (PK lookup, cold) | 1 | ~200 µs (random SSD read) |
| Seq Scan (10K rows, in shared_buffers) | 10,000 | ~5 ms |
| Seq Scan (10M rows, in shared_buffers) | 10,000,000 | ~5 s |
| Seq Scan (10M rows, from disk) | 10,000,000 | ~10-30 s |
| Hash Join (1M × 100K) | ~100K result | ~500 ms |
| Nested Loop with index (1K outer × 10M indexed inner) | ~1K result | ~30 ms |
| Nested Loop without index (1K × 10M) | any | ~hours — avoid |
| Sort (1M rows, in memory) | 1,000,000 | ~500 ms |
| Sort (10M rows, external merge) | 10,000,000 | ~30 s |
| HashAggregate (1M input, 1K groups) | 1K | ~200 ms |
| Network round trip (same DC) | — | ~0.5 ms |
| Network round trip (cross-region) | — | ~50-150 ms |
These numbers are starting points. Real measurements vary with CPU cache effects, kernel caches, network conditions, and concurrent load. Benchmark against your actual data.
The numbers also explain why index scans are usually preferred for transactional workloads (microseconds per row) and why bulk analytical queries need parallel and columnar storage (the per-row cost dominates).
Common interview questions
"How does a database execute a query?"
Parse → rewrite (views, subquery flattening) → plan (cost-based pick of operator tree) → execute (operators pulled top-down or pushed bottom-up depending on engine). Mention the cost model uses statistics. Optionally mention compiled vs interpreted execution (modern engines like DuckDB JIT-compile hot loops).
"What's the difference between WHERE and HAVING?"
WHERE filters rows before aggregation; HAVING filters groups after aggregation. Use WHERE wherever possible; filtering early reduces the rows that hit the aggregate. HAVING is only for predicates on aggregate expressions (HAVING COUNT(*) > 5).
"How does an index help?"
Two ways. First, it avoids reading the whole table for selective predicates: Index Scan reads only matching rows. Second, an index that covers a sort order avoids the Sort operator (Index Scan on a column with ORDER BY that column is essentially free). Mention the maintenance cost trade-off.
"How do you find a slow query?"
Enable pg_stat_statements / mysql slow log / db.collection.find().explain(). Sort by total time or mean time. For the worst offender, run EXPLAIN ANALYZE. Look for estimate vs actual mismatches, missing indexes (Filter rows scanned vs rows returned), and Sort/Hash operations that spill to disk.
"Should you add an index for every WHERE clause?"
No. Indexes have maintenance cost; too many slows down INSERTs/UPDATEs. For columns rarely filtered or with low cardinality, Seq Scan + Filter is often faster. Profile, add only indexes that the planner actually uses on real query loads.
"What is index selectivity?"
The fraction of rows that match. Highly selective (1 in 100,000): index scan is great. Poorly selective (1 in 2, like boolean): Seq Scan is usually faster because the random I/O of the index doesn't pay off.
"When would you NOT use a hash join?"
When the build side doesn't fit in work_mem and would spill to disk. When inputs are pre-sorted (Sort-Merge is then cheaper). When the outer table is tiny and the inner has an index (Nested Loop is cheaper). For non-equi joins (range joins), since hash joins only work on equality.
From rule-based to cost-based
Early relational databases (System R, Ingres in the 1970s) used rule-based optimisers: fixed heuristics like "always use an index if one exists" or "join smaller table first." These work surprisingly well for simple queries but break on complex ones.
The breakthrough came in System R's 1979 paper "Access Path Selection in a Relational Database Management System" by Selinger et al. They introduced dynamic programming over join orderings with statistics-driven cost estimation. Every modern cost-based optimiser is a descendant.
Postgres ships with both: the standard cost-based optimiser plus a genetic algorithm (GEQO) that kicks in for queries joining more than 12 tables (the dynamic programming becomes too expensive). MySQL has had a cost-based optimiser since 5.7. SQLite uses a simpler greedy optimiser.
The trade-off: cost-based optimisers can find plans rule-based ones miss, but they need accurate statistics, and they sometimes pick a clever-looking plan that's worse than the obvious one. Hints (Oracle, SQL Server) and pg_hint_plan let you override the planner when you know better.
Real-world planner failures
After a big data load
Load 100 GB into a table; query immediately. The planner thinks the table is empty (no ANALYZE has run) and picks Nested Loop. Query takes 6 hours instead of 30 seconds. Fix: ANALYZE every large table after a bulk load before running queries.
Skewed data
One value in a column appears in 80% of rows; everything else is rare. The planner's "average selectivity" estimate is wrong for both: the 80% value should use Seq Scan; the rare values should use Index Scan. Most-Common-Values lists (Postgres pg_stats most_common_vals) help; bumping default_statistics_target captures more MCVs.
Correlated predicates
Two columns that aren't actually independent: city = 'NYC' AND state = 'NY' (every NYC row is NY). The planner multiplies the selectivities and gets a row estimate that's much too low. Postgres CREATE STATISTICS lets you tell it about the correlation.
Bind variable peeking
Prepared statements compile a plan once, reuse for many parameter values. If the first set of parameters produces a Nested Loop plan that's wrong for later parameters, you're stuck. Oracle and SQL Server have adaptive cursor sharing; Postgres recompiles after a few executions.
Operator memory spill
Hash Join builds its hash table in work_mem. If work_mem is too small, the hash spills to disk, and the join is 10-50x slower. Look for "Batches: N" where N > 1 in EXPLAIN. Either bump work_mem for the session or add an index that makes Nested Loop competitive.
Catalog bloat
Repeated CREATE/DROP of tables, columns, indexes inflates pg_catalog. Planner time goes up. Symptom: simple queries are fast at execution but planning takes 50 ms. VACUUM FULL on pg_catalog or judicious use of pg_repack helps.
Filter pushdown failures
Through views, subqueries, and UNIONs, the planner usually pushes WHERE clauses down to the underlying scans. Sometimes it can't (correlated subqueries, certain LATERAL forms), and a predicate that should have filtered at the scan is evaluated at the top, after every row has been materialised. Look for high Rows Removed by Filter at the top of the plan.
Tools that visualise plans
explain.depesz.com: paste Postgres EXPLAIN output, get a color-coded breakdown with timing percentages and "this operator is your bottleneck" highlights. The de facto standard for sharing Postgres plans.
explain.dalibo.com: alternative visualizer with a different layout. Some prefer it for complex plans with many subplans.
pev2 (PEV2): embeddable Postgres EXPLAIN visualizer; useful for building dashboards over your team's query plans.
pg_stat_statements: Postgres extension that tracks every query's cumulative time, calls, rows. The starting point for finding which queries to optimise.
auto_explain: Postgres module that automatically logs the plan for any query exceeding a duration threshold. Catches plans that change over time as data grows.
MySQL Workbench Visual Explain: graphical plan viewer integrated into MySQL Workbench.
SQL Server Management Studio: execution plan viewer is the polished one; live query stats let you watch a long-running query mid-execution.
EverSQL, pganalyze, Datadog Database Monitoring: SaaS that ingest plan output and recommend indexes / surface regressions.
For any team running a meaningful query load, one of pg_stat_statements + auto_explain + a visualizer is the minimum viable observability for query performance. Without them, you're flying blind.
Terms used above, defined
Plan: tree of operators the database will execute to satisfy a query.
Operator (node): single step in the plan: Seq Scan, Hash Join, Sort, etc. Each consumes tuples from its child(ren), produces tuples for its parent.
Cost: planner's estimate of work for an operator, in arbitrary units (typically scaled to "fetching one sequential disk page = 1 cost unit"). Lower = cheaper.
Rows (estimated): planner's estimate of how many tuples this operator will produce. Compared to actuals in EXPLAIN ANALYZE.
Statistics: per-column histograms, most-common-value lists, distinct counts, null fractions. Updated by ANALYZE.
Selectivity: fraction of rows a predicate matches (0 to 1). Index Scan good for low selectivity, Seq Scan good for high.
Heap: the actual table file. Indexes point to heap tuples.
TID: tuple identifier; a (page, offset) pair pointing to a row in the heap.
work_mem: memory budget per operator for sorts and hashes. Exceeded = spill to disk.
shared_buffers: Postgres's in-memory page cache. Hit rate matters more than total size for read performance.
WAL: write-ahead log. All writes go here first for durability before being applied to data files.
MVCC: multi-version concurrency control. Readers don't block writers. Each transaction sees its own consistent snapshot.
ANALYZE: command that samples the table and updates statistics. Required for cost-based optimiser to make good decisions.
VACUUM: Postgres command that reclaims space from dead tuples (MVCC residue). Auto-vacuum runs in background.
JIT: Just-in-time compilation of plan expressions to native code. Postgres 11+ supports for analytical workloads.
Buffers: pages read from cache (hit) vs disk (read). EXPLAIN BUFFERS shows the split.
BNL: Batched Nested Loop. MySQL's optimisation: collect a batch of outer rows, then probe the inner once per batch.
LATERAL: SQL keyword allowing a subquery in FROM to reference columns of preceding FROM items. Often the cleanest way to express per-row computations that need a subquery.
CTE: common table expression (WITH). Historically a planner barrier in Postgres (always materialised); since 12, can be inlined.
Window function: function that operates over a window of related rows (RANK, LAG, ROW_NUMBER). Executed by WindowAgg operator.
Try these
1. On query #1 in the simulator, toggle the index. Estimated cost drops from ~1200 to ~4. That's roughly the speedup you'd see in real Postgres.
2. On query #3, toggle the table size from medium to large. Plan flips from Nested Loop to Hash Join. Why? Hint: the planner thinks the per-row cost of probing the inner index N times exceeds the cost of building the hash once.
3. Set stats to "stale" and watch the cost estimate balloon by ~80%. In real Postgres, stale stats can cause plan flips in either direction; the wrong plan is often vastly more expensive.
4. In a local Postgres, run CREATE TABLE t (id INT, v INT); INSERT INTO t SELECT i, random()*100 FROM generate_series(1, 1000000) i;. Run EXPLAIN ANALYZE SELECT count(*) FROM t WHERE v = 50;. Note the plan and cost. Now CREATE INDEX ON t(v); and run again. Compare.
5. Same table, run EXPLAIN ANALYZE SELECT count(*) FROM t WHERE v < 50;. Now the predicate matches ~50% of rows — does the planner use the index? Often not. Why?
6. Generate two related tables (users + orders, with orders.user_id referencing users.id). Run a JOIN; observe the plan. Vary the order of FROM tables; the plan should be the same — the planner reorders joins.
7. Use pg_stat_statements (or your engine's slow query log). Find the top 5 queries by total time. EXPLAIN ANALYZE each. Identify what's expensive (Seq Scan? Sort spill? Bad join algorithm?). Propose one fix per query.
8. Read https://use-the-index-luke.com/ — the definitive book on SQL indexing, free online. Bookmark and return.
9. Run the same query under different work_mem settings. Watch the plan switch between HashAggregate and Sort+GroupAggregate. SET work_mem = '1MB' then '64MB' then '256MB' on a large GROUP BY query.
10. Find a complex query in your codebase. Run EXPLAIN ANALYZE BUFFERS. Estimate which operator takes the most time before reading the output. Then check. Calibrate intuition.
The core five
1. SQL is declarative; the planner picks an operator tree based on cost estimates.
2. EXPLAIN shows the tree; ANALYZE adds actual times and rows. Compare estimates to actuals.
3. The two killer choices are scan type (Index vs Seq) and join algorithm (Nested Loop vs Hash vs Merge).
4. Stale statistics is the #1 cause of bad plans. Run ANALYZE after large data changes.
5. When in doubt, look at "rows" first (estimate vs actual). Get that right and the rest follows.
Why analytical engines are 100x faster on the same query
Traditional row-store engines (Postgres, MySQL, SQL Server's OLTP storage) execute one row at a time through the operator tree. Each operator pulls a tuple, processes it, hands the next tuple up. This is the Volcano model — beautiful and general, but per-row interpretation overhead dominates on analytical workloads.
Columnar engines (DuckDB, Snowflake, BigQuery, ClickHouse, the columnstore tables in SQL Server, Parquet-backed engines) store data column-by-column on disk and execute operators on batches of values at a time. A SUM over a column becomes a SIMD loop over a packed array of integers. Filters become bitmaps applied to columns. Operators read whole batches at a time and apply work to them in a tight loop.
The result: order-of-magnitude speedups on large analytical scans. A 1-billion-row aggregate that takes 30 seconds in a row-store can take 200 ms in a columnar engine on the same hardware. The trade-off is that columnar storage is slow for point lookups (need to read multiple files / column chunks just to reconstruct one row) and updates are expensive (rewriting columns to add a row).
JIT compilation
Modern engines JIT-compile expressions in the plan to native machine code. Postgres 11+ does this via LLVM for analytical queries; DuckDB and ClickHouse use it pervasively. A WHERE x > 5 AND y < 10 expression gets compiled to a handful of SSE/AVX instructions instead of being interpreted via a tree of function calls.
Late materialisation
In a columnar engine, you can apply filters on individual columns before materialising a row. SELECT * FROM t WHERE a = 1 AND b = 2 reads only column a, finds matching positions, reads only column b at those positions, finds matching positions, then reads the remaining columns. Way less I/O than row-store.
The decision rule: OLTP (lots of small reads, lots of writes, point lookups) goes on a row store. OLAP (large aggregates, full table scans, append-only) goes on a column store. Many modern stacks use both — Postgres for primary OLTP, Snowflake or BigQuery for analytics over the same data after CDC.
Forcing the planner's hand
Sometimes the planner picks badly and statistics-tuning isn't enough. The escape hatches:
Postgres: rewrite the query
Common rewrites that change plan: replace subquery with JOIN (or vice versa); CTE materialisation hint (WITH foo AS MATERIALIZED ... in 12+); use LATERAL instead of correlated subquery; pull a complex predicate into a CASE in SELECT; explicitly cast literals; split OR into UNION ALL.
Postgres: pg_hint_plan
Extension that allows planner hints in comments: /*+ HashJoin(t1 t2) IndexScan(t3 idx) */. Use sparingly — hints fossilise decisions that should adapt to data changes.
MySQL: STRAIGHT_JOIN / FORCE INDEX
Built-in hint syntax. STRAIGHT_JOIN locks join order to written; FORCE INDEX (idx_name) forces use of a specific index. Easier than Postgres's extension approach but same trade-off.
SQL Server: query hints
OPTION (LOOP JOIN), OPTION (HASH JOIN), OPTION (RECOMPILE), USE PLAN N'...'. Rich set; full plan control, easily abused.
Oracle: optimiser hints
/*+ INDEX(t idx) USE_HASH(t1 t2) */ in the SELECT clause. The most mature hint system; Oracle DBAs lean on hints heavily.
Last resort: stored plans
Postgres has plan_cache_mode; Oracle has SQL plan baselines; SQL Server has Query Store. Each pins a plan once you've found a good one, so the planner can't pick a worse one as data changes. Critical for queries where plan stability matters more than optimal cost.
The lesson: rewrites are usually better than hints. Hints lock data shape; rewrites let the planner choose appropriately as data grows. But sometimes a hint is the right call — usually because the planner's cost model has a known gap (correlated columns without CREATE STATISTICS, etc.).
Recommended reading
use-the-index-luke.com by Markus Winand. The single best resource on SQL indexing. Covers every engine. Free online; the book version is short and concentrated.
Database Internals by Alex Petrov. Half the book is about query execution: storage engines, indexes, planners. The other half is distributed databases. Both halves are worth your time.
SQL Performance Explained by Markus Winand (the book companion to use-the-index-luke). 200 pages on what makes queries fast. Worth $35.
Postgres documentation: Query Planning (https://www.postgresql.org/docs/current/runtime-config-query.html and the EXPLAIN reference). The most useful production database documentation on the topic.
The original System R paper by Selinger et al., 1979. Still readable, still the foundation. 14 pages.
The Vertica + C-Store papers on columnar execution if you want to understand modern analytical engines.
EXPLAIN.depesz.com as a daily tool for understanding actual queries from your production system.
Bookmark these. Return to them whenever EXPLAIN ANALYZE confuses you. The skills compound — six months of careful query reading makes you the person on the team that everyone brings slow queries to.
Three real plan disasters and their fixes
Case 1: analytics dashboard, midnight regression
A SaaS dashboard query, fast all day, became 30+ seconds at midnight every night. Cause: the auto-vacuum daemon ran ANALYZE on the central transactions table at midnight, refreshing statistics with a new histogram that suggested the planner should use Hash Join. But the actual data distribution was bimodal — most queries hit the small-tenant case where Nested Loop was correct. The fix: ALTER TABLE transactions ALTER COLUMN tenant_id SET STATISTICS 5000; (default is 100), giving the planner a much finer histogram so it could distinguish small-tenant from large-tenant rows.
Case 2: order pagination, 100ms became 60s
A "next page" endpoint using OFFSET 100000 LIMIT 20 became unusable as the table grew. The plan was correct (Index Scan + Limit) but the engine still had to count past 100,000 rows on the index. Fix: rewrite as keyset pagination — WHERE (created_at, id) < (last_seen_created_at, last_seen_id) ORDER BY (created_at, id) DESC LIMIT 20. Index Scan walks 20 rows instead of 100,020. The application now passes the cursor instead of the page number.
Case 3: JOIN against a partitioned table
A reporting query against a partitioned-by-day events table got 10x slower after partitioning. Cause: the planner couldn't push the join predicate down to individual partitions because the partition key was different from the join key. Fix: include the partition key in the join condition (events.day = orders.created_day) so the planner could prune partitions, and create a covering index on the partition's foreign-key column. Query went from a 60-partition fan-out scan to a 3-partition focused scan.
In all three cases, EXPLAIN ANALYZE was the diagnostic tool. The fix was rarely "add an index" — more often it was statistics tuning, query rewriting, or schema adjustment. The skill is reading the plan well enough to know which lever applies.
Frequently asked, briefly answered
"Does the order of JOIN matter?"
For most modern planners, no — the planner reorders joins to find the cheapest tree. There are exceptions: explicit STRAIGHT_JOIN (MySQL) preserves order; the Postgres GUC join_collapse_limit caps how many joins the planner reorders (default 8); CROSS JOIN LATERAL preserves order by definition. For typical 2-5 table joins, write in the order that reads naturally — the planner will reorder.
"Why is my query suddenly slow?"
Probable causes, in order of likelihood: (1) data grew; statistics stale or plan no longer optimal. (2) An index was dropped or a new one shadowed an existing one. (3) Concurrent load on the same tables (lock contention). (4) Disk space pressure → caches getting evicted. (5) Network or storage latency change. EXPLAIN ANALYZE + check pg_stat_user_tables for autovacuum lag.
"Will adding more indexes help reads at the cost of writes?"
Yes, but the cost is asymmetric — each index adds ~10-30% to write cost depending on size. Five indexes on a hot table can double write time. The right answer is usually 2-5 carefully-chosen indexes per table, not 15 reactive ones. Use pg_stat_user_indexes to find indexes that are never used and drop them.
"What's the difference between LEFT JOIN and LEFT OUTER JOIN?"
Nothing. OUTER is optional. Both produce the same plan and the same result. Same for INNER vs implicit JOIN. Stylistic preference only.
"Should I use SELECT * or specify columns?"
Specify columns. Reasons: enables Index-Only Scan; lower network and serialisation cost; doesn't break if the schema gains columns; documents what the query actually needs. The cost of typing column names is paid once; the cost of SELECT * is paid forever.
"How do I know if a query will be slow before I run it?"
Run EXPLAIN (without ANALYZE — no execution, just plan). Look at top-level cost. As a rule of thumb: cost < 1000 = fast, 1000-100,000 = moderate, 100,000+ = will be slow on cold caches. But cost is an estimate; only EXPLAIN ANALYZE on a test dataset tells you actual time.
"Is denormalisation faster?"
For reads, yes — fewer JOINs, less data movement. For writes, no — every update has to maintain consistency across denormalised copies. The right answer depends on read:write ratio and whether stale data is acceptable. Materialised views are a clean middle ground.
"What does 'rows=0' on an estimated row count mean?"
Postgres can't estimate down to zero, so this usually means "the planner doesn't know" (sometimes from no statistics, sometimes from a predicate the optimiser can't reason about). Often a sign that the actual rows will be very different from the estimate.
The whole simulator distilled
"The SQL planner reads your declarative query, picks an operator tree based on cost estimates derived from table statistics, and the tree of scans, joins, filters, aggregates, and sorts is what actually executes — and the most useful debugging skill in databases is reading EXPLAIN ANALYZE to confirm the plan is what you expected and that the estimates match reality."
That's the page in one sentence. Now go open EXPLAIN ANALYZE on your slowest query and read it line by line. Look at the bottom — is the first operator an Index Scan or a Seq Scan? Look at the join algorithms — does the planner agree with the size of your tables? Look at the row estimates vs actuals — is anything off by 10x?
If you can answer those questions, you can fix the query. If you can't, you have a starting point for investigation. Either way, the next time someone asks you to look at a slow query, you have a process.
Final note
Plans aren't static. As data grows, as you add indexes, as ANALYZE updates statistics, the planner may pick different plans for the same query. This is a feature — the cost-based optimiser adapts. But it's also a source of surprise; a query that was fast for years can become slow overnight after a data shift. The defence is observability: pg_stat_statements, auto_explain, slow-query logs, plan capture. Treat the planner like any other system component that can change behaviour under load.
Plans also vary subtly across engine versions. Postgres 16 picks differently from Postgres 12 on certain joins. MySQL 8.0 added the optimiser hint syntax that didn't exist in 5.7. Always test plan-sensitive queries when upgrading.
If you make one habit
Run EXPLAIN ANALYZE before deploying any query that touches a table with more than 100,000 rows. The 60 seconds you spend reading the plan saves the next on-call engineer hours of investigation when the query starts misbehaving in production. This single habit, applied consistently, separates engineers who write fast code from those who write code that becomes fast after enough firefighting.
The simulator above showed you what to look for. The prose described what each piece means. The real production database is where the skill compounds — every plan read, every fix applied, every reasoned-through estimate-vs-actual gap builds the intuition you need.
Bookmark this page. Return when EXPLAIN ANALYZE confuses you. The skill compounds.
— end of page —
A few interactive resources to pair with this sim
explain.depesz.com: paste a Postgres EXPLAIN, get colour-coded breakdown. Free, reliable, no signup. The de-facto standard for sharing slow plans with colleagues.
explain.dalibo.com — alternative with a flame-graph-like layout. Some prefer it for complex plans with many subplans.
pgMustard — paid Postgres plan analyzer with concrete recommendations ("you'd benefit from an index on column X"). The recommendations are surprisingly good.
Postgres EXPLAIN.tips — community-maintained library of plans and their fixes. Useful when your plan matches a known pattern.
SQL Server Management Studio's plan viewer — graphical, with operator costs displayed proportionally. The most polished plan UI in any database.
pgcli / mycli — terminal clients with autocomplete; useful daily for running EXPLAIN ANALYZE interactively without leaving the shell.
Most production teams adopt one of these and standardise on it. The skill of reading plans transfers across tools quickly; the muscle memory of using a specific UI is what makes the practice fast enough to do habitually.
For learners: start with explain.depesz.com on real queries from your codebase. Within a month you'll recognise the patterns: stale stats, missing index, bad join, sort spill. Within three months you'll spot them at a glance.
Standing on shoulders
This page exists because of decades of work by query-optimiser researchers and database engineers. The System R team for the original cost-based-optimiser blueprint; the Postgres community for an extensible, open implementation that's been a teaching tool for thousands of engineers; Markus Winand for writing the indexing material that bridges theory and practice; depesz, dalibo, pganalyze, pgMustard, and every other tool team that makes plan output legible.
Errors and over-simplifications are mine. Send corrections via the about page contact details.
Now go open EXPLAIN ANALYZE on your slowest query.
— Semicolony, 2026-05-25
(Final closing note: every query that runs in your production database tonight is the descendant of the System R paper from 1979. Forty-seven years of incremental improvement, still recognisable as the same algorithm. Most things in computing don't age that well.)
(Truly last line: now go open EXPLAIN ANALYZE.)
(Page length: 26 sections of prose plus an interactive plan-tree simulator. Read time ~50 minutes for the whole thing; ~3 minutes for the simulator. Last updated 2026-05-25.)
(Two more closing thoughts, then truly done.)