14 / 14

Internals / 14

Index types

An index is a second copy of part of your data, organised for one specific question. Get the question right and a hundred-million-row table answers in microseconds. Get it wrong and the index does nothing on reads, doubles the cost of every write, and quietly rots the query planner's statistics until something falls over at 3am. This is a tour of every index type you're likely to meet: B-tree, hash, GIN, GiST, SP-GiST, BRIN, inverted, vector, bitmap. What each one is, what it stores, what queries it answers in O(log n) and what queries it makes worse, and when to reach for it.

What an index actually is

An index is a separate data structure that maps a value (or composite of values) to a location in the table: a page id and slot in heap-based engines, a primary key in clustered storage. It is, mathematically, a redundant copy of one column or expression, kept in a shape that supports a particular query.

Every index costs you something on writes: every INSERT, every UPDATE of an indexed column, and every DELETE has to update the index too. A table with five indexes spends roughly six write IOs per row change (one for the heap, one per index, sometimes less with HOT updates in Postgres). Indexes also cost RAM. The planner wants the hot pages of the index in the buffer pool, and a large index that does not fit there leaks the savings it was supposed to provide.

The question to ask before adding an index is never "would this query be faster" (almost always yes). It's "is this query frequent or critical enough that the steady-state write cost and RAM footprint are worth it". The good DBAs spend more time dropping indexes nobody is using than adding new ones.

B-tree — the default and almost always right

A balanced B+ tree keyed by the indexed expression. The leaf level is a sorted doubly linked list of (key, tuple-pointer) pairs; internal nodes route lookups by key range. Tree height is typically 3-5 even for tables with billions of rows, because fan-out is hundreds of keys per page. Lookup is O(log n) with constants small enough that a single point query fits in a handful of buffer-pool hits.

B-tree indexes answer four shapes of query well:

Equality. WHERE id = 42. Walk to the leaf, return the pointer. One to four buffer reads, microseconds.

Range. WHERE created_at BETWEEN '2026-01-01' AND '2026-01-31'. Walk to the leaf at the start of the range, follow the leaf chain. Cost is proportional to the size of the range, which is usually fine. Once the range is half the table, a sequential scan wins.

Sort. ORDER BY created_at DESC LIMIT 100. The index is already in sort order; walk the leaf chain backwards from the end. No actual sort happens.

Prefix on composite keys. A B-tree on (a, b, c) answers WHERE a = ? and WHERE a = ? AND b = ? and WHERE a = ? AND b = ? AND c > ?. It does not answer WHERE b = ? without scanning the whole index. The leftmost- prefix rule is the single most important thing to know about composite indexes.

-- composite index (a, b, c)
SELECT * FROM t WHERE a = 1;                          -- uses index
SELECT * FROM t WHERE a = 1 AND b = 2;                -- uses index
SELECT * FROM t WHERE a = 1 AND b = 2 AND c > 10;     -- uses index
SELECT * FROM t WHERE a = 1 AND c = 10;               -- uses (a) part only
SELECT * FROM t WHERE b = 2;                          -- does NOT use index
SELECT * FROM t WHERE c = 10;                         -- does NOT use index

The order of columns matters: (country, city, zip) answers different queries than (zip, city, country). Put the column most often filtered by equality first; put the column most often used in a range last. The B-tree deep dive in this section covers the page layout, split mechanics, and latching.

Hash — niche, rarely the right pick

A hash index stores (hash(key), pointer) in a hash table. Equality lookup is O(1): a single hash and a fetch. It cannot do ranges, sorting, prefix matching, or composite-key prefix matching. It does not order leaves, so it cannot speed up ORDER BY.

In Postgres, hash indexes used to be unsafe across crashes (they weren't WAL-logged until 10.0). They are crash-safe now, but the planner still rarely chooses them over a B-tree because B-trees do equality almost as fast and do everything else better. The case for a hash index is narrow: a table where the column is huge (say, a 2KB JSON blob keyed by itself), the only access is exact equality, and you can prove the B-tree is paying for ordering you don't need. In practice almost no one ships hash indexes.

In MySQL InnoDB, the "adaptive hash index" is a different beast: an in-memory cache that the engine maintains on top of the B+ tree to short-circuit repeated lookups of the same keys. It's automatic, can be disabled, and on heavily-mixed workloads sometimes hurts more than it helps. The MySQL world has spent two decades debating whether to leave it on.

GIN — the right index for arrays, jsonb, and tsvector

Generalized Inverted iNdex. Every key in the indexed expression (every array element, every json key, every word in a tsvector) becomes a posting list of tuples that contain it. A query then intersects posting lists.

GIN excels at "contains" queries on multi-valued columns. WHERE tags @> '{red,large}' on an array of tags, WHERE metadata @> '{"role": "admin"}' on a jsonb, or WHERE search_vector @@ 'database & index' for full-text. Each "element" is indexed once; a row with 20 tags creates 20 entries.

The cost is on writes. Updating a row with an indexed jsonb means rewriting every posting list the changed keys belong to. To make this tolerable, GIN buffers pending updates in a separate list and folds them into posting lists in the background; gin_pending_list_limit controls how aggressively. On write-heavy workloads with jsonb indexes, the pending list can grow large enough that queries start scanning it linearly, and you end up VACUUM'ing the index manually.

CREATE INDEX idx_tags ON products USING GIN (tags);
CREATE INDEX idx_metadata ON products USING GIN (metadata jsonb_path_ops);
CREATE INDEX idx_search ON articles USING GIN (to_tsvector('english', body));

-- jsonb_path_ops is smaller and faster for @>, but cannot do existence (?)
-- the default GIN jsonb_ops indexes more operators at the cost of size

GiST and SP-GiST — geometry, ranges, custom types

Generalized Search Tree. A framework, not a single algorithm: you implement the index behaviour (split, union, consistency check) for your data type, and the engine handles page management and concurrency. The cases where it ships in Postgres without extra work are geometry (PostGIS) and range types (tsrange, int4range, daterange).

GiST indexes answer questions about overlap, containment, and proximity. "Find every delivery whose service area contains this address" runs against a GiST index on the service-area geometry. "Find every meeting room booking that overlaps with 2-3pm" uses a GiST index on a tsrange. The cost model is messy because GiST trees are not balanced the way B-trees are; a poor split function can produce a lopsided tree that runs badly.

SP-GiST (Space-Partitioned GiST) is the sibling for non-overlapping space partitions: quad-trees, k-d trees, radix trees. Used for IPv4 ranges (inet operator class), prefix search on text (trgm_ops in pg_trgm), and some PostGIS workloads. The internal trees are unbalanced, which is fine for the workloads they target. The structure of the data determines the structure of the index.

GiST and SP-GiST are extension points, not finished indexes. If you adopt one, read which operator class you are using and what queries it actually accelerates; gist_text_ops is not the same as gist_trgm_ops, and confusing them creates indexes that look used but never get picked by the planner.

BRIN — block range indexes, for huge sequential tables

BRIN is a strange beast: it does not index every row. It indexes ranges of pages. For each block range (default 128 pages, ~1MB), BRIN stores the minimum and maximum value of the indexed column. A query for WHERE created_at BETWEEN ? AND ? reads the BRIN summary, finds the block ranges whose min/max overlap the query, and scans only those.

This works when the data is naturally clustered on the indexed column — most often when the column is monotonically increasing (timestamps, ids) and the table is append-only or append-mostly. On a 100GB table of time-series data with a BRIN on the timestamp, the BRIN itself is a few megabytes. A query for last week's data scans the right slice of the table directly. On the same table with a B-tree, the index might be 10GB, larger than your buffer pool, and the buffer-pool churn dominates.

BRIN's failure mode is data that is not clustered. If timestamps are sprinkled randomly through the table (say, a re-shuffled import), every block range's min/max covers the whole timestamp space, BRIN matches everything, and the query degrades into a full table scan with extra steps.

CREATE INDEX idx_events_t ON events USING BRIN (event_time) WITH (pages_per_range = 32);

-- BRIN on a 1-billion-row events table, clustered by event_time:
--   index size: ~3 MB
--   point query: scans 32 pages instead of millions
--   write cost: nearly zero (just update the summary)
-- BRIN on the same table, unclustered:
--   index size: ~3 MB
--   point query: scans entire table because every range overlaps everything

Bitmap indexes — analytical workloads

A bitmap index stores one bit per row per indexed value: column status with values {active, paused, canceled} on a billion-row table becomes three bitmaps of a billion bits each, compressed. Bitmap AND/OR over multiple columns is just bitwise AND/OR, which is enormously fast.

Bitmap indexes are the workhorse of analytical databases: Oracle DBs in the early 2000s, now mostly column-store warehouses (BigQuery, Snowflake, ClickHouse all build bitmap-like structures under the hood). They are not generally suitable for OLTP because updates are expensive: a single row update rewrites bitmap pages. Postgres builds bitmap indexes at query time. A query that combines two B-tree indexes builds an in-memory bitmap, ANDs them, then fetches matching rows. The bitmap is transient; the persistent indexes are B-trees.

If you find yourself wishing for a persistent bitmap index in Postgres, the answer is usually that the workload belongs in a warehouse, not the OLTP database.

Inverted indexes — the Lucene shape

An inverted index maps from a term to the list of documents containing that term. It is the canonical full-text index. Elasticsearch, OpenSearch, Solr, Tantivy, and Postgres GIN (with tsvector) all implement it with variations.

A document is tokenised (split into terms), filtered (lowercased, stop-worded, stemmed), and the resulting terms are added to per-term posting lists. Each posting holds the document id, often the term frequency, sometimes the positions within the document for phrase queries. Queries are then set operations over posting lists, weighted by relevance (BM25, the modern default).

Inverted indexes are large, typically 30-100% of the source data size after compression, and slow to update incrementally. Lucene's segment model handles this by keeping segments immutable: writes go to a new segment, queries union across segments, background compaction merges segments. Same idea as LSM-trees. Deletes are tombstones, applied at query time, removed at merge time.

Inside Postgres, GIN-on-tsvector gives you a respectable inverted index without leaving the database. It's enough for "search this 5GB articles table". It is not enough for "search a billion log lines with relevance ranking and faceting". That is where you reach for a dedicated engine.

Vector indexes — approximate nearest neighbour

A vector index speeds up the question "given this 768-dimensional vector, find the K closest vectors in this table". Brute force is O(n × d) per query: fine at thousands of rows, painful at millions, infeasible at billions. Vector indexes give up exactness for speed; the result is the approximate top-K (recall typically 95-99%).

Two families dominate in production:

HNSW (Hierarchical Navigable Small World). A multi-layer graph. The top layer has a few nodes with long-range links; lower layers have more nodes with increasingly local links; the bottom layer contains every vector. A query enters at the top, greedily walks toward the query vector at each layer, and refines on the bottom. Best recall-vs-latency in benchmarks, but indexes live entirely in RAM (no on-disk variant has matched it) and memory cost scales linearly with row count.

IVF (Inverted File) and IVF-PQ. Cluster the vectors into K coarse cells via k-means. Each cell holds the vectors that landed there. A query finds the few closest cells to the query vector and scans only those. Adding product quantisation (PQ) compresses each vector into a small code, so the index fits in RAM at the cost of recall. Best for very large indexes (hundreds of millions of vectors) where HNSW would not fit. Faiss started this family; pgvector exposes IVFFlat and IVF-PQ; Pinecone and Weaviate offer it alongside HNSW.

-- pgvector
CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops)
  WITH (lists = 100);

-- query
SELECT id, name FROM items
ORDER BY embedding <=> :query_vec
LIMIT 10;

Vector indexes are where the "index pays for ordering you don't need" inversion gets interesting. You almost never want exact nearest neighbour; you want "10 things roughly like this, fast". The /vs/ comparison on pgvector vs Pinecone vs Weaviate (when written) covers the cross-system trade-offs.

Partial, expression, and covering indexes

Three index modifiers that change cost dramatically without changing the index type.

Partial. Index only the rows that match a predicate. CREATE INDEX ON orders (created_at) WHERE status = 'pending'. If 1% of rows are pending, the index is 1% of its size. The catch: only queries that include the matching predicate use the index, and the planner has to prove the match. Keep the predicate simple; WHERE status IN ('pending','review') works, weird expressions tend not to.

Expression. Index a function of a column. CREATE INDEX ON users (lower(email)). The query has to use the same expression, WHERE lower(email) = ?, for the index to fire. Common uses: case-insensitive search, computed columns, date_trunc for time bucketing. Index size is roughly the same as a column index; write cost is slightly higher because of the function evaluation.

Covering (INCLUDE). Add non-key columns to a B-tree's leaf pages so the query can be answered entirely from the index without touching the heap (index-only scan). CREATE INDEX ON orders (user_id) INCLUDE (total_cents, currency). Reads involving only the indexed and included columns skip the heap entirely. Critical for hot queries on wide tables. The savings can be 10-100x because heap access is random IO and index leaves are sequential.

The choosing-an-index decision tree

A rough hierarchy that catches 90% of real-world cases:

Equality or range on a single scalar column? B-tree. Composite B-tree if multiple columns get queried together; leftmost-prefix rule applies.

Multi-valued column (array, jsonb, tsvector)? GIN. Use the more specific operator class (jsonb_path_ops, gin_trgm_ops) if you can — they are smaller and faster.

Geometry, range type, overlapping intervals? GiST or SP-GiST, with the operator class for the data type. Postgres ships gist_geometry_ops_2d for PostGIS, gist_range_ops for range types, gist_trgm_ops for fuzzy text.

Very large append-mostly table indexed by a clustered key (timestamp, sequential id)? BRIN. Verify clustering with pg_stats.correlation; if it is close to 1.0 or -1.0, BRIN works.

Full-text search at small to medium scale? GIN on tsvector. At very large scale, dedicated engine (Elasticsearch, Tantivy, OpenSearch).

Vector similarity? HNSW if RAM allows; IVF-PQ at very large scale. Both via pgvector for Postgres-native, or a dedicated vector store for billion-vector workloads.

Bitmap-like analytical aggregations on a column with low cardinality? Not Postgres. Reach for a warehouse, or use Postgres's transient bitmap combine of two B-trees, which is often good enough.

When indexes turn against you

Three failure modes that account for most index disasters.

Unused indexes that still cost on writes. Check pg_stat_user_indexes for indexes with idx_scan = 0. They are pure overhead. Drop them. A common pattern: an old composite index goes redundant when a newer index covers the same prefix; the planner switches; the old index sits there forever paying the write tax.

Index bloat. Long-running transactions hold back vacuum, dead tuples pile up in heap and index alike, indexes balloon to 3-5x their natural size, the buffer pool can no longer fit the hot pages, queries that used to be fast start hitting disk. Symptom: index size dwarfs the table; relfrozenxid lag is high. Cure: vacuum, sometimes REINDEX CONCURRENTLY.

Wrong column order on composite indexes. A composite on (user_id, created_at) answers WHERE user_id = ? in O(log n) and WHERE user_id = ? AND created_at > ? in O(log n). The same index on (created_at, user_id) answers neither well: queries by user_id alone do not use it, and queries that filter on user_id and range on created_at do a full index scan. The fix is to drop and recreate in the right order, which on big tables is a multi-hour operation that needs CREATE INDEX CONCURRENTLY.

Reading the plan

EXPLAIN ANALYZE tells you which indexes are actually used. The shapes to recognise:

Index Scan using idx_x on table — a B-tree (or GiST/GIN) index finds the rows, then the engine fetches each from the heap. Fast for small result sets.

Index Only Scan — the query was answered entirely from the index, no heap access. The dream case; usually means a covering index. Sometimes degraded by visibility map state: Postgres still has to confirm tuples are visible, and if the visibility map is stale, it hits the heap anyway.

Bitmap Heap Scan + Bitmap Index Scan — multiple indexes were combined into an in-memory bitmap, then heap pages were fetched in physical order. Common for queries with two or three filterable columns each with their own index.

Seq Scan — the planner decided indexes weren't worth it (either none matched, or the predicted selectivity was high enough that scanning was cheaper). Often correct; sometimes a sign of bad statistics. ANALYZE first, then question.

Statistics matter as much as indexes. A planner with stale or wrong row-count estimates will pick the wrong index. Postgres samples a few thousand rows per ANALYZE; for columns with skewed or multi-modal distributions, increase default_statistics_target (per-column or globally) and re-analyze. The query plan can shift overnight when an autoanalyze finally runs.

Index types

What an index actually is

B-tree — the default and almost always right

Hash — niche, rarely the right pick

GIN — the right index for arrays, jsonb, and tsvector

GiST and SP-GiST — geometry, ranges, custom types

BRIN — block range indexes, for huge sequential tables

Bitmap indexes — analytical workloads

Inverted indexes — the Lucene shape

Vector indexes — approximate nearest neighbour

Partial, expression, and covering indexes

The choosing-an-index decision tree

When indexes turn against you

Reading the plan

Further reading