10 / 14

Internals / 10

Buffer pool & page cache

The buffer pool is the in-memory layer that sits between the SQL executor and disk. A page in RAM is roughly 100,000× faster to read than the same page on an NVMe SSD, and the gap is even wider for spinning disks. Almost everything a database does to feel fast — eviction policy, prefetching, double-buffering, pinning — happens inside this cache.

The setup — pages, not rows

Databases don't read rows from disk, they read pages — fixed-size blocks that the storage layer treats as the atomic unit of I/O. Postgres uses 8 KB pages, MySQL InnoDB defaults to 16 KB, SQL Server uses 8 KB, Oracle is configurable (2–32 KB), and SQLite picks per-database between 512 B and 64 KB (4 KB default since 3.12).

The size is chosen to match the OS page (4 KB on Linux, 16 KB on Apple Silicon) and the SSD's native block. Read amplification on a small row is the cost of this choice: fetching one 100-byte tuple still drags an 8 KB page into the buffer pool. The buffer pool exists so the next request for any of the other tuples on that page is free.

Numerically: a DRAM read is ~80 ns; an NVMe page read is ~100 μs; a spinning-disk seek is ~10 ms. The buffer pool is the layer that turns the second number into the first.

Hit ratio — the metric that matters

The single number every DBA watches is the buffer pool hit ratio — the fraction of page requests served from RAM rather than disk. Healthy OLTP workloads run at 99% or higher. Warm analytics sit at 80–95%. Cold OLAP scans, where the working set is much larger than RAM, will happily run at 0–50% and there's not much you can do about it.

The non-linearity is brutal. A workload at 99.5% hit ratio that drops to 98.5% has tripled its disk reads. That's usually the difference between sub-millisecond and tens-of-milliseconds query latency, which is the difference between an app that feels instant and one users complain about.

Why hit ratio is everything. A 100 GB working set on a 96 GB buffer pool misses 4% of the time; the same workload on a 128 GB buffer pool misses 0%. Doubling RAM doesn't double performance — it can shift latency by a full order of magnitude. This is why most database sizing advice starts with "more memory" before anything else.

Eviction — LRU in the textbook, CLOCK in production

When the buffer pool is full and a new page needs to come in, something has to leave. Least-Recently-Used is the textbook answer: keep a doubly-linked list ordered by access time, evict the tail. In production this rarely ships unchanged — touching the list on every page access means taking a lock on a structure every query thread contends for, and the cache-line bouncing kills throughput on a 64-core box.

Postgres uses a clock-sweep variant: each buffer has a usage counter that increments on access (capped at ~5) and a sweep hand decrements counters as it looks for a victim with count zero. No global LRU list, no per-access list shuffle. MySQL InnoDB uses a midpoint-insertion LRU: newly-loaded pages enter the list 3/8 of the way down, not at the head, so a single big scan can't promote a flood of one-shot pages above the working set. Oracle uses a touch-count algorithm with similar intent.

Pinning — pages you can't evict

A page being actively read or written is pinned — its pin count is non-zero and the eviction logic must skip it. Pinning is short-lived in OLTP (microseconds per row fetch) but a long sequential scan that holds dozens of pages pinned can starve the rest of the workload of replacement candidates.

Postgres caps each backend's local pin count to keep one query from monopolising the pool; it surfaces as "BufferPin" wait events in pg_stat_activity. InnoDB does similar accounting via buf_pool->free_list and the LRU_old fraction. A workload that produces a lot of BufferPin waits is usually one with long-running cursors or a streaming replica that's fallen behind.

Buffer pool vs OS page cache

There are two designs in the wild. Single-buffered systems rely entirely on the OS page cache — they use buffered read()/write() and let the kernel manage what stays hot. SQLite, by default, works this way. The benefit is simplicity; the cost is giving up control over eviction, prefetch, and write ordering.

Double-buffered systems keep their own buffer pool and sit on top of the OS cache. Postgres is the canonical example: it asks the kernel for pages normally, so warm data is cached twice — once in shared_buffers, once in the page cache underneath. Memory is wasted on the duplication, but Postgres gets to evict based on its own knowledge of which pages back hot indexes.

Why MySQL uses O_DIRECT. InnoDB defaults to innodb_flush_method = O_DIRECT on Linux, which bypasses the OS page cache entirely. The reasoning: InnoDB already manages 16 KB pages in its own buffer pool with a smarter midpoint LRU than the kernel's, and double-buffering wastes RAM that could go to innodb_buffer_pool_size. The cost is losing the kernel's read-ahead for full table scans — InnoDB's own prefetcher has to make up the gap.

Cold-data pollution

The pathological case for any cache is a one-shot scan that's larger than the cache itself. A single SELECT * FROM big_table can evict every useful page, leaving the buffer pool full of data the next query will never touch. Hit ratio collapses, latency spikes for everything else, and the cache needs minutes to rewarm.

Postgres defends against this with a ring buffer for sequential scans — roughly 256 KB of dedicated buffers reused for the duration of the scan, so the rest of shared_buffers is untouched. Postgres 16 made this tunable per-query as BUFFER_USAGE_LIMIT. MySQL's midpoint LRU achieves the same goal differently: scan pages enter at the midpoint and only get promoted to the hot half if accessed again after innodb_old_blocks_time milliseconds (default 1000).

Sizing — why Postgres and MySQL disagree

The folk wisdom for shared_buffers is 25% of RAM, capped at maybe 8–16 GB on a dedicated box. For InnoDB the same wisdom says 50–75% of RAM, sometimes higher on a dedicated database server. The two recommendations look contradictory until you remember Postgres is double-buffered and InnoDB (with O_DIRECT) isn't.

Postgres at 25% leaves 75% of RAM for the OS page cache, which is the second tier of its cache hierarchy. Going higher starts to waste memory on the duplication. InnoDB at 75% leaves only enough for the kernel, the binlog, sort buffers, and connection memory — everything else is the buffer pool, because InnoDB is the cache.

Both numbers are starting points, not rules. The right answer is "as much as keeps the working set in RAM without making the OS swap or starving connections of work_mem", which you only find by measuring.

Prefetching — reading ahead of demand

A demand-fetch buffer pool waits for a page miss before issuing I/O. Prefetching lets the storage layer issue reads for pages it expects the query to want next. Sequential scans are the easy case — Postgres asks the OS via posix_fadvise(POSIX_FADV_SEQUENTIAL) and Linux readahead does the rest, fetching up to /sys/block/*/queue/read_ahead_kb (default 128 KB) ahead of the cursor.

Index scans are harder because the access pattern looks random to the storage layer. Postgres added effective_io_concurrency (default 1, set to 200+ on NVMe) so bitmap heap scans can issue parallel readaheads for the heap pages an index lookup points to. InnoDB has a separate linear and random read-ahead, both tunable via innodb_read_ahead_threshold.

Finding hot pages

Postgres exposes the buffer pool's contents through the pg_buffercache extension — one row per buffer, with relation, block number, usage count, and pin count. Joining against pg_class tells you which tables and indexes the cache is spending RAM on.

CREATE EXTENSION pg_buffercache;

SELECT  c.relname,
        count(*) AS buffers,
        pg_size_pretty(count(*) * 8192) AS size,
        round(100.0 * count(*) /
              (SELECT setting::int FROM pg_settings
               WHERE name = 'shared_buffers'), 1) AS pct_pool
FROM    pg_buffercache b
JOIN    pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 10;

 relname             | buffers | size    | pct_pool
---------------------+---------+---------+---------
 orders              |   48210 | 377 MB  |   18.4
 orders_user_id_idx  |   12044 |  94 MB  |    4.6
 line_items          |    9871 |  77 MB  |    3.8
 ...

Below the database, bpftrace and perf can show which OS-level pages are hot — useful when you suspect the kernel cache, not the database cache, is the bottleneck. MySQL exposes similar information through INFORMATION_SCHEMA.INNODB_BUFFER_PAGE, though the query is expensive enough that it shouldn't run on a busy primary.

Failure modes worth knowing

Production tuning of a buffer pool is mostly a fight against three things. Eviction storms happen when the workload's working set spills over the pool — every query becomes a miss, the clock sweep runs constantly, and CPU time goes to page replacement instead of query work. The fix is more RAM or a smaller working set (better indexes, partitioning).

Lock contention on the LRU list shows up on big-core machines — Postgres partitions its buffer-mapping locks into 128 partitions for exactly this reason, and InnoDB splits the buffer pool itself via innodb_buffer_pool_instances (default 8 when the pool is > 1 GB). Both are workarounds for the same problem.

False sharing on counters — usage counts and pin counts updated by many threads on the same cache line — quietly cap throughput on NUMA boxes. You won't find it in any config file; you find it in perf c2c or by reading the source.

Buffer pool config across engines

Engine	Page size	Buffer pool config	Default / typical
Postgres	8 KB	`shared_buffers`, double-buffered, clock-sweep	128 MB default; 25% of RAM typical
MySQL InnoDB	16 KB	`innodb_buffer_pool_size`, O_DIRECT, midpoint LRU	128 MB default; 50–75% of RAM typical
SQLite	4 KB (default)	`PRAGMA cache_size`, OS page cache only	2 MB default; usually left alone
SQL Server	8 KB	`max server memory`, dynamic, hashed LRU-K	Grows to fill RAM minus headroom
Oracle	2–32 KB	`DB_CACHE_SIZE`, touch-count, multiple pool sizes	Configured per workload

Buffer pool & page cache

The setup — pages, not rows

Hit ratio — the metric that matters

Eviction — LRU in the textbook, CLOCK in production

Pinning — pages you can't evict

Buffer pool vs OS page cache

Cold-data pollution

Sizing — why Postgres and MySQL disagree

Prefetching — reading ahead of demand

Finding hot pages

Failure modes worth knowing

Buffer pool config across engines

Further reading

03 — Write-ahead logging