Buffer pool & page cache
The buffer pool is the in-memory layer that sits between the SQL executor and disk. A page in RAM is roughly 100,000× faster to read than the same page on an NVMe SSD, and the gap is even wider for spinning disks. Almost everything a database does to feel fast — eviction policy, prefetching, double-buffering, pinning — happens inside this cache.
The setup — pages, not rows
Databases don't read rows from disk, they read pages — fixed-size blocks that the storage layer treats as the atomic unit of I/O. Postgres uses 8 KB pages, MySQL InnoDB defaults to 16 KB, SQL Server uses 8 KB, Oracle is configurable (2–32 KB), and SQLite picks per-database between 512 B and 64 KB (4 KB default since 3.12).
The size is chosen to match the OS page (4 KB on Linux, 16 KB on Apple Silicon) and the SSD's native block. Read amplification on a small row is the cost of this choice: fetching one 100-byte tuple still drags an 8 KB page into the buffer pool. The buffer pool exists so the next request for any of the other tuples on that page is free.
Numerically: a DRAM read is ~80 ns; an NVMe page read is ~100 μs; a spinning-disk seek is ~10 ms. The buffer pool is the layer that turns the second number into the first.
Hit ratio — the metric that matters
The single number every DBA watches is the buffer pool hit ratio — the fraction of page requests served from RAM rather than disk. Healthy OLTP workloads run at 99% or higher. Warm analytics sit at 80–95%. Cold OLAP scans, where the working set is much larger than RAM, will happily run at 0–50% and there's not much you can do about it.
The non-linearity is brutal. A workload at 99.5% hit ratio that drops to 98.5% has tripled its disk reads. That's usually the difference between sub-millisecond and tens-of-milliseconds query latency, which is the difference between an app that feels instant and one users complain about.
Eviction — LRU in the textbook, CLOCK in production
When the buffer pool is full and a new page needs to come in, something has to leave. Least-Recently-Used is the textbook answer: keep a doubly-linked list ordered by access time, evict the tail. In production this rarely ships unchanged — touching the list on every page access means taking a lock on a structure every query thread contends for, and the cache-line bouncing kills throughput on a 64-core box.
Postgres uses a clock-sweep variant: each buffer has a usage counter that increments on access (capped at ~5) and a sweep hand decrements counters as it looks for a victim with count zero. No global LRU list, no per-access list shuffle. MySQL InnoDB uses a midpoint-insertion LRU: newly-loaded pages enter the list 3/8 of the way down, not at the head, so a single big scan can't promote a flood of one-shot pages above the working set. Oracle uses a touch-count algorithm with similar intent.
Pinning — pages you can't evict
A page being actively read or written is pinned — its pin count is non-zero and the eviction logic must skip it. Pinning is short-lived in OLTP (microseconds per row fetch) but a long sequential scan that holds dozens of pages pinned can starve the rest of the workload of replacement candidates.
Postgres caps each backend's local pin count to keep one query from monopolising the pool;
it surfaces as "BufferPin" wait events in pg_stat_activity. InnoDB does
similar accounting via buf_pool->free_list and the LRU_old
fraction. A workload that produces a lot of BufferPin waits is usually one with
long-running cursors or a streaming replica that's fallen behind.
Buffer pool vs OS page cache
There are two designs in the wild. Single-buffered systems rely entirely
on the OS page cache — they use buffered read()/write() and let
the kernel manage what stays hot. SQLite, by default, works this way. The benefit is
simplicity; the cost is giving up control over eviction, prefetch, and write ordering.
Double-buffered systems keep their own buffer pool and sit on
top of the OS cache. Postgres is the canonical example: it asks the kernel for pages
normally, so warm data is cached twice — once in shared_buffers, once in the
page cache underneath. Memory is wasted on the duplication, but Postgres gets to evict
based on its own knowledge of which pages back hot indexes.
innodb_flush_method = O_DIRECT
on Linux, which bypasses the OS page cache entirely. The reasoning: InnoDB already manages
16 KB pages in its own buffer pool with a smarter midpoint LRU than the kernel's, and
double-buffering wastes RAM that could go to innodb_buffer_pool_size. The
cost is losing the kernel's read-ahead for full table scans — InnoDB's own prefetcher has
to make up the gap.Cold-data pollution
The pathological case for any cache is a one-shot scan that's larger than the cache
itself. A single SELECT * FROM big_table can evict every useful page,
leaving the buffer pool full of data the next query will never touch. Hit ratio
collapses, latency spikes for everything else, and the cache needs minutes to rewarm.
Postgres defends against this with a ring buffer for sequential scans —
roughly 256 KB of dedicated buffers reused for the duration of the scan, so the rest of
shared_buffers is untouched. Postgres 16 made this tunable per-query as
BUFFER_USAGE_LIMIT. MySQL's midpoint LRU achieves the same goal differently:
scan pages enter at the midpoint and only get promoted to the hot half if accessed again
after innodb_old_blocks_time milliseconds (default 1000).
Sizing — why Postgres and MySQL disagree
The folk wisdom for shared_buffers is 25% of RAM, capped at maybe 8–16 GB
on a dedicated box. For InnoDB the same wisdom says 50–75% of RAM, sometimes higher on a
dedicated database server. The two recommendations look contradictory until you remember
Postgres is double-buffered and InnoDB (with O_DIRECT) isn't.
Postgres at 25% leaves 75% of RAM for the OS page cache, which is the second tier of its cache hierarchy. Going higher starts to waste memory on the duplication. InnoDB at 75% leaves only enough for the kernel, the binlog, sort buffers, and connection memory — everything else is the buffer pool, because InnoDB is the cache.
Both numbers are starting points, not rules. The right answer is "as much as keeps the working set in RAM without making the OS swap or starving connections of work_mem", which you only find by measuring.
Prefetching — reading ahead of demand
A demand-fetch buffer pool waits for a page miss before issuing I/O. Prefetching lets the
storage layer issue reads for pages it expects the query to want next. Sequential scans
are the easy case — Postgres asks the OS via posix_fadvise(POSIX_FADV_SEQUENTIAL)
and Linux readahead does the rest, fetching up to /sys/block/*/queue/read_ahead_kb
(default 128 KB) ahead of the cursor.
Index scans are harder because the access pattern looks random to the storage layer.
Postgres added effective_io_concurrency (default 1, set to 200+ on NVMe) so
bitmap heap scans can issue parallel readaheads for the heap pages an index lookup
points to. InnoDB has a separate linear and random
read-ahead, both tunable via innodb_read_ahead_threshold.
Finding hot pages
Postgres exposes the buffer pool's contents through the pg_buffercache
extension — one row per buffer, with relation, block number, usage count, and pin count.
Joining against pg_class tells you which tables and indexes the cache is
spending RAM on.
CREATE EXTENSION pg_buffercache;
SELECT c.relname,
count(*) AS buffers,
pg_size_pretty(count(*) * 8192) AS size,
round(100.0 * count(*) /
(SELECT setting::int FROM pg_settings
WHERE name = 'shared_buffers'), 1) AS pct_pool
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 10;
relname | buffers | size | pct_pool
---------------------+---------+---------+---------
orders | 48210 | 377 MB | 18.4
orders_user_id_idx | 12044 | 94 MB | 4.6
line_items | 9871 | 77 MB | 3.8
...Below the database, bpftrace and perf can show which OS-level
pages are hot — useful when you suspect the kernel cache, not the database cache, is the
bottleneck. MySQL exposes similar information through
INFORMATION_SCHEMA.INNODB_BUFFER_PAGE, though the query is expensive enough
that it shouldn't run on a busy primary.
Failure modes worth knowing
Production tuning of a buffer pool is mostly a fight against three things. Eviction storms happen when the workload's working set spills over the pool — every query becomes a miss, the clock sweep runs constantly, and CPU time goes to page replacement instead of query work. The fix is more RAM or a smaller working set (better indexes, partitioning).
Lock contention on the LRU list shows up on big-core machines — Postgres
partitions its buffer-mapping locks into 128 partitions for exactly this reason, and
InnoDB splits the buffer pool itself via innodb_buffer_pool_instances
(default 8 when the pool is > 1 GB). Both are workarounds for the same problem.
False sharing on counters — usage counts and pin counts updated by many
threads on the same cache line — quietly cap throughput on NUMA boxes. You won't find it
in any config file; you find it in perf c2c or by reading the source.
Buffer pool config across engines
| Engine | Page size | Buffer pool config | Default / typical |
|---|---|---|---|
| Postgres | 8 KB | shared_buffers, double-buffered, clock-sweep | 128 MB default; 25% of RAM typical |
| MySQL InnoDB | 16 KB | innodb_buffer_pool_size, O_DIRECT, midpoint LRU | 128 MB default; 50–75% of RAM typical |
| SQLite | 4 KB (default) | PRAGMA cache_size, OS page cache only | 2 MB default; usually left alone |
| SQL Server | 8 KB | max server memory, dynamic, hashed LRU-K | Grows to fill RAM minus headroom |
| Oracle | 2–32 KB | DB_CACHE_SIZE, touch-count, multiple pool sizes | Configured per workload |
Further reading
- Suzuki — The Internals of PostgreSQL
— chapter 8 walks through
shared_buffers, the buffer descriptor array, and the clock-sweep algorithm with diagrams. - Jeremy Cole — On learning InnoDB — the deepest blog series anywhere on InnoDB internals; the buffer pool, LRU, and flush list posts are essential.
- Hellerstein, Stonebraker, Hamilton — Architecture of a Database System — section 4 on the buffer manager is the classical reference, and still the cleanest explanation of why double-buffering is a defensible choice.
- Postgres docs — pg_buffercache — the extension, its columns, and the warning about its overhead on busy systems.
- MySQL docs — The InnoDB Buffer Pool — configuration knobs, midpoint LRU details, and the buffer pool dump/restore feature that survives restarts.
- Semicolony — LSM-trees
— the other half of the cache story: LSM engines have their own block cache (RocksDB's
BlockCache) with the same hit-ratio dynamics as a B-tree buffer pool. - Semicolony — The Linux page cache
— what the kernel is doing underneath a double-buffered database, and why dirty-page
flushing interacts with
fsync()latency.