What is the best book to learn database internals?

Two complementary books cover the field. Designing Data-Intensive Applications (Kleppmann, 2017) is the broader system-design view — replication, partitioning, transactions, distributed systems. Database Internals (Petrov, 2019) is the narrower storage-engine view — B-trees, LSM-trees, log management, recovery. Read DDIA first; Petrov second when you want the implementation details.

Should I learn SQL before learning database internals?

Yes — fluent SQL is the right starting point. EXPLAIN ANALYZE on real queries is how the abstract concepts (indexes, joins, optimiser stats) become concrete. Most database engineers find that 6-12 months of writing production SQL builds the intuition that makes internals study productive.

What is the difference between SQL and NoSQL databases?

SQL databases (Postgres, MySQL) enforce a relational schema and ACID transactions. NoSQL is an umbrella for: key-value (Redis, DynamoDB), document (MongoDB), wide-column (Cassandra, ScyllaDB), and graph (Neo4j, Neptune) stores. The trade-off is rarely "performance vs flexibility" — it's usually "schema enforcement at the database vs in the application." See the handbook entry on NoSQL databases for the four shapes and which fits which workload.

Database Internals Study Path — B-trees, LSM, MVCC, WAL, Replication

Why databases exist.

Two problems, repeated forever. Durability: when the power fails, the bytes you said you saved are still there. Concurrency: when ten thousand clients hit the same data at once, none of them sees a contradiction. Files can't do this on their own, and neither can the operating system. The database is the layer that buys both, with a surprisingly small set of primitives: pages, a write-ahead log, locks (or MVCC), and an index.

The relational model, Codd 1970, added a third gift: a logical schema independent of the physical layout. SQL queries describe what you want; the planner decides how to fetch it. Fifty years of competing models, hierarchical, network, object, document, key-value, graph, have all been absorbed back into relational engines (JSONB, GIN indexes, foreign tables, materialised views) without dethroning the core. Postgres in 2026 is the most capable database humans have ever built, and it ships under the same license it had in 1996.

When not to reach for "something webscale". Modern hardware: 1 TB RAM, 24 TB NVMe, 128 cores. A single Postgres instance handles tens of thousands of writes per second and terabytes of working set. The operational tax of distributed SQL (Spanner, CockroachDB, Vitess) is real. Use one node until it actually hurts, then add a read replica, then partition.

The twelve mental models you must build.

Twelve concepts cover ~95% of database surface. Get these in your bones in the first month. Every engine you meet (Postgres, MySQL, Cassandra, RocksDB, Spanner, ClickHouse) is a recombination of them.

01 B-tree Day-zero

Sorted, balanced, height ≈ log_B(N). The default index for read-heavy workloads — Postgres, MySQL/InnoDB, SQL Server. Range scans are cheap; random writes are not.

02 LSM-tree Practitioner

Log-structured merge-tree. Buffer writes in memory, flush sorted runs to disk, merge them in the background. Powers RocksDB, Cassandra, ScyllaDB, LevelDB — write-optimised by design.

03 Page / block Day-zero

The unit of disk I/O — typically 8 KB or 16 KB. Buffer pools, WAL records, indexes, heaps — every storage abstraction sits on top of pages.

04 Write-ahead log Practitioner

Append-only durability primitive. Write the log first, fsync, then update pages. Crash recovery replays the log. Every database, every replication protocol has one underneath.

05 MVCC Practitioner

Multi-version concurrency control. Each row has multiple visible versions tagged by transaction id. Readers don't block writers; writers don't block readers. The reason Postgres has VACUUM.

06 Isolation levels Practitioner

Read Uncommitted → Read Committed → Repeatable Read → Snapshot → Serializable. Each prevents one more concurrency anomaly. The default in most engines is Read Committed.

07 Locking Operator

Shared (S) and exclusive (X) locks at row, page, and table grain. Deadlock detection via wait-for graphs. Lock escalation when fine-grained locking exhausts memory.

08 Indexes Day-zero

B-tree (the default), hash (equality only), GIN (full-text, JSONB), GiST (geometric, range), bitmap (low-cardinality, analytics), partial (WHERE clause), expression (function output).

09 Query planner Practitioner

Cost-based optimiser. Reads statistics from pg_stats / information_schema, enumerates join orders, picks the lowest estimated cost. EXPLAIN ANALYZE shows what it actually did.

10 Replication Operator

Async (lossy on failover), sync (slow), semi-sync (one replica acks). Leader-follower (Postgres, MySQL) versus multi-master (Galera, BDR). Trade durability for latency.

11 Sharding & partitioning Operator

Split the dataset by hash, range, or geography. Range scans favour range partitioning; uniform load favours hash. Cross-shard joins and global indexes are the recurring pain.

12 Distributed consensus in DBs Engineer

Spanner uses Paxos under TrueTime. CockroachDB and TiDB use Raft per range. The reason a "distributed SQL" database can claim serialisable transactions across continents.

Day-zero — your first hour.

One hour. Open Designing Data-Intensive Applications and read chapters 3 through 7. Then run pgbench and read the EXPLAIN ANALYZE output of one of your own queries. The bar is muscle: you have read about storage and you have measured a real database under load.

# 1. Read DDIA chapters 3–7 (≈ 90 minutes)
#    — Storage and retrieval (B-trees, LSM-trees)
#    — Encoding and evolution
#    — Replication
#    — Partitioning
#    — Transactions

# 2. Spin up a local Postgres
docker run --name pg -e POSTGRES_PASSWORD=pw -p 5432:5432 -d postgres:16
docker exec -it pg psql -U postgres

# 3. Run pgbench — synthetic TPC-B-like workload
docker exec -it pg pgbench -i -s 10 postgres
docker exec -it pg pgbench -c 8 -T 30 postgres

# 4. EXPLAIN ANALYZE one query you actually run
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE user_id = 42;
#    — Read every line. Note Seq Scan vs Index Scan vs Bitmap Heap Scan.
#    — Note "actual time" vs "rows" — large mis-estimates mean stale stats.

# 5. (Optional, satisfying) read pg_stat_statements for the top-10 queries
CREATE EXTENSION pg_stat_statements;
SELECT query, calls, mean_exec_time
  FROM pg_stat_statements
  ORDER BY mean_exec_time DESC LIMIT 10;

Done. You have read the right five chapters, run a real benchmark, and looked at a real query plan. Everything below extends from this beachhead.

Week-1 to Month-3 — pick a track.

After the first hour you can read database writing without bouncing off it. Spend the next three months one track at a time, depth-first. Don't try to learn storage internals and distributed SQL in the same fortnight. Pick the track that maps to your job and finish it.

Storage internals

Pages, heaps, B-trees, LSM-trees, Bloom filters. Petrov's Database Internals is the right book. Open the InnoDB or Postgres source and follow a single SELECT through the buffer pool.

→ Reference

WAL & recovery

Write-ahead logging, checkpoints, ARIES (analysis / redo / undo). Read the ARIES paper twice. Crash a Postgres in a VM, watch it recover; read pg_waldump output line by line.

→ Reference

Concurrency control

MVCC, snapshot isolation, serializable snapshot isolation, two-phase locking. Read the Berenson / Adya papers. Reproduce the write-skew anomaly on snapshot isolation in your own database.

→ Reference

Query optimisation

System R's join enumeration. Cost models. Statistics. Plan stability and plan caching. EXPLAIN ANALYZE on real workloads. Reading Postgres' planner source is the deepest single jump you can make.

→ Reference

Replication & HA

Async, sync, semi-sync. Logical decoding. Failover, fencing, split-brain. Read pg_basebackup, set up streaming replication, kill the leader, observe the lag — then read the replication slot internals.

→ Reference

Distributed SQL

CockroachDB, TiDB, YugabyteDB, Spanner. Range-based sharding with per-range Raft. Distributed transactions via 2PC layered on consensus. Read the Spanner and CockroachDB papers in that order.

→ Reference

Analytics & columnar

C-Store / Vertica, ClickHouse, DuckDB, Snowflake. Columnar layouts, vectorised execution, compression. The OLAP world is a different planet — pick one engine and read it deeply.

→ Reference

The books that matter.

2017 · O'Reilly

Martin Kleppmann — Designing Data-Intensive Applications

DDIA. The narrative thread connecting storage, indexing, replication, partitioning, transactions, isolation. Chapters 3 and 7 are the ones that matter most for understanding internals.

2019 · O'Reilly

Alex Petrov — Database Internals

The best modern internals walkthrough. B-trees, LSM-trees, recovery, distributed-DB algorithms — the storage layer DDIA gestures at; Petrov works through it line by line.

2018 · MIT Press · free online

Stonebraker / Hellerstein — Readings in Database Systems (the "Red Book")

A curated, opinionated tour of the literature. Each chapter is a paper list with editorial commentary by Hellerstein and Stonebraker. Free online — the canonical syllabus.

2019 · McGraw-Hill

Silberschatz, Korth, Sudarshan — Database System Concepts

The foundational textbook. Drier than DDIA, broader than Petrov — relational algebra through to query optimisation, recovery, and distributed systems. The reference, not the page-turner.

1992 · Morgan Kaufmann

Gray & Reuter — Transaction Processing: Concepts and Techniques

The bible for transactions, recovery, and locking. Decades old, still not surpassed. If you implement an engine, you re-read this every year.

2012 · O'Reilly

Schwartz, Zaitsev, Tkachenko — High Performance MySQL (4th ed.)

The operator's book for MySQL. Storage engines, replication, query optimisation, observability. Even if you run Postgres, the operational instincts transfer.

2017 · self-published

Dimitri Fontaine — The Art of PostgreSQL

SQL as a programming language. Window functions, lateral joins, recursive CTEs, JSONB. The book that turns "I write Postgres queries" into "I think in Postgres".

Honourable mentions: PostgreSQL Internals (Egor Rogov, free online); SQL Performance Explained (Markus Winand — the print edition of Use-the-Index-Luke); Streaming Systems (Akidau et al — the data-pipeline complement to DDIA).

Courses, lectures, and where to actually learn.

Free

Paid (worth it)

Bradfield CS
Databases
Cohort-based, paper-driven. Pricey; the curated reading list and small-group discussion are the value. Bradfield's distributed-systems and OS courses pair well with this.

The paper canon.

Fourteen papers, roughly 1970 → 2017. Read them in order. The later ones cite the earlier ones constantly. Most are 8–20 pages.

01
1970 · Codd
A Relational Model of Data for Large Shared Data Banks
The founding paper. Relations, tuples, normal forms; the case for separating logical model from physical layout. Twelve pages that started the field.
02
1976 · Astrahan et al
System R: A Relational Approach to Database Management
IBM's prototype — the first relational DBMS. SQL, the cost-based optimiser, the storage manager. Almost every modern relational engine traces here.
03
1976 · Eswaran, Gray, Lorie, Traiger
The Notions of Consistency and Predicate Locks in a Database System
The original definition of degrees of consistency and predicate locking. Read it before reading anything else about isolation.
04
1976 · Gray et al
Granularity of Locks and Degrees of Consistency in a Shared Database
Multi-granularity locking and the four classic isolation degrees. The vocabulary every modern transaction manager still uses.
05
1992 · Mohan et al
ARIES: A Transaction Recovery Method
How WAL recovery actually works. Analysis, redo, undo. The thing every database does on crash recovery; the algorithm by which the WAL keeps its promise.
06
1995 · Berenson et al
A Critique of ANSI SQL Isolation Levels
The famous critique. ANSI's phenomena are ambiguous; snapshot isolation is missing; serialisable is the only honest level. Read this and you stop trusting "Repeatable Read".
07
1996 · O'Neil et al
The Log-Structured Merge-Tree (LSM-Tree)
The LSM paper. Buffer in memory, flush sorted runs, merge in the background. The architectural ancestor of LevelDB, RocksDB, Cassandra, ScyllaDB.
08
2000 · Adya
Generalized Isolation Level Definitions (PhD thesis chapter)
Adya's redefinition of isolation in terms of forbidden histories — implementation-independent, mathematically clean. The right vocabulary for arguing about isolation.
09
2005 · Stonebraker et al
C-Store: A Column-oriented DBMS
The columnar paper. Read columns, not rows; compress aggressively; vectorise execution. The architectural seed of Vertica, MonetDB, ClickHouse, every modern OLAP engine.
10
2006 · Chang et al
Bigtable: A Distributed Storage System for Structured Data
Wide-column store on GFS + Chubby. Tablets, SSTables, the LSM-style write path at planet scale. The grandparent of HBase, Cassandra, DynamoDB.
11
2007 · DeCandia et al
Dynamo: Amazon's Highly Available Key-value Store
Consistent hashing, vector clocks, sloppy quorums, hinted handoff. The paper every NoSQL system reaches back to; Cassandra, Riak, and DynamoDB are its descendants.
12
2012 · Thomson & Abadi
Calvin: Fast Distributed Transactions for Partitioned Database Systems
Deterministic transaction ordering as a primitive. Decide the order first, then run; the contrarian alternative to 2PC + consensus. FaunaDB descends directly from this.
13
2012 · Corbett et al
Spanner: Google's Globally-Distributed Database
Externally consistent transactions over a planet-scale database via TrueTime — bounded clock uncertainty backed by GPS + atomic clocks. The paper distributed-SQL is judged against.
14
2017 · Verbitski et al
Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases
Push the redo log into the storage layer; let storage do the page reconstruction. The cloud-native rewrite of MySQL/Postgres — six-way replicated, quorum-write logs.

Going further: FoundationDB: A Distributed Unbundled Transactional Key Value Store (Zhou et al, 2021); The Design and Implementation of MonetDB; the Snowflake SIGMOD 2016 paper (cloud OLAP, separation of storage and compute); the ClickHouse design paper. Pair every paper with at least one open-source engine that implements it.

Talks worth your evening.

Hands-on environments — where to actually run things.

Theory without something you can run is fragile. Each of these is a manageable way to make a database push back when you make a mistake.

Environment	Cost	Best for
CMU BusTub	Free	Andy Pavlo's teaching DB. Implement the buffer pool, B+tree index, query executor, and concurrency control in C++. Graded by a hidden test suite — the closest thing to a real engine you can build solo.
pgbench / sysbench	Free	Synthetic OLTP benchmarks. Pgbench ships with Postgres; sysbench supports MySQL, Postgres, and arbitrary Lua workloads. Use to characterise your hardware before you tune anything.
pg_stat_statements + EXPLAIN	Free	Operate on a real Postgres. Enable pg_stat_statements; rank queries by total time; EXPLAIN ANALYZE the top ten. The single most productive operational habit you can form.
HammerDB	Free, open-source	TPC-C and TPC-H workloads against Postgres, MySQL, Oracle, SQL Server. Heavier than pgbench; approximates a realistic OLTP load. Useful for comparing engines or hardware honestly.
Jepsen	Free, open-source	Aphyr's framework. Spin up a real cluster (Cassandra, Cockroach, Mongo, etc.), hammer it under partition / latency / message loss, check for consistency violations. Steeper than the others; closer to what production failure looks like.

Isolation levels — cheat sheet.

Five levels, five anomalies. Print this and tape it next to the monitor. The famous footnote is that Snapshot Isolation prevents phantoms but allows write skew. That's why "real" Serializable matters for some workloads, and why Postgres' Serializable Snapshot Isolation (SSI) is the pragmatic compromise: snapshot semantics, with dependency tracking to detect (and abort) the dangerous histories.

Level	Dirty read	Non-repeatable read	Phantom	Write skew	Lost update
Read Uncommitted	Possible	Possible	Possible	Possible	Possible
Read Committed	Prevented	Possible	Possible	Possible	Possible
Repeatable Read	Prevented	Prevented	Possible (spec)	Possible	Prevented (some)
Snapshot Isolation	Prevented	Prevented	Prevented	Possible	Prevented
Serializable	Prevented	Prevented	Prevented	Prevented	Prevented

Defaults: Postgres → Read Committed; MySQL/InnoDB → Repeatable Read (with phantom-prevention via gap locks); Oracle → Read Committed; SQL Server → Read Committed; Spanner / CockroachDB → Serializable. Most apps run at Read Committed and assume Serializable.

Common mistakes that ship to production.

Patterns every team writes at least once. Read these now so you recognise the shape later, when something on-call is misbehaving and the dashboard is unhelpful.

No indexes on join columns / FKs: Every join over a non-indexed column is a sequential scan. The DB will not warn you. EXPLAIN ANALYZE will show "Hash Join" with a child Seq Scan; that is your sign.
N+1 queries: Load 100 users, then issue 100 separate queries for their profiles. ORMs make this invisible by default. Eager-load joins or batch the second query — log per-request query counts and alert on the tail.
VARCHAR(MAX) / TEXT when you wanted a constraint: Storing email as TEXT means the DB will accept a paragraph. A length cap, a CHECK constraint, or a domain type enforces the contract at the layer that survives bugs in every other layer.
Wrong isolation level by accident: Postgres, MySQL, Oracle all default to Read Committed. Devs often write code that assumes serialisable. Half the "weird race condition in production" tickets are anomalies the dev didn't know could happen at RC.
Long transactions blocking VACUUM: Postgres can't reclaim row versions older than the oldest open transaction. A connection leaked open for a day → table bloats indefinitely → query plans degrade. Monitor pg_stat_activity for long idle-in-transaction sessions.
Connection pool exhaustion: App pool size × replicas easily exceeds DB max_connections. The DB doesn't scale with connections — each one is RAM and a backend process. Use PgBouncer; size the pool to (cores × 2) plus a little headroom.
Leading-wildcard LIKE queries: "WHERE name LIKE %smith%" cannot use a B-tree index; it is a sequential scan. Reach for a trigram index (pg_trgm) or full-text search; redesign the query before scaling the hardware.
Storing JSON when you should store columns: JSONB is for shape that varies row to row. If every row has the same keys, columns are faster, smaller, and check-constrainable. The inverse mistake — modelling a tag-list as 50 nullable columns — is also catastrophic.
ORM lazy-loading triggering unexpected queries: Iterating over a relation collection issues a query per element. The fix is per-ORM (eager-loading, prefetch, includes), but the symptom is universal: a request that should be one query becomes a thousand.
No WAL archive / PITR on production: Streaming replication is not a backup — it replicates DROP TABLE just as fast. Configure WAL archiving + base backups; rehearse point-in-time recovery on a real schedule. The first time you restore should not be the day you need to.

Practice deck.

Ten cards: the questions interviewers ask, the things that bite operators in production, and the trivia that separates "I use a database" from "I understand one".

Card 1 of 10

What does a B-tree give you that a hash index does not?

Suggested sequences

Reading progressions

Three ordered paths through this material — pick the one that matches where you are.

Path 01 · Storage

Storage internals

The structures that make databases fast: B-trees, WAL, and LSM trees from first principles.

Path 02 · Transactions

Transactions & isolation

ACID, the isolation levels, and why the same code behaves differently under different databases.

Path 03 · Distribution

Sharding & replication

How databases scale horizontally: sharding strategies, replication lag, and the CAP trade-offs.

Keep going.

Databases reward reading and re-reading. The same ARIES paper, read on day 30 and again on day 300, gives you different things. So will DDIA. So will every Pavlo lecture. The field is not large. It is dense, and it has been compounding for fifty years.

Pick one real engine and read its source for an afternoon. Postgres, SQLite, RocksDB, DuckDB, CockroachDB are all open. Pair what you read with the paper that inspired it. Then come back to your own queries, your own schemas, your own indexes. You will rewrite some of them.

→ Semicolony's paper index → Continue with the Semicolony Handbook → The people behind these papers → How CPU caches feed databases → How SSDs work (FTL, write amplification) → Other study paths

Databases,learned properly.

Why databases exist.

The twelve mental models you must build.

Day-zero — your first hour.

Week-1 to Month-3 — pick a track.

The books that matter.

Courses, lectures, and where to actually learn.

The paper canon.

Talks worth your evening.

Hands-on environments — where to actually run things.

Isolation levels — cheat sheet.

Common mistakes that ship to production.

Practice deck.

Reading progressions

Keep going.

Databases,
learned properly.