Why databases exist.
Two problems, repeated forever. Durability: when the power fails, the bytes you said you saved are still there. Concurrency: when ten thousand clients hit the same data at once, none of them sees a contradiction. Files can't do this on their own, and neither can the operating system. The database is the layer that buys both, with a surprisingly small set of primitives: pages, a write-ahead log, locks (or MVCC), and an index.
The relational model, Codd 1970, added a third gift: a logical schema independent of the physical layout. SQL queries describe what you want; the planner decides how to fetch it. Fifty years of competing models, hierarchical, network, object, document, key-value, graph, have all been absorbed back into relational engines (JSONB, GIN indexes, foreign tables, materialised views) without dethroning the core. Postgres in 2026 is the most capable database humans have ever built, and it ships under the same license it had in 1996.
When not to reach for "something webscale". Modern hardware: 1 TB RAM, 24 TB NVMe, 128 cores. A single Postgres instance handles tens of thousands of writes per second and terabytes of working set. The operational tax of distributed SQL (Spanner, CockroachDB, Vitess) is real. Use one node until it actually hurts, then add a read replica, then partition.
The twelve mental models you must build.
Twelve concepts cover ~95% of database surface. Get these in your bones in the first month. Every engine you meet (Postgres, MySQL, Cassandra, RocksDB, Spanner, ClickHouse) is a recombination of them.
Sorted, balanced, height ≈ log_B(N). The default index for read-heavy workloads — Postgres, MySQL/InnoDB, SQL Server. Range scans are cheap; random writes are not.
02 LSM-tree PractitionerLog-structured merge-tree. Buffer writes in memory, flush sorted runs to disk, merge them in the background. Powers RocksDB, Cassandra, ScyllaDB, LevelDB — write-optimised by design.
03 Page / block Day-zeroThe unit of disk I/O — typically 8 KB or 16 KB. Buffer pools, WAL records, indexes, heaps — every storage abstraction sits on top of pages.
04 Write-ahead log PractitionerAppend-only durability primitive. Write the log first, fsync, then update pages. Crash recovery replays the log. Every database, every replication protocol has one underneath.
05 MVCC PractitionerMulti-version concurrency control. Each row has multiple visible versions tagged by transaction id. Readers don't block writers; writers don't block readers. The reason Postgres has VACUUM.
06 Isolation levels PractitionerRead Uncommitted → Read Committed → Repeatable Read → Snapshot → Serializable. Each prevents one more concurrency anomaly. The default in most engines is Read Committed.
07 Locking OperatorShared (S) and exclusive (X) locks at row, page, and table grain. Deadlock detection via wait-for graphs. Lock escalation when fine-grained locking exhausts memory.
08 Indexes Day-zeroB-tree (the default), hash (equality only), GIN (full-text, JSONB), GiST (geometric, range), bitmap (low-cardinality, analytics), partial (WHERE clause), expression (function output).
09 Query planner PractitionerCost-based optimiser. Reads statistics from pg_stats / information_schema, enumerates join orders, picks the lowest estimated cost. EXPLAIN ANALYZE shows what it actually did.
10 Replication OperatorAsync (lossy on failover), sync (slow), semi-sync (one replica acks). Leader-follower (Postgres, MySQL) versus multi-master (Galera, BDR). Trade durability for latency.
11 Sharding & partitioning OperatorSplit the dataset by hash, range, or geography. Range scans favour range partitioning; uniform load favours hash. Cross-shard joins and global indexes are the recurring pain.
12 Distributed consensus in DBs EngineerSpanner uses Paxos under TrueTime. CockroachDB and TiDB use Raft per range. The reason a "distributed SQL" database can claim serialisable transactions across continents.
Day-zero — your first hour.
One hour. Open Designing Data-Intensive Applications and read chapters 3 through 7. Then run pgbench and read the EXPLAIN ANALYZE output of one of your own queries. The bar is muscle: you have read about storage and you have measured a real database under load.
# 1. Read DDIA chapters 3–7 (≈ 90 minutes) # — Storage and retrieval (B-trees, LSM-trees) # — Encoding and evolution # — Replication # — Partitioning # — Transactions # 2. Spin up a local Postgres docker run --name pg -e POSTGRES_PASSWORD=pw -p 5432:5432 -d postgres:16 docker exec -it pg psql -U postgres # 3. Run pgbench — synthetic TPC-B-like workload docker exec -it pg pgbench -i -s 10 postgres docker exec -it pg pgbench -c 8 -T 30 postgres # 4. EXPLAIN ANALYZE one query you actually run EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE user_id = 42; # — Read every line. Note Seq Scan vs Index Scan vs Bitmap Heap Scan. # — Note "actual time" vs "rows" — large mis-estimates mean stale stats. # 5. (Optional, satisfying) read pg_stat_statements for the top-10 queries CREATE EXTENSION pg_stat_statements; SELECT query, calls, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;
Done. You have read the right five chapters, run a real benchmark, and looked at a real query plan. Everything below extends from this beachhead.
Week-1 to Month-3 — pick a track.
After the first hour you can read database writing without bouncing off it. Spend the next three months one track at a time, depth-first. Don't try to learn storage internals and distributed SQL in the same fortnight. Pick the track that maps to your job and finish it.
Pages, heaps, B-trees, LSM-trees, Bloom filters. Petrov's Database Internals is the right book. Open the InnoDB or Postgres source and follow a single SELECT through the buffer pool.
→ ReferenceWrite-ahead logging, checkpoints, ARIES (analysis / redo / undo). Read the ARIES paper twice. Crash a Postgres in a VM, watch it recover; read pg_waldump output line by line.
→ ReferenceMVCC, snapshot isolation, serializable snapshot isolation, two-phase locking. Read the Berenson / Adya papers. Reproduce the write-skew anomaly on snapshot isolation in your own database.
→ ReferenceSystem R's join enumeration. Cost models. Statistics. Plan stability and plan caching. EXPLAIN ANALYZE on real workloads. Reading Postgres' planner source is the deepest single jump you can make.
→ ReferenceAsync, sync, semi-sync. Logical decoding. Failover, fencing, split-brain. Read pg_basebackup, set up streaming replication, kill the leader, observe the lag — then read the replication slot internals.
→ ReferenceCockroachDB, TiDB, YugabyteDB, Spanner. Range-based sharding with per-range Raft. Distributed transactions via 2PC layered on consensus. Read the Spanner and CockroachDB papers in that order.
→ ReferenceC-Store / Vertica, ClickHouse, DuckDB, Snowflake. Columnar layouts, vectorised execution, compression. The OLAP world is a different planet — pick one engine and read it deeply.
→ ReferenceThe books that matter.
DDIA. The narrative thread connecting storage, indexing, replication, partitioning, transactions, isolation. Chapters 3 and 7 are the ones that matter most for understanding internals.
The best modern internals walkthrough. B-trees, LSM-trees, recovery, distributed-DB algorithms — the storage layer DDIA gestures at; Petrov works through it line by line.
A curated, opinionated tour of the literature. Each chapter is a paper list with editorial commentary by Hellerstein and Stonebraker. Free online — the canonical syllabus.
The foundational textbook. Drier than DDIA, broader than Petrov — relational algebra through to query optimisation, recovery, and distributed systems. The reference, not the page-turner.
The bible for transactions, recovery, and locking. Decades old, still not surpassed. If you implement an engine, you re-read this every year.
The operator's book for MySQL. Storage engines, replication, query optimisation, observability. Even if you run Postgres, the operational instincts transfer.
SQL as a programming language. Window functions, lateral joins, recursive CTEs, JSONB. The book that turns "I write Postgres queries" into "I think in Postgres".
Honourable mentions: PostgreSQL Internals (Egor Rogov, free online); SQL Performance Explained (Markus Winand — the print edition of Use-the-Index-Luke); Streaming Systems (Akidau et al — the data-pipeline complement to DDIA).
Courses, lectures, and where to actually learn.
- Carnegie Mellon · 15-445Database SystemsAndy Pavlo's flagship course. Free lectures, slides, and the BusTub project — implement a buffer pool, B+tree, query executor, and concurrency control in C++. The single best self-study path in databases.
- Carnegie Mellon · 15-721Advanced Database SystemsPavlo's grad sequel. Storage, query optimisation, vectorised execution, modern OLAP. Each lecture pairs with a paper; together they map the modern frontier.
- Stanford · CS245Database System PrinciplesThe Stanford counterpart. Strong on transaction theory and recovery. Good complement to CMU's project-heavy approach.
- UC Berkeley · CS186Introduction to Database SystemsThe undergraduate Berkeley course. Free lectures and projects — RookieDB is the implementation track. Approachable on-ramp before tackling 15-445.
- Markus WinandUse the Index, Luke!A free book on SQL indexes — how the planner uses them, how to read EXPLAIN, why your query is slow. The single highest-use afternoon for working app developers.
The paper canon.
Fourteen papers, roughly 1970 → 2017. Read them in order. The later ones cite the earlier ones constantly. Most are 8–20 pages.
- 01 A Relational Model of Data for Large Shared Data Banks
The founding paper. Relations, tuples, normal forms; the case for separating logical model from physical layout. Twelve pages that started the field.
- 02 System R: A Relational Approach to Database Management
IBM's prototype — the first relational DBMS. SQL, the cost-based optimiser, the storage manager. Almost every modern relational engine traces here.
- 03 The Notions of Consistency and Predicate Locks in a Database System
The original definition of degrees of consistency and predicate locking. Read it before reading anything else about isolation.
- 04 Granularity of Locks and Degrees of Consistency in a Shared Database
Multi-granularity locking and the four classic isolation degrees. The vocabulary every modern transaction manager still uses.
- 05 ARIES: A Transaction Recovery Method
How WAL recovery actually works. Analysis, redo, undo. The thing every database does on crash recovery; the algorithm by which the WAL keeps its promise.
- 06 A Critique of ANSI SQL Isolation Levels
The famous critique. ANSI's phenomena are ambiguous; snapshot isolation is missing; serialisable is the only honest level. Read this and you stop trusting "Repeatable Read".
- 07 The Log-Structured Merge-Tree (LSM-Tree)
The LSM paper. Buffer in memory, flush sorted runs, merge in the background. The architectural ancestor of LevelDB, RocksDB, Cassandra, ScyllaDB.
- 08 Generalized Isolation Level Definitions (PhD thesis chapter)
Adya's redefinition of isolation in terms of forbidden histories — implementation-independent, mathematically clean. The right vocabulary for arguing about isolation.
- 09 C-Store: A Column-oriented DBMS
The columnar paper. Read columns, not rows; compress aggressively; vectorise execution. The architectural seed of Vertica, MonetDB, ClickHouse, every modern OLAP engine.
- 10 Bigtable: A Distributed Storage System for Structured Data
Wide-column store on GFS + Chubby. Tablets, SSTables, the LSM-style write path at planet scale. The grandparent of HBase, Cassandra, DynamoDB.
- 11 Dynamo: Amazon's Highly Available Key-value Store
Consistent hashing, vector clocks, sloppy quorums, hinted handoff. The paper every NoSQL system reaches back to; Cassandra, Riak, and DynamoDB are its descendants.
- 12 Calvin: Fast Distributed Transactions for Partitioned Database Systems
Deterministic transaction ordering as a primitive. Decide the order first, then run; the contrarian alternative to 2PC + consensus. FaunaDB descends directly from this.
- 13 Spanner: Google's Globally-Distributed Database
Externally consistent transactions over a planet-scale database via TrueTime — bounded clock uncertainty backed by GPS + atomic clocks. The paper distributed-SQL is judged against.
- 14 Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases
Push the redo log into the storage layer; let storage do the page reconstruction. The cloud-native rewrite of MySQL/Postgres — six-way replicated, quorum-write logs.
Going further: FoundationDB: A Distributed Unbundled Transactional Key Value Store (Zhou et al, 2021); The Design and Implementation of MonetDB; the Snowflake SIGMOD 2016 paper (cloud OLAP, separation of storage and compute); the ClickHouse design paper. Pair every paper with at least one open-source engine that implements it.
Talks worth your evening.
- Andy Pavlo · CMU15-445 lecture seriesPavlo's lectures on YouTube. Funny, sharp, paper-heavy. Watch in order with the slides open. The single highest-quality video resource on database internals.
- Bruce MomjianHow Postgres Stores Data on DiskBruce Momjian — Postgres core team since 1996 — walks through the on-disk format, the visibility map, MVCC, vacuum, and crash recovery. Slide decks freely available; videos on YouTube.
- Kyle Kingsbury · QCon / Strange LoopJepsen analysesAphyr's QCon and Strange Loop talks. Funny, technical, devastating. Watch one to internalise why your "serializable" database probably loses data under partition.
- Frank McSherryDifferential Dataflow / MaterializeThe principled streaming alternative — incrementally maintained relational queries with strong consistency. McSherry's talks reframe what a "view" can be.
- Pat HellandStanding on Distributed Shoulders of GiantsHelland is the most underrated voice in databases. Standing on Distributed Shoulders, Life Beyond Distributed Transactions, Immutability Changes Everything — watch all three.
- Martin Kleppmann · Strange Loop 2015Turning the database inside-outKleppmann's argument that the log is the primary, the database the derived view. The talk that re-framed how a generation thought about CDC, event sourcing, and stream processing.
Hands-on environments — where to actually run things.
Theory without something you can run is fragile. Each of these is a manageable way to make a database push back when you make a mistake.
| Environment | Cost | Best for |
|---|---|---|
| CMU BusTub | Free | Andy Pavlo's teaching DB. Implement the buffer pool, B+tree index, query executor, and concurrency control in C++. Graded by a hidden test suite — the closest thing to a real engine you can build solo. |
| pgbench / sysbench | Free | Synthetic OLTP benchmarks. Pgbench ships with Postgres; sysbench supports MySQL, Postgres, and arbitrary Lua workloads. Use to characterise your hardware before you tune anything. |
| pg_stat_statements + EXPLAIN | Free | Operate on a real Postgres. Enable pg_stat_statements; rank queries by total time; EXPLAIN ANALYZE the top ten. The single most productive operational habit you can form. |
| HammerDB | Free, open-source | TPC-C and TPC-H workloads against Postgres, MySQL, Oracle, SQL Server. Heavier than pgbench; approximates a realistic OLTP load. Useful for comparing engines or hardware honestly. |
| Jepsen | Free, open-source | Aphyr's framework. Spin up a real cluster (Cassandra, Cockroach, Mongo, etc.), hammer it under partition / latency / message loss, check for consistency violations. Steeper than the others; closer to what production failure looks like. |
Isolation levels — cheat sheet.
Five levels, five anomalies. Print this and tape it next to the monitor. The famous footnote is that Snapshot Isolation prevents phantoms but allows write skew. That's why "real" Serializable matters for some workloads, and why Postgres' Serializable Snapshot Isolation (SSI) is the pragmatic compromise: snapshot semantics, with dependency tracking to detect (and abort) the dangerous histories.
| Level | Dirty read | Non-repeatable read | Phantom | Write skew | Lost update |
|---|---|---|---|---|---|
| Read Uncommitted | Possible | Possible | Possible | Possible | Possible |
| Read Committed | Prevented | Possible | Possible | Possible | Possible |
| Repeatable Read | Prevented | Prevented | Possible (spec) | Possible | Prevented (some) |
| Snapshot Isolation | Prevented | Prevented | Prevented | Possible | Prevented |
| Serializable | Prevented | Prevented | Prevented | Prevented | Prevented |
Defaults: Postgres → Read Committed; MySQL/InnoDB → Repeatable Read (with phantom-prevention via gap locks); Oracle → Read Committed; SQL Server → Read Committed; Spanner / CockroachDB → Serializable. Most apps run at Read Committed and assume Serializable.
Common mistakes that ship to production.
Patterns every team writes at least once. Read these now so you recognise the shape later, when something on-call is misbehaving and the dashboard is unhelpful.
- No indexes on join columns / FKs
- Every join over a non-indexed column is a sequential scan. The DB will not warn you. EXPLAIN ANALYZE will show "Hash Join" with a child Seq Scan; that is your sign.
- N+1 queries
- Load 100 users, then issue 100 separate queries for their profiles. ORMs make this invisible by default. Eager-load joins or batch the second query — log per-request query counts and alert on the tail.
- VARCHAR(MAX) / TEXT when you wanted a constraint
- Storing email as TEXT means the DB will accept a paragraph. A length cap, a CHECK constraint, or a domain type enforces the contract at the layer that survives bugs in every other layer.
- Wrong isolation level by accident
- Postgres, MySQL, Oracle all default to Read Committed. Devs often write code that assumes serialisable. Half the "weird race condition in production" tickets are anomalies the dev didn't know could happen at RC.
- Long transactions blocking VACUUM
- Postgres can't reclaim row versions older than the oldest open transaction. A connection leaked open for a day → table bloats indefinitely → query plans degrade. Monitor pg_stat_activity for long idle-in-transaction sessions.
- Connection pool exhaustion
- App pool size × replicas easily exceeds DB max_connections. The DB doesn't scale with connections — each one is RAM and a backend process. Use PgBouncer; size the pool to (cores × 2) plus a little headroom.
- Leading-wildcard LIKE queries
- "WHERE name LIKE %smith%" cannot use a B-tree index; it is a sequential scan. Reach for a trigram index (pg_trgm) or full-text search; redesign the query before scaling the hardware.
- Storing JSON when you should store columns
- JSONB is for shape that varies row to row. If every row has the same keys, columns are faster, smaller, and check-constrainable. The inverse mistake — modelling a tag-list as 50 nullable columns — is also catastrophic.
- ORM lazy-loading triggering unexpected queries
- Iterating over a relation collection issues a query per element. The fix is per-ORM (eager-loading, prefetch, includes), but the symptom is universal: a request that should be one query becomes a thousand.
- No WAL archive / PITR on production
- Streaming replication is not a backup — it replicates DROP TABLE just as fast. Configure WAL archiving + base backups; rehearse point-in-time recovery on a real schedule. The first time you restore should not be the day you need to.
Practice deck.
Ten cards: the questions interviewers ask, the things that bite operators in production, and the trivia that separates "I use a database" from "I understand one".
Reading progressions
Three ordered paths through this material — pick the one that matches where you are.
The structures that make databases fast: B-trees, WAL, and LSM trees from first principles.
ACID, the isolation levels, and why the same code behaves differently under different databases.
How databases scale horizontally: sharding strategies, replication lag, and the CAP trade-offs.
Keep going.
Databases reward reading and re-reading. The same ARIES paper, read on day 30 and again on day 300, gives you different things. So will DDIA. So will every Pavlo lecture. The field is not large. It is dense, and it has been compounding for fifty years.
Pick one real engine and read its source for an afternoon. Postgres, SQLite, RocksDB, DuckDB, CockroachDB are all open. Pair what you read with the paper that inspired it. Then come back to your own queries, your own schemas, your own indexes. You will rewrite some of them.