Database scaling — System Design Handbook

Scaling the app layer is easy — copy it. Scaling the database is hard because it holds the one thing you can't lose. Three techniques solve almost everything: replicate, partition, shard.

Scaling the application layer is easy: add more stateless servers behind a load balancer. Scaling the database is the hardest problem in distributed systems, because the database holds your most valuable asset: data that must be durable, consistent, and available. This module covers the three fundamental techniques — replication, partitioning, and sharding — alongside the theoretical framework (CAP, PACELC) that governs every choice you'll make.

Replication scales reads and provides failover. Sharding scales writes and storage. Real systems do both: each shard is itself a replicated cluster.

The three techniques — and what each one solves

Replication: Copy the full dataset to N machines. Reads can hit any copy; writes go to the primary (or to all, depending on the model). Solves: read scale, durability, failover.
Partitioning: Split the data within a single database into independent units (per-table or per-tenant) on the same machine. Solves: index size, lock contention, vacuum/maintenance windows.
Sharding: Split the data across machines. Each machine holds a subset; routing logic decides which one. Solves: write scale and storage scale, the only two things replication doesn't.

Replication — the four flavours

Every replication system is a point on the trade-off curve between consistency and availability. Pick the model that matches the workload, not the one your database happens to default to.

Async (primary → replica)

Primary writes, returns success, replicates in the background. Fast writes, strong durability on primary. Replicas can lag seconds to minutes. Risk: if primary dies, recent writes are gone.

Semi-sync

Primary waits for at least one replica to ack before returning success. Adds one round-trip but bounds the data loss window. Default for most banks and PG-with-quorum setups.

Sync (quorum)

Primary waits for a quorum of replicas (e.g. 2-of-3). No data loss on single failures. Cost is write latency = max of N round-trips. Used in Spanner, CockroachDB, etcd.

Multi-primary

Any node accepts writes; conflicts are resolved later (last-write-wins, CRDTs, application logic). High availability, painful semantics. Use only when you cannot tolerate a single primary's RTT.

Read-after-write — the gotcha that catches everyone

Async replication creates a subtle bug. A user updates their profile, the write hits the primary, the response returns, the user refreshes the page — and the read goes to a replica that hasn't seen the write yet. The user sees their old name. Three patterns fix it:

Read-your-writes from primary for a session window (5-30s) after a write. Cheap, works.
Pin replica by version. The write returns the LSN; subsequent reads pass it; replicas only serve if they're caught up to that LSN.
Synchronous replication for the affected tables. Pay the latency cost on writes, get strong consistency on reads. Worth it for billing, account state, anything users will check immediately.

Sharding — the four strategies

How you choose to split data across shards is a one-way decision. Migrating between shard keys is project-grade work that can take months. Pick once, pick well.

Strategy	Hot-spot risk	Range scans	Resharding cost
Hash of key	Low (random spread)	Bad — keys are scrambled	High — most keys move when N changes
Range	High — recent rows skew to one shard	Excellent — sequential	Manageable — split a hot range
Consistent hash	Low	Bad	Low — only 1/N of keys move
Directory / lookup	Tunable	Tunable	Cheap — update directory

The shard key, the column you split on, must satisfy three constraints: high cardinality (many distinct values), even distribution (no single value dominates), and present in most queries (otherwise you scatter-gather across all shards on every read). user_id is the textbook good shard key for B2C systems. tenant_id for multi-tenant SaaS. created_at is a textbook bad one — every write goes to today's shard.

The cross-shard query problem

Once data is sharded, three things become hard. Joins across shards. Transactions across shards. Aggregations across shards. Each has a partial answer:

Cross-shard joins: Avoid by denormalising — duplicate the joined data into both shards. Or co-locate (use the same shard key for related tables). Or do the join in the application layer with two round-trips.
Cross-shard transactions: Use two-phase commit (slow, blocks on coordinator failure), the saga pattern (eventual consistency with compensating actions), or just don't — design the schema so logical transactions never cross shard boundaries.
Cross-shard aggregations: Scatter-gather: query all shards in parallel, merge results in the app layer. Acceptable for occasional dashboards, not for hot read paths. For analytics, ETL into a separate columnar store (Redshift, BigQuery, ClickHouse).

CAP and PACELC — the theory you actually need

The CAP theorem says: in the presence of a network partition, a distributed system must choose between Consistency and Availability. There is no choice about Partition tolerance — partitions happen, the question is what you do when they do. PACELC extends it: in the absence of a partition (Else), you still trade Latency vs Consistency. Every database lives somewhere on this map.

	Examples	When to use
CP / EC	Spanner, CockroachDB, etcd, ZooKeeper	Money, inventory, identity, configuration. Anything where stale data is worse than slow data.
AP / EL	Cassandra, DynamoDB (default), Riak	Activity feeds, IoT data, telemetry. Anything where availability matters more than perfect consistency.
CP / EL	Postgres + sync replica, MySQL Group Replication	The pragmatic middle. Strong reads on the primary, fast reads on replicas with bounded staleness.

The hard cases

The hot shard. One celebrity user, one viral product, one popular tenant. They generate 100× the traffic of the average shard. Mitigations: split that one shard further, add a second-level cache for that key, or denormalise into a separate hot-path store.

Replication lag spikes. A long-running write blocks the WAL, replicas fall minutes behind, read-after-write breaks. Monitor lag continuously, alert at 5 seconds, page at 30. Avoid bulk DELETEs and DDL during peak hours; chunk them.

Resharding is harder than the first sharding. The first shard split is a clean greenfield problem. Splitting again — to add capacity, to fix a bad key — is online migration with consistent reads, dual writes, and weeks of vigilance. Pick the right shard key the first time.

Practical defaults

Don't shard until you have to. A single Postgres or MySQL on a big box handles surprisingly large workloads. Replicas + careful indexing buy years.
When you do shard, pick a key that is in 90% of queries. Without it, every read becomes a scatter-gather.
Default replication: semi-sync, 2-of-3 quorum, with one cross-AZ replica for failover.
Read-after-write protection on user-visible writes. The simplest version — read from primary for 30 seconds after the write — solves 95% of cases.
Run a continuous monitor on replication lag. Alert before users notice.
For analytics, replicate into a columnar warehouse and stop running OLAP queries against the OLTP primary.

Database scaling.

The three techniques — and what each one solves

Replication — the four flavours

Read-after-write — the gotcha that catches everyone

Sharding — the four strategies

The cross-shard query problem

CAP and PACELC — the theory you actually need

The hard cases

Practical defaults