Scaling the app layer is easy — copy it. Scaling the database is hard because it holds the one thing you can't lose. Three techniques solve almost everything: replicate, partition, shard.
Scaling the application layer is easy: add more stateless servers behind a load balancer. Scaling the database is the hardest problem in distributed systems, because the database holds your most valuable asset: data that must be durable, consistent, and available. This module covers the three fundamental techniques — replication, partitioning, and sharding — alongside the theoretical framework (CAP, PACELC) that governs every choice you'll make.
The three techniques — and what each one solves
- Replication
- Copy the full dataset to N machines. Reads can hit any copy; writes go to the primary (or to all, depending on the model). Solves: read scale, durability, failover.
- Partitioning
- Split the data within a single database into independent units (per-table or per-tenant) on the same machine. Solves: index size, lock contention, vacuum/maintenance windows.
- Sharding
- Split the data across machines. Each machine holds a subset; routing logic decides which one. Solves: write scale and storage scale, the only two things replication doesn't.
Replication — the four flavours
Every replication system is a point on the trade-off curve between consistency and availability. Pick the model that matches the workload, not the one your database happens to default to.
Primary writes, returns success, replicates in the background. Fast writes, strong durability on primary. Replicas can lag seconds to minutes. Risk: if primary dies, recent writes are gone.
Primary waits for at least one replica to ack before returning success. Adds one round-trip but bounds the data loss window. Default for most banks and PG-with-quorum setups.
Primary waits for a quorum of replicas (e.g. 2-of-3). No data loss on single failures. Cost is write latency = max of N round-trips. Used in Spanner, CockroachDB, etcd.
Any node accepts writes; conflicts are resolved later (last-write-wins, CRDTs, application logic). High availability, painful semantics. Use only when you cannot tolerate a single primary's RTT.
Read-after-write — the gotcha that catches everyone
Async replication creates a subtle bug. A user updates their profile, the write hits the primary, the response returns, the user refreshes the page — and the read goes to a replica that hasn't seen the write yet. The user sees their old name. Three patterns fix it:
- Read-your-writes from primary for a session window (5-30s) after a write. Cheap, works.
- Pin replica by version. The write returns the LSN; subsequent reads pass it; replicas only serve if they're caught up to that LSN.
- Synchronous replication for the affected tables. Pay the latency cost on writes, get strong consistency on reads. Worth it for billing, account state, anything users will check immediately.
Sharding — the four strategies
How you choose to split data across shards is a one-way decision. Migrating between shard keys is project-grade work that can take months. Pick once, pick well.
| Strategy | Hot-spot risk | Range scans | Resharding cost |
|---|---|---|---|
| Hash of key | Low (random spread) | Bad — keys are scrambled | High — most keys move when N changes |
| Range | High — recent rows skew to one shard | Excellent — sequential | Manageable — split a hot range |
| Consistent hash | Low | Bad | Low — only 1/N of keys move |
| Directory / lookup | Tunable | Tunable | Cheap — update directory |
The shard key, the column you split on, must satisfy three constraints: high cardinality (many distinct values), even distribution (no single value dominates), and present in most queries (otherwise you scatter-gather across all shards on every read). user_id is the textbook good shard key for B2C systems. tenant_id for multi-tenant SaaS. created_at is a textbook bad one — every write goes to today's shard.
The cross-shard query problem
Once data is sharded, three things become hard. Joins across shards. Transactions across shards. Aggregations across shards. Each has a partial answer:
- Cross-shard joins
- Avoid by denormalising — duplicate the joined data into both shards. Or co-locate (use the same shard key for related tables). Or do the join in the application layer with two round-trips.
- Cross-shard transactions
- Use two-phase commit (slow, blocks on coordinator failure), the saga pattern (eventual consistency with compensating actions), or just don't — design the schema so logical transactions never cross shard boundaries.
- Cross-shard aggregations
- Scatter-gather: query all shards in parallel, merge results in the app layer. Acceptable for occasional dashboards, not for hot read paths. For analytics, ETL into a separate columnar store (Redshift, BigQuery, ClickHouse).
CAP and PACELC — the theory you actually need
The CAP theorem says: in the presence of a network partition, a distributed system must choose between Consistency and Availability. There is no choice about Partition tolerance — partitions happen, the question is what you do when they do. PACELC extends it: in the absence of a partition (Else), you still trade Latency vs Consistency. Every database lives somewhere on this map.
| Examples | When to use | |
|---|---|---|
| CP / EC | Spanner, CockroachDB, etcd, ZooKeeper | Money, inventory, identity, configuration. Anything where stale data is worse than slow data. |
| AP / EL | Cassandra, DynamoDB (default), Riak | Activity feeds, IoT data, telemetry. Anything where availability matters more than perfect consistency. |
| CP / EL | Postgres + sync replica, MySQL Group Replication | The pragmatic middle. Strong reads on the primary, fast reads on replicas with bounded staleness. |
The hard cases
Practical defaults
- Don't shard until you have to. A single Postgres or MySQL on a big box handles surprisingly large workloads. Replicas + careful indexing buy years.
- When you do shard, pick a key that is in 90% of queries. Without it, every read becomes a scatter-gather.
- Default replication: semi-sync, 2-of-3 quorum, with one cross-AZ replica for failover.
- Read-after-write protection on user-visible writes. The simplest version — read from primary for 30 seconds after the write — solves 95% of cases.
- Run a continuous monitor on replication lag. Alert before users notice.
- For analytics, replicate into a columnar warehouse and stop running OLAP queries against the OLTP primary.