Handbook · Vol. IV · 2026 Track I · The data layer · piece 2 of 4 Deep dive

Track I · The data layer

Database scaling.

Replication, sharding, partitioning, and the operational reality of each — why read replicas don't fix writes, why a sharding key is one-way, and what to do when shards rebalance.

Track I · The data layer
How data lives, scales, and recovers.
  1. Primer
    Database indexing
  2. Deep dive
    Database scaling
  3. Primer
    NoSQL databases
  4. Decision rule
    When to shard

Scaling the app layer is easy — copy it. Scaling the database is hard because it holds the one thing you can't lose. Three techniques solve almost everything: replicate, partition, shard.

Scaling the application layer is easy: add more stateless servers behind a load balancer. Scaling the database is the hardest problem in distributed systems, because the database holds your most valuable asset: data that must be durable, consistent, and available. This module covers the three fundamental techniques — replication, partitioning, and sharding — alongside the theoretical framework (CAP, PACELC) that governs every choice you'll make.

REPLICATION (left) · SHARDING (right) · ORTHOGONAL primary writes replica 1 replica 2 replica 3 copy of full data scales reads SHARDED BY user_id shard A user_id 0…M-1 shard B user_id M…2M-1 shard C user_id 2M…3M-1 A replicas (×3) B replicas (×3) C replicas (×3) each shard is itself replicated · combine to scale both axes
Replication scales reads and provides failover. Sharding scales writes and storage. Real systems do both: each shard is itself a replicated cluster.

The three techniques — and what each one solves

Replication
Copy the full dataset to N machines. Reads can hit any copy; writes go to the primary (or to all, depending on the model). Solves: read scale, durability, failover.
Partitioning
Split the data within a single database into independent units (per-table or per-tenant) on the same machine. Solves: index size, lock contention, vacuum/maintenance windows.
Sharding
Split the data across machines. Each machine holds a subset; routing logic decides which one. Solves: write scale and storage scale, the only two things replication doesn't.

Replication — the four flavours

Every replication system is a point on the trade-off curve between consistency and availability. Pick the model that matches the workload, not the one your database happens to default to.

Async (primary → replica)

Primary writes, returns success, replicates in the background. Fast writes, strong durability on primary. Replicas can lag seconds to minutes. Risk: if primary dies, recent writes are gone.

Semi-sync

Primary waits for at least one replica to ack before returning success. Adds one round-trip but bounds the data loss window. Default for most banks and PG-with-quorum setups.

Sync (quorum)

Primary waits for a quorum of replicas (e.g. 2-of-3). No data loss on single failures. Cost is write latency = max of N round-trips. Used in Spanner, CockroachDB, etcd.

Multi-primary

Any node accepts writes; conflicts are resolved later (last-write-wins, CRDTs, application logic). High availability, painful semantics. Use only when you cannot tolerate a single primary's RTT.

Read-after-write — the gotcha that catches everyone

Async replication creates a subtle bug. A user updates their profile, the write hits the primary, the response returns, the user refreshes the page — and the read goes to a replica that hasn't seen the write yet. The user sees their old name. Three patterns fix it:

  1. Read-your-writes from primary for a session window (5-30s) after a write. Cheap, works.
  2. Pin replica by version. The write returns the LSN; subsequent reads pass it; replicas only serve if they're caught up to that LSN.
  3. Synchronous replication for the affected tables. Pay the latency cost on writes, get strong consistency on reads. Worth it for billing, account state, anything users will check immediately.

Sharding — the four strategies

How you choose to split data across shards is a one-way decision. Migrating between shard keys is project-grade work that can take months. Pick once, pick well.

StrategyHot-spot riskRange scansResharding cost
Hash of keyLow (random spread)Bad — keys are scrambledHigh — most keys move when N changes
RangeHigh — recent rows skew to one shardExcellent — sequentialManageable — split a hot range
Consistent hashLowBadLow — only 1/N of keys move
Directory / lookupTunableTunableCheap — update directory

The shard key, the column you split on, must satisfy three constraints: high cardinality (many distinct values), even distribution (no single value dominates), and present in most queries (otherwise you scatter-gather across all shards on every read). user_id is the textbook good shard key for B2C systems. tenant_id for multi-tenant SaaS. created_at is a textbook bad one — every write goes to today's shard.

The cross-shard query problem

Once data is sharded, three things become hard. Joins across shards. Transactions across shards. Aggregations across shards. Each has a partial answer:

Cross-shard joins
Avoid by denormalising — duplicate the joined data into both shards. Or co-locate (use the same shard key for related tables). Or do the join in the application layer with two round-trips.
Cross-shard transactions
Use two-phase commit (slow, blocks on coordinator failure), the saga pattern (eventual consistency with compensating actions), or just don't — design the schema so logical transactions never cross shard boundaries.
Cross-shard aggregations
Scatter-gather: query all shards in parallel, merge results in the app layer. Acceptable for occasional dashboards, not for hot read paths. For analytics, ETL into a separate columnar store (Redshift, BigQuery, ClickHouse).

CAP and PACELC — the theory you actually need

The CAP theorem says: in the presence of a network partition, a distributed system must choose between Consistency and Availability. There is no choice about Partition tolerance — partitions happen, the question is what you do when they do. PACELC extends it: in the absence of a partition (Else), you still trade Latency vs Consistency. Every database lives somewhere on this map.

ExamplesWhen to use
CP / ECSpanner, CockroachDB, etcd, ZooKeeperMoney, inventory, identity, configuration. Anything where stale data is worse than slow data.
AP / ELCassandra, DynamoDB (default), RiakActivity feeds, IoT data, telemetry. Anything where availability matters more than perfect consistency.
CP / ELPostgres + sync replica, MySQL Group ReplicationThe pragmatic middle. Strong reads on the primary, fast reads on replicas with bounded staleness.

The hard cases

The hot shard. One celebrity user, one viral product, one popular tenant. They generate 100× the traffic of the average shard. Mitigations: split that one shard further, add a second-level cache for that key, or denormalise into a separate hot-path store.
Replication lag spikes. A long-running write blocks the WAL, replicas fall minutes behind, read-after-write breaks. Monitor lag continuously, alert at 5 seconds, page at 30. Avoid bulk DELETEs and DDL during peak hours; chunk them.
Resharding is harder than the first sharding. The first shard split is a clean greenfield problem. Splitting again — to add capacity, to fix a bad key — is online migration with consistent reads, dual writes, and weeks of vigilance. Pick the right shard key the first time.

Practical defaults

  1. Don't shard until you have to. A single Postgres or MySQL on a big box handles surprisingly large workloads. Replicas + careful indexing buy years.
  2. When you do shard, pick a key that is in 90% of queries. Without it, every read becomes a scatter-gather.
  3. Default replication: semi-sync, 2-of-3 quorum, with one cross-AZ replica for failover.
  4. Read-after-write protection on user-visible writes. The simplest version — read from primary for 30 seconds after the write — solves 95% of cases.
  5. Run a continuous monitor on replication lag. Alert before users notice.
  6. For analytics, replicate into a columnar warehouse and stop running OLAP queries against the OLTP primary.
Found this useful?