Sharding is irreversible and one-way. Once your data is partitioned by a key, changing that key requires migrating every row through a coordination dance that takes weeks and adds a footnote to every postmortem for years. The decision rule for sharding is mostly the rule for delaying it — almost everything you might shard for can be solved another way first.
The honest position: sharding is the last database lever, not the first. Read replicas, partitioning within a single host, denormalisation, caching, and just buying a bigger machine all come before sharding. Each of those costs maybe a week. Sharding costs a quarter, and the shape of the system never recovers from it.
The three signals that earn a shard
- Write throughput exceeds what one host can absorb.
- A modern Postgres or MySQL instance on decent hardware sustains 10–50k writes/sec depending on row size and indexes. If your peak write rate is below that, you do not need to shard for write throughput. If it's above, and you've already removed every avoidable write (denormalised tables, batched commits, write-through caches), then yes — shard.
- Storage exceeds what one host can hold reasonably.
- "Reasonably" means a working set that fits in RAM at ~10% of physical memory, with disk that's not 95% full. A 16-core, 256-GB box can comfortably hold a multi-TB Postgres database. Real cloud limits cap around 64 TB single-instance for managed services. Past those numbers, sharding is mechanical.
- A regional latency requirement that can't be met from one location.
- Read replicas in three regions get you 95% of multi-region read latency for free. Writes are different — if a user in Tokyo must write a row that's authoritative in us-east-1, the 150 ms RTT is the floor. Geo-sharding (each region owns a slice of the keyspace) is the cure. This is the rarest of the three reasons.
The cost a shard adds
- Cross-shard queries become hard.
- SELECT-with-JOIN across shards is no longer a query, it's a distributed transaction or a fan-out plus merge. Aggregations (COUNT, SUM, JOIN) require talking to every shard, then combining. Tools like Vitess, Citus, and CockroachDB exist precisely to hide this — but the abstraction leaks at the edges, especially on transactions.
- Resharding is the worst migration you'll ever do.
- Moving from 4 shards to 8 (or, worse, changing the shard key) requires copying every row in your largest tables while keeping the application running. Two months of plumbing, dual-writes, cutover, validation, rollback plans, on-call schedules. Pick the shard count generously the first time.
- Hot shards.
- The 80/20 rule applies in sharded systems too — 80% of writes go to 20% of shards. If your shard key has a power-law distribution (user_id, where 1% of users are 99% of traffic), one shard will saturate while others sleep. The standard fix is a hash partition on top of the natural key, which spreads load but disables range queries.
- Operational complexity multiplies.
- One database to back up, monitor, upgrade, secure. Now you have N. Every operation that was atomic becomes "do this on each shard." Even mature managed services (Vitess, Spanner) add cluster topology to your mental model.
The decision flowchart, in prose
Ask, in order:
- Is the bottleneck reads? If yes — add read replicas. Done. (Stop reading.)
- Is the bottleneck writes, on one specific table? If yes — partition within the same host (Postgres declarative partitioning, MySQL range partitions). The application sees one table; the storage engine spreads work. Often enough for 10× write growth.
- Is the bottleneck the working set fitting in RAM? If yes — vertical scale first. A 512 GB instance from a cloud provider is a phone call away. Cheap relative to a multi-quarter sharding project.
- Have you already removed every avoidable write via batching, async, caching, and denormalisation? If no — do that first.
- Is the dataset growing in a way that single-host limits will be hit within 12 months? If no — punt. Sharding now solves a future problem; until then, you're paying complexity tax for no benefit.
- You shard. Pick the key carefully — see below.
Picking the shard key
The single most consequential decision in the whole project. The shard key determines how data is distributed, which queries are cheap, which are expensive, and which cross shards. It is almost impossible to change afterwards.
- The key should be present in every common query.
- If your hottest query is "get all orders for user X" then user_id is the natural key. Every query that includes user_id stays on one shard; queries that don't have to fan out. If your top queries don't share a key, sharding will not help — they will all fan out.
- The key should have high cardinality.
- If the key has only 100 distinct values, you can have at most 100 shards before duplicates start sharing a shard arbitrarily. Aim for keys with millions of distinct values: user_id, account_id, customer_id are the usual suspects.
- The key should have a roughly uniform distribution.
- If 1% of your users generate 99% of traffic, sharding by user_id puts that 1% on one shard. Either hash the key (loses range queries) or add a secondary salt for hot accounts (operational complexity).
- Avoid time-based keys for OLTP.
- "Shard by created_at" puts all new writes on the same shard — the current month's partition. Works for OLAP / analytics; disasters for OLTP. Time-based partitioning is fine; time-based sharding is not.
The four alternatives that often win instead
- Vertical scale.
- A 256-core, 1 TB instance is now a normal cloud offering. The price is real but the operational simplicity is enormous. Pay for a year, ship features in the meantime, revisit when growth justifies it.
- Read replicas.
- If reads are the bottleneck and you can tolerate a few milliseconds of replication lag, two or three read replicas absorb most of the reads without changing the application. A few hours of work; a known operational pattern.
- Same-host partitioning.
- Postgres and MySQL both support per-table partitioning that splits one logical table across multiple physical sub-tables on the same host. Lock contention drops; vacuum runs faster; query planner skips irrelevant partitions. Almost no application changes.
- Offload to a purpose-built store.
- Maybe the issue isn't OLTP at all. Move analytics to ClickHouse / BigQuery, search to Elasticsearch, session data to Redis, blobs to S3. Each of those takes pressure off the primary database and rarely requires sharding the result.
If you do shard — pick the right shape
- Range sharding.
- Shard by ranges of the key (1–1M on shard 0, 1M–2M on shard 1, etc.). Preserves range queries on the key. Hot when new keys cluster (timestamps). Used by HBase, BigTable, Spanner.
- Hash sharding.
- Hash(key) → shard. Even distribution by construction. Range queries on the original key become full fan-outs. Used by Cassandra, DynamoDB, MongoDB by default, Vitess, Citus.
- Consistent hashing.
- Hash sharding with virtual nodes to make adding/removing shards cheap. Used by Cassandra rings, DynamoDB partitions, sharded Redis. Operationally smoother than plain hash.
- Directory-based (lookup table).
- A small service maps each key to its shard. Most flexible — you can rebalance individual keys. Most operational overhead — the lookup service is a new failure domain and a new latency hop.
- Distributed SQL.
- CockroachDB, Spanner, YugabyteDB hide sharding behind a SQL interface. You pay extra latency per query (5–20 ms typical), but the application sees a single database. Right when sharding is forced on you and the team doesn't want to manage Vitess.
What a defensible "yes, shard" looks like
- Reason. "Write throughput on orders table hit 18k/sec at peak, projected to 35k/sec within 9 months. Postgres on a 64-core box tops out around 25k for our write shape."
- Volume. "Orders table: 4 TB, growing 8 GB/day. Need to support 5× current peak."
- Shard key. "merchant_id. Present in 95% of queries. Cardinality ~250k merchants. Top-10 merchants get 12% of writes — acceptable skew."
- Topology. "16 shards initially via hash on merchant_id. Vitess. Each shard a 16-core Postgres replica set."
- Cross-shard plan. "Reporting queries via batch ETL to Snowflake. Application code disallows JOIN across merchants."
- Migration. "Dual-write phase for 6 weeks. Read-from-shard cutover after 48 hours of zero diff. Old database stays as backup for 30 days."
Common mistakes
- Sharding too early.
- Twitter spent two years sharding before they had to, then another two years undoing it. The complexity tax is real; you pay it every day. Wait until vertical scale fails.
- Picking a low-cardinality shard key.
- "Shard by country" works until your fourth country (US) has 80% of users. Cardinality matters more than apparent natural fit.
- Underestimating cross-shard queries.
- "All our queries have user_id in them." Then someone writes a leaderboard that aggregates across users. Then someone runs a report. Then customer support needs to look up by email. Audit your query log before committing to a key.
- Picking the shard count too small.
- Going from 4 to 8 is the same operational cost as going from 4 to 64. Pick a count that gives you headroom for 5× growth and don't reshard for years. Default to 16 or 32 even if 4 would technically fit.
- Forgetting reference data.
- Some tables (countries, currencies, product catalog) should not be sharded — they're read by every query on every shard. Either replicate them to every shard or keep them in an unsharded "global" database.
What to read next
- Sharding & partitioning · learn path
- The algorithmic detail behind the operational choices in this chapter.
- Distributed SQL internals · learn path
- How Spanner, CockroachDB, and YugabyteDB hide the sharding from the application.
- Database-sharding simulator · interactive
- Push on the partition strategies and see where the hot shards land.
- Database scaling · primer
- The broader pattern this chapter narrows down on.