Multi-region.
Most cloud architectures aren't multi-region, and most of them don't need to be. Multi-AZ inside one region covers ~99% of failure modes for a small fraction of the cost and complexity. Multi-region exists for two reasons: you have users on more than one continent and want the latency win, or you have a regulatory or business need for resilience to a whole region going down. Pick which one applies before you draw the architecture.
1 · The two reasons you'd actually do it
- Latency. Users in Europe shouldn't talk to a US-east origin for every page view. The transatlantic round trip is on the order of 100 ms however good the fibre, because distance sets the floor. Putting compute and data in a region near the user cuts the network round trip to single-digit milliseconds within the region, plus whatever the routing layer adds.
- Resilience to a whole-region outage. AWS us-east-1 has had a handful of well-known multi-hour incidents. If your business breaks when that happens, you need a story for serving from somewhere else. For most products this isn't worth the cost; for financial services, healthcare, and anything regulator-watched, it is.
What's not a good reason: vague "future-proofing." Multi-region is a 2–3× cost multiplier and a 3–5× operational multiplier. If your concrete use case is "we might want it someday," you almost certainly don't. Multi-AZ inside one region is the right starting point.
2 · Three multi-region shapes
| Shape | RTO / RPO | Cost multiplier | When to pick it |
|---|---|---|---|
| Pilot light | RTO: hours · RPO: minutes | ~1.2× | Cold DR region, data replicated async, compute scaled to zero. Cheapest. You're betting whole-region outages are rare enough to absorb a slow recovery. |
| Warm standby | RTO: minutes · RPO: seconds | ~1.5× | Reduced-capacity DR region, ready to scale on failover. Reasonable middle ground for serious B2B. |
| Active-active | RTO: ~0 · RPO: seconds | ~2–3× | Two or more regions serving traffic at the same time. Single-region failure is invisible to users. The model for consumer services at scale. |
RTO = Recovery Time Objective (how long you accept being down). RPO = Recovery Point Objective (how much data you accept losing). Both numbers should be in the architecture doc, not in someone's head.
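One way to keep those numbers out of someone's head is to encode the table and let the targets pick the shape. The sketch below is a minimal illustration: the cost multipliers come from the table, but the numeric bounds standing in for "hours", "minutes", and "seconds" are assumptions, as are the names `Shape` and `cheapest_shape`.

```python
from dataclasses import dataclass

# Coarse numeric stand-ins for the table's "hours / minutes / seconds" cells.
# The exact bounds are assumptions for illustration, not figures from the table.
HOUR, MINUTE, SECOND = 3600, 60, 1


@dataclass(frozen=True)
class Shape:
    name: str
    rto_s: int        # assumed worst-case recovery time, in seconds
    rpo_s: int        # assumed worst-case data-loss window, in seconds
    cost_mult: float  # cost multiplier vs. single-region (from the table)


SHAPES = [
    Shape("pilot light", rto_s=4 * HOUR, rpo_s=15 * MINUTE, cost_mult=1.2),
    Shape("warm standby", rto_s=15 * MINUTE, rpo_s=30 * SECOND, cost_mult=1.5),
    Shape("active-active", rto_s=0, rpo_s=30 * SECOND, cost_mult=2.5),
]


def cheapest_shape(max_rto_s, max_rpo_s):
    """Return the cheapest shape whose assumed RTO and RPO fit the targets, or None."""
    candidates = [s for s in SHAPES if s.rto_s <= max_rto_s and s.rpo_s <= max_rpo_s]
    return min(candidates, key=lambda s: s.cost_mult, default=None)


# A target of "back within 30 minutes, lose at most 1 minute of data".
choice = cheapest_shape(max_rto_s=30 * MINUTE, max_rpo_s=1 * MINUTE)
print(choice.name if choice else "no shape meets the targets")
```

A `None` result is the useful case: it means the stated RPO is tighter than asynchronous replication can promise, and the conversation should move to a synchronously replicated store or a looser target.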
3 · The AWS canonical version
| Layer | Service | What it does |
|---|---|---|
| Traffic routing | Route 53 (latency / weighted / failover) | Latency routing sends users to the closest region. Failover routing flips traffic on health-check failure. Weighted routing for blue/green. |
| Edge acceleration | Global Accelerator | Anycast IPs that route over AWS backbone to the nearest region. Lower latency variance than DNS-based routing. |
| Relational DB | Aurora Global Database | One writer region, ≤5 reader regions, <1 second cross-region lag. Promotes a reader on failover. |
| Relational DB (active-active) | Aurora DSQL (preview) | Distributed serverless SQL with active-active writes. |
| NoSQL | DynamoDB Global Tables | Multi-region multi-active. Last-writer-wins conflict resolution. |
| Object storage | S3 Cross-Region Replication | Async per-bucket replication to another region. Often paired with CloudFront so the user never knows which origin is alive. |
| Event bus | EventBridge cross-region | Replicate events from one bus to another region. |
| Caching | ElastiCache Global Datastore (Redis) | Cross-region Redis replication. Eventual consistency. |
| Network | Cloud WAN / Transit Gateway peering | Cross-region VPC connectivity. |
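The "last-writer-wins" cell in the NoSQL row hides the main trade-off of multi-active writes, so it is worth seeing once. The sketch below is a toy model of the idea, not DynamoDB's internals: `RegionReplica`, the millisecond timestamps, and the region-name tie-break are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Version:
    value: str
    timestamp_ms: int  # write time recorded by the region that accepted the write
    region: str


class RegionReplica:
    """Toy stand-in for one region's copy of a multi-active table."""

    def __init__(self, region: str) -> None:
        self.region = region
        self.items: Dict[str, Version] = {}

    def local_write(self, key: str, value: str, timestamp_ms: int) -> Version:
        version = Version(value, timestamp_ms, self.region)
        self.apply(key, version)
        return version

    def apply(self, key: str, incoming: Version) -> None:
        """Keep the incoming version only if it is newer than the one held."""
        current = self.items.get(key)
        # Tie-break on region name so every replica picks the same winner
        # (an assumption of this sketch).
        if current is None or (incoming.timestamp_ms, incoming.region) > (
            current.timestamp_ms,
            current.region,
        ):
            self.items[key] = incoming


us = RegionReplica("us-east-1")
eu = RegionReplica("eu-west-1")

# Two clients update the same item in different regions before replication catches up.
v_us = us.local_write("cart#42", "3 items", timestamp_ms=1_000)
v_eu = eu.local_write("cart#42", "empty", timestamp_ms=1_200)

# Asynchronous replication then delivers each write to the other region.
eu.apply("cart#42", v_us)
us.apply("cart#42", v_eu)

# Both replicas converge on the later write; the earlier one is silently discarded.
print(us.items["cart#42"].value, eu.items["cart#42"].value)
```

The replicas converge, which is the point of the scheme, but the US write disappears without an error. Data that cannot tolerate that (counters, balances, inventory) needs a single writer per item or a store with stronger guarantees.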
4 · GCP and Azure equivalents
| Concept | AWS | GCP | Azure |
|---|---|---|---|
| Traffic routing | Route 53 + Global Accelerator | Global LB (anycast) | Front Door / Traffic Manager |
| Globally strong SQL | Aurora DSQL (preview) | Spanner (mature) | Cosmos DB SQL API with strong consistency |
| Multi-region NoSQL | DynamoDB Global Tables | Firestore / Bigtable replication | Cosmos DB multi-region writes |
| Object cross-region | S3 CRR | GCS dual-region / multi-region buckets | Blob Storage GRS / RA-GRS |
| Network | TGW peering / Cloud WAN | Global VPC (single VPC spans regions) | Virtual WAN |
5 · Failover drills
The architecture isn't multi-region until you've failed over for real, at least once. Things that get caught in drills, in order of frequency:
- Stale DNS at the edge. TTLs you set to 300 seconds get cached for 30 minutes by some resolver in the wild. Plan failover with the longest TTL anyone could be holding.
- Region-specific config. Hardcoded ARNs, region-pinned bucket names, secrets that exist only in the active region. The drill exposes them; the runbook documents them.
- Workload imbalance. The standby region is 50% capacity. Failover hits it at 100% traffic. It melts. Either size both regions for full load or shed traffic during failover.
- Replication lag. Aurora Global lag spikes during failover. Reads from the promoted region miss the last few seconds of writes. Document the RPO honestly.
- Cross-region IAM. KMS keys and secrets are region-scoped by default, and IAM roles, although global, often carry policies that name region-specific ARNs. The "decrypt this in the DR region" step gets forgotten.
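Region-specific config is the one item on this list that can be partly caught before the drill. A minimal sketch, assuming config is available as text: scan it for ARNs pinned to the active region. The function name, the sample config, and the account and key IDs are hypothetical; the regular expression relies only on the documented `arn:partition:service:region:account-id:resource` layout.

```python
import re
from typing import List, Tuple

# ARNs follow arn:partition:service:region:account-id:resource.
# ARNs with an empty region field (IAM roles, for example) are not matched.
REGION_PINNED_ARN = re.compile(
    r"arn:aws[a-z-]*:[a-z0-9-]+:(?P<region>[a-z]{2}(?:-[a-z]+)+-\d+):\d*:\S+"
)


def find_region_pinned_arns(text: str, active_region: str) -> List[Tuple[int, str]]:
    """Return (line_number, arn) for every ARN pinned to the active region."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in REGION_PINNED_ARN.finditer(line):
            if match.group("region") == active_region:
                hits.append((lineno, match.group(0)))
    return hits


# Hypothetical config; the account ID, key ID, and resource names are made up.
SAMPLE_CONFIG = """\
queue_arn: arn:aws:sqs:us-east-1:123456789012:orders
kms_key: arn:aws:kms:us-east-1:123456789012:key/1234abcd-12ab-34cd-56ef-1234567890ab
deploy_role: arn:aws:iam::123456789012:role/deployer
dr_queue_arn: arn:aws:sqs:eu-west-1:123456789012:orders
"""

for lineno, arn in find_region_pinned_arns(SAMPLE_CONFIG, "us-east-1"):
    print(f"line {lineno}: {arn}")
```

A scan like this only finds what is written down as an ARN. Region names baked into bucket names, endpoints, or environment variables still need the drill to surface them.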
The drill cadence at most serious shops is quarterly. The first one is full of surprises; by the fourth or fifth, the runbook handles itself.
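Two of the drill findings, stale DNS and workload imbalance, reduce to arithmetic that can go in the runbook before the first drill. The sketch below is illustrative: the function names are made up, and the 30-second health-check interval and failure threshold of 3 are assumed values to replace with your own.

```python
def worst_case_dns_failover_s(check_interval_s, failure_threshold, record_ttl_s, resolver_overstay_s):
    """Upper bound on the time until the last client follows a DNS failover.

    Detection takes interval * threshold. After the record flips, a client can
    hold the old answer for the TTL, or longer if a resolver ignores the TTL.
    """
    detection_s = check_interval_s * failure_threshold
    cache_s = max(record_ttl_s, resolver_overstay_s)
    return detection_s + cache_s


def traffic_to_shed(standby_capacity_fraction, traffic_fraction=1.0):
    """Fraction of incoming traffic the standby cannot serve at its current size."""
    if traffic_fraction <= 0:
        return 0.0
    overflow = max(0.0, traffic_fraction - standby_capacity_fraction)
    return overflow / traffic_fraction


# TTL of 300 s, with some resolver holding the record for 30 minutes.
budget_s = worst_case_dns_failover_s(
    check_interval_s=30,      # assumed health-check interval
    failure_threshold=3,      # assumed consecutive failures before failover
    record_ttl_s=300,
    resolver_overstay_s=30 * 60,
)
print(f"worst-case DNS failover: {budget_s / 60:.1f} minutes")

# Standby sized at 50% of primary, hit with 100% of traffic.
print(f"traffic to shed: {traffic_to_shed(0.5):.0%}")
```

If the first number is larger than the RTO in the architecture doc, the RTO is wrong or the routing layer needs to change, and it is better to learn that from a calculation than from an outage.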
6 · Cost note
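- Cross-region data transfer. AWS charges roughly $0.02/GB for inter-region transfer between most region pairs; the exact rate depends on the source and destination. At that rate, a 10 TB/day replication stream is ~$6K/month in transfer alone.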
- Idle DR capacity. Pilot light is cheap; warm standby costs you 20–50% of primary; active-active doubles the compute bill.
- Multi-region managed DB premiums. Aurora Global adds ~20% to RDS cost per replica region. DynamoDB Global Tables is 1× per region (so 3 regions = 3×). Spanner cost scales with node count and region count.
A reasonable model for a serious B2B product: active-active across 2 regions, each sized to absorb the other's load; quarterly failover drills; total infra cost roughly 2–2.5× single-region. Worth it if the business case is real. Not worth it as a "good practice" without one.
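The transfer and idle-capacity lines above are easy to reproduce, which makes them easy to rerun with your own volumes. A minimal sketch: the $0.02/GB rate and the 20–50% standby range come from the text, while the function names and the $40,000 primary compute bill are hypothetical.

```python
GB_PER_TB = 1000             # decimal terabytes, which matches the ~$6K figure
TRANSFER_USD_PER_GB = 0.02   # rate quoted in the text; check current pricing
DAYS_PER_MONTH = 30


def monthly_transfer_usd(tb_per_day, usd_per_gb=TRANSFER_USD_PER_GB):
    """Monthly cross-region transfer cost for a steady replication stream."""
    return tb_per_day * GB_PER_TB * usd_per_gb * DAYS_PER_MONTH


def monthly_standby_compute_usd(primary_compute_usd, standby_fraction):
    """Extra compute spend for a second region sized at a fraction of primary."""
    return primary_compute_usd * standby_fraction


transfer = monthly_transfer_usd(10)
print(f"transfer, 10 TB/day: ${transfer:,.0f}/month")

primary_compute = 40_000  # hypothetical monthly compute bill for the primary region
for label, fraction in [("warm standby (low)", 0.2), ("warm standby (high)", 0.5), ("active-active", 1.0)]:
    extra = monthly_standby_compute_usd(primary_compute, fraction)
    print(f"{label}: +${extra:,.0f}/month compute")
```

Managed-database premiums are left out on purpose: they depend on write volume and replica count, and belong in a per-service estimate rather than a flat multiplier.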
Further reading
- "Disaster Recovery of Workloads on AWS" (whitepaper). The four-shape model (backup-restore, pilot light, warm standby, active-active) in detail.
- "Spanner: Becoming a SQL System" (SIGMOD 2017). The system-design rationale for globally distributed SQL.
- Adjacent: Availability patterns. The math behind nines, applied to multi-region.
- Adjacent: CAP & PACELC. The consistency trade-offs you're accepting with each shape.
- Adjacent: Cost engineering. The cross-region transfer line on the bill in detail.