
Multi-region.

Most cloud architectures aren't multi-region, and most of them don't need to be. Multi-AZ inside one region covers ~99% of failure modes for a small fraction of the cost and complexity. Multi-region exists for two reasons: you have users on more than one continent and want the latency win, or you have a regulatory or business need for resilience to a whole region going down. Pick which one applies before you draw the architecture.


1 · The two reasons you'd actually do it

  • Latency. Users in Europe shouldn't talk to a US-east origin for every page view. The transatlantic round-trip is roughly 70–120 ms depending on the city pair, and no amount of tuning gets under the speed-of-light floor. Putting compute and data in a region near the user cuts that to single-digit milliseconds of in-region network time, plus the cost of the routing decision.
  • Resilience to a whole-region outage. AWS us-east-1 has had a handful of well-known multi-hour incidents. If your business breaks when that happens, you need a story for serving from somewhere else. For most products this isn't worth the cost; for financial services, healthcare, and anything regulator-watched, it is.

What's not a good reason: vague "future-proofing." Multi-region is a 2–3× cost multiplier and a 3–5× operational multiplier. If your concrete use case is "we might want it someday," you almost certainly don't. Multi-AZ inside one region is the right starting point.

2 · Three multi-region shapes

| Shape | RTO / RPO | Cost multiplier | When to pick it |
|---|---|---|---|
| Pilot light | RTO: hours · RPO: minutes | ~1.2× | Cold DR region, data replicated async, compute scaled to zero. Cheapest. You're betting whole-region outages are rare enough to absorb a slow recovery. |
| Warm standby | RTO: minutes · RPO: seconds | ~1.5× | Reduced-capacity DR region, ready to scale on failover. Reasonable middle ground for serious B2B. |
| Active-active | RTO: ~0 · RPO: seconds | ~2–3× | Both (or more) regions serving traffic. Single-region failure is invisible to users. The model for consumer services at scale. |

RTO = Recovery Time Objective (how long you accept being down). RPO = Recovery Point Objective (how much data you accept losing). Both numbers should be in the architecture doc, not in someone's head.
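One way to keep those two numbers out of someone's head is to encode the table and let the targets pick the shape. The sketch below is illustrative only: the numeric readings of "hours", "minutes", "seconds" and "~0" are assumptions made for the example, and `Shape`, `SHAPES` and `cheapest_shape` are hypothetical names, not part of any library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Shape:
    name: str
    rto_seconds: float       # assumed worst-case time to recover
    rpo_seconds: float       # assumed worst-case window of lost writes
    cost_multiplier: float   # relative to a single-region deployment


# Assumed numeric readings of the table above; replace with your own figures.
SHAPES = [
    Shape("pilot light", rto_seconds=4 * 3600, rpo_seconds=15 * 60, cost_multiplier=1.2),
    Shape("warm standby", rto_seconds=15 * 60, rpo_seconds=30, cost_multiplier=1.5),
    Shape("active-active", rto_seconds=60, rpo_seconds=30, cost_multiplier=2.5),
]


def cheapest_shape(rto_target_s: float, rpo_target_s: float) -> Optional[Shape]:
    """Return the cheapest shape meeting both targets, or None if none does."""
    candidates = [
        s for s in SHAPES
        if s.rto_seconds <= rto_target_s and s.rpo_seconds <= rpo_target_s
    ]
    return min(candidates, key=lambda s: s.cost_multiplier) if candidates else None
```

Calling `cheapest_shape(30 * 60, 60)` with these assumed figures selects warm standby; a `None` result means the targets are tighter than any shape in the table can promise, which is worth finding out before the architecture review rather than after.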

3 · The AWS canonical version

| Layer | Service | What it does |
|---|---|---|
| Traffic routing | Route 53 (latency / weighted / failover) | Latency routing sends users to the closest region. Failover routing flips traffic on health-check failure. Weighted routing for blue/green. |
| Edge acceleration | Global Accelerator | Anycast IPs that route over AWS backbone to the nearest region. Lower latency variance than DNS-based routing. |
| Relational DB | Aurora Global Database | One writer region, ≤5 reader regions, <1 second cross-region lag. Promotes a reader on failover. |
| Relational DB (active-active) | Aurora DSQL (preview) | Distributed serverless SQL with active-active writes. |
| NoSQL | DynamoDB Global Tables | Multi-region multi-active. Last-writer-wins conflict resolution. |
| Object storage | S3 Cross-Region Replication | Async per-bucket replication to another region. Often paired with CloudFront so the user never knows which origin is alive. |
| Event bus | EventBridge cross-region | Replicate events from one bus to another region. |
| Caching | ElastiCache Global Datastore (Redis) | Cross-region Redis replication. Eventual consistency. |
| Network | Cloud WAN / Transit Gateway peering | Cross-region VPC connectivity. |
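To make the failover-routing row concrete, here is a minimal sketch that builds the change batch for a primary/secondary pair of Route 53 failover records. It only constructs the dictionary; nothing is sent to AWS. The domain, the IP addresses (documentation ranges) and the health-check ID are placeholders, and `failover_record` and `failover_change_batch` are hypothetical helper names.

```python
from typing import Optional


def failover_record(name: str, ip: str, role: str, set_id: str,
                    health_check_id: Optional[str] = None, ttl: int = 60) -> dict:
    """Build one UPSERT change for a Route 53 failover A record."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": set_id,   # distinguishes records sharing a name
        "Failover": role,
        "TTL": ttl,                # keep short: resolvers cache at least this long
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id is not None:
        # Route 53 serves the secondary when this check reports unhealthy.
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


def failover_change_batch(name: str, primary_ip: str, secondary_ip: str,
                          primary_health_check_id: str) -> dict:
    """Change batch with a health-checked primary and a fallback secondary."""
    return {
        "Comment": "failover pair (illustrative)",
        "Changes": [
            failover_record(name, primary_ip, "PRIMARY", "primary-region",
                            primary_health_check_id),
            failover_record(name, secondary_ip, "SECONDARY", "dr-region"),
        ],
    }


# Placeholder values throughout; substitute your own zone data.
batch = failover_change_batch("app.example.com.", "192.0.2.10", "198.51.100.10",
                              "placeholder-health-check-id")
```

With boto3 installed, a batch like this would be passed as `ChangeBatch` to the Route 53 client's `change_resource_record_sets` call along with your `HostedZoneId`. Real setups usually point at load balancers through alias records rather than raw IPs; plain A records keep the sketch short.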

4 · GCP and Azure equivalents

| Concept | AWS | GCP | Azure |
|---|---|---|---|
| Traffic routing | Route 53 + Global Accelerator | Global LB (anycast) | Front Door / Traffic Manager |
| Globally strong SQL | Aurora DSQL (preview) | Spanner (mature) | Cosmos DB SQL API with strong consistency |
| Multi-region NoSQL | DynamoDB Global Tables | Firestore / Bigtable replication | Cosmos DB multi-region writes |
| Object cross-region | S3 CRR | GCS dual-region / multi-region buckets | Blob Storage GRS / RA-GRS |
| Network | TGW peering / Cloud WAN | Global VPC (single VPC spans regions) | Virtual WAN |

Spanner is the standout here: globally linearisable SQL with single-digit-ms latency for in-region reads and ~100 ms for cross-region writes. It is the most mature managed database in this tier; the AWS entry in the same row is still in preview. If your workload needs strong consistency across regions, this is one of the few times GCP is the obvious pick even in an AWS shop.

5 · Failover drills

The architecture isn't multi-region until you've failed over for real, at least once. Things that get caught in drills, in order of frequency:

  • Stale DNS at the edge. TTLs you set to 300 seconds get cached for 30 minutes by some resolver in the wild. Plan failover around the longest cache lifetime any resolver could be holding, not the TTL you configured.
  • Region-specific config. Hardcoded ARNs, region-pinned bucket names, secrets that exist only in the active region. The drill exposes them; the runbook documents them.
  • Workload imbalance. The standby region is sized at 50% capacity. Failover sends it 100% of the traffic. It melts. Either size both regions for full load or shed traffic during failover.
  • Replication lag. Aurora Global lag spikes during failover. Reads from the promoted region miss the last few seconds of writes. Document the RPO honestly.
  • Cross-region IAM and keys. IAM roles are global, but KMS keys and secrets are region-scoped by default, and role policies often name region-specific ARNs. The "decrypt this in the DR region" step gets forgotten.
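The region-specific config item can be partly caught before the drill. The sketch below scans text for ARNs that carry a region component; it is a rough heuristic, not a complete ARN parser, and `find_region_pinned_arns` is a hypothetical helper name. ARNs with an empty region field, such as IAM roles and S3 buckets, are deliberately not flagged.

```python
import re
from typing import List, Tuple

# ARN layout: arn:partition:service:region:account-id:resource
ARN_WITH_REGION = re.compile(
    r"arn:aws[a-z-]*:"               # partition (aws, aws-cn, aws-us-gov)
    r"[a-z0-9-]+:"                   # service
    r"([a-z]{2}(?:-[a-z]+)+-\d+):"   # region, captured (e.g. us-east-1)
    r"\d{12}:"                       # 12-digit account id
    r"[\w./:*@+=,-]+"                # resource
)


def find_region_pinned_arns(text: str) -> List[Tuple[int, str, str]]:
    """Return (line_number, region, arn) for every ARN that names a region."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in ARN_WITH_REGION.finditer(line):
            hits.append((line_no, match.group(1), match.group(0)))
    return hits


# Made-up config for illustration; the account id and names are placeholders.
sample_config = (
    'QUEUE_ARN = "arn:aws:sqs:us-east-1:123456789012:orders"\n'
    'ROLE_ARN = "arn:aws:iam::123456789012:role/app"\n'
    'KEY_ARN = "arn:aws:kms:eu-west-1:123456789012:key/abcd"\n'
)
pinned = find_region_pinned_arns(sample_config)
```

On the sample, the SQS and KMS ARNs are reported and the IAM role is skipped. Each hit is a candidate for the runbook: either the resource needs a counterpart in the DR region, or the reference should be derived from the current region at runtime.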

The drill cadence at most serious shops is quarterly. The first one is full of surprises; by the fourth or fifth, the runbook handles itself.
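Two of the drill findings reduce to arithmetic that can sit in the runbook. The sketch below works through the stale-DNS window and the standby shortfall; the 30-second interval and three-failure threshold used for detection time are example health-check settings, and both function names are hypothetical.

```python
def worst_case_dns_cutover_s(detection_s: float, longest_cache_s: float) -> float:
    """Upper bound on how long some clients keep hitting the failed region.

    detection_s: time for health checks to declare the primary unhealthy.
    longest_cache_s: longest time any resolver may hold the old answer,
                     which can exceed the TTL you configured.
    """
    return detection_s + longest_cache_s


def shed_fraction(standby_capacity: float, failover_load: float) -> float:
    """Fraction of traffic to shed so the standby is not pushed past capacity.

    Both arguments use the same unit, e.g. a fraction of primary peak load.
    """
    if failover_load <= 0:
        return 0.0
    return max(0.0, 1.0 - standby_capacity / failover_load)


# Example inputs: 3 failed checks at a 30 s interval, a resolver caching for
# 30 minutes, and a standby sized at half of the load it will receive.
detection = 3 * 30
cutover = worst_case_dns_cutover_s(detection, 30 * 60)
to_shed = shed_fraction(standby_capacity=0.5, failover_load=1.0)
```

With these inputs the stale-DNS window is a little over half an hour and the standby has to shed half its traffic until it scales. Neither number is a surprise once written down, which is the point of writing them down.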

6 · Cost note

  • Cross-region data transfer. AWS charges about $0.02/GB for cross-region replication; the exact rate varies by region pair. Replicating 10 TB/day is ~$6K/month in transfer alone.
  • Idle DR capacity. Pilot light is cheap; warm standby costs you 20–50% of primary; active-active doubles the compute bill.
  • Multi-region managed DB premiums. Aurora Global adds ~20% to RDS cost per replica region. DynamoDB Global Tables is 1× per region (so 3 regions = 3×). Spanner cost scales with node count and region count.
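The transfer and per-region figures above are easy to rerun with your own volumes. A minimal sketch, assuming the $0.02/GB rate quoted above, decimal terabytes (1 TB = 1,000 GB) and a 30-day month; the function names and the $1,000 base cost are illustrative.

```python
def monthly_transfer_cost_usd(gb_per_day: float, usd_per_gb: float = 0.02,
                              days_per_month: int = 30) -> float:
    """Monthly cross-region transfer cost for a steady daily replication volume."""
    return gb_per_day * usd_per_gb * days_per_month


def per_region_cost_usd(single_region_cost_usd: float, regions: int) -> float:
    """Cost of a service billed at 1x per region, as with DynamoDB Global Tables."""
    return single_region_cost_usd * regions


# 10 TB/day at the assumed rate: about $6,000 per month.
transfer = monthly_transfer_cost_usd(10 * 1000)

# A table costing a hypothetical $1,000/month in one region, run in three.
tables = per_region_cost_usd(1000, regions=3)
```

Check the current price for your actual region pair before putting a number in a budget; the rate is the one input here most likely to differ.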

A reasonable model for a serious B2B product: active-active across 2 regions, each sized to absorb the other's traffic; quarterly failover drills; total infra cost roughly 2–2.5× single-region. Worth it if the business case is real. Not worth it as a "good practice" without one.

Further reading

  • "Disaster Recovery of Workloads on AWS" (whitepaper). The four-shape model (backup-restore, pilot light, warm standby, active-active) in detail.
  • "Spanner: Becoming a SQL System" (SIGMOD 2017). The system-design rationale for globally distributed SQL.
  • Adjacent: Availability patterns. The math behind nines, applied to multi-region.
  • Adjacent: CAP & PACELC. The consistency trade-offs you're accepting with each shape.
  • Adjacent: Cost engineering. The cross-region transfer line on the bill in detail.