Scale to millions on AWS

The classic question. Start with one box. End with a multi-region system serving tens of millions. The trick isn't naming every AWS service — it's knowing what breaks at each step, what you'd reach for next, and being honest about the shortcuts you'd take in real life.

1 · The premise

A small product launches. One Python web app, one Postgres database. From there, you grow. Each stage below picks up where the last one breaks and talks about the next move. The numbers are rough — what matters is the order of magnitude, not the dollar amount.

Stage 0	1 user (you).
Stage 1	100 users.
Stage 2	10K users.
Stage 3	100K users.
Stage 4	1M users.
Stage 5	10M users.
Stage 6	100M users.
Stage 7	Multi-region, full DR.

2 · Stage 0 — One EC2 instance

One t3.small, nginx out front, Postgres on the same box. Route 53 points your domain at the instance's Elastic IP. About $30 a month. The whole architecture fits in a paragraph, and at this point anything more is overkill. Just ship features.

What breaks first. Your laptop's WiFi when you tail the logs.

3 · Stage 1 — 100 users

Move the database off the box. RDS for Postgres, app stays on EC2. Now you get automated backups, point-in-time recovery, and your data survives an accidental rm -rf. Set up CloudWatch alarms on CPU and disk. Pay for a cheap uptime monitor. About $120 a month.
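A minimal sketch of the two alarms mentioned above, written as plain dictionaries of arguments for CloudWatch's put_metric_alarm call. The thresholds, alarm names, and the SNS topic ARN are assumptions for illustration. The disk alarm watches the RDS FreeStorageSpace metric (reported in bytes), since EC2 does not publish disk usage without the CloudWatch agent.

```python
def cpu_alarm(instance_id, sns_topic_arn, threshold_pct=80.0):
    """Arguments for an alarm that fires when EC2 CPU stays above a threshold."""
    return {
        "AlarmName": f"{instance_id}-cpu-high",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,                # seconds per datapoint
        "EvaluationPeriods": 2,       # two bad datapoints in a row
        "Threshold": threshold_pct,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


def low_disk_alarm(db_instance_id, sns_topic_arn, min_free_gib=5.0):
    """Arguments for an alarm that fires when RDS free storage drops too low."""
    return {
        "AlarmName": f"{db_instance_id}-disk-low",  # hypothetical naming scheme
        "Namespace": "AWS/RDS",
        "MetricName": "FreeStorageSpace",           # value is in bytes
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": db_instance_id}],
        "Statistic": "Average",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": min_free_gib * 1024 ** 3,
        "ComparisonOperator": "LessThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }
```

With boto3 installed and credentials configured, each dictionary would be passed as boto3.client("cloudwatch").put_metric_alarm(**cpu_alarm(...)).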

What breaks first. The one EC2 dies mid-deploy, or runs out of memory and takes the site with it.

4 · Stage 2 — 10K users

Push static files to S3, put CloudFront in front. HTML, JS, CSS, images — all of it. Your app server stops serving any of those bytes; the CDN handles it. Add a second EC2 behind an ALB so one box can die without taking the site down. Make the app stateless — sessions in Redis or signed cookies, not in local memory. Flip RDS to Multi-AZ for failover, and add a read replica so the analytics queries don't hammer the primary.

One non-obvious win. Sticky sessions off. Any request should work on any instance. Auth via JWT or a Redis session cache.
What breaks first. The cron that sends emails runs on one box, and you forgot to deploy it to the other. Or your write traffic maxes out the RDS primary.
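The "signed cookies" option for stateless sessions can be sketched with nothing but the standard library. This is an illustration, not a hardened implementation: SECRET_KEY, sign_session and verify_session are hypothetical names, and a real version would also carry an expiry timestamp and load the key from a secrets store.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional

SECRET_KEY = b"change-me"  # hypothetical; never hard-code a real key


def sign_session(data: dict, key: bytes = SECRET_KEY) -> str:
    """Serialise the session and append an HMAC so any instance can verify it."""
    payload = base64.urlsafe_b64encode(json.dumps(data, sort_keys=True).encode())
    sig = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return f"{payload.decode()}.{sig}"


def verify_session(cookie: str, key: bytes = SECRET_KEY) -> Optional[dict]:
    """Return the session dict, or None if the cookie is malformed or tampered with."""
    try:
        payload, sig = cookie.rsplit(".", 1)
    except ValueError:
        return None
    expected = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return None
    return json.loads(base64.urlsafe_b64decode(payload))
```

Because the cookie carries its own proof of integrity, no instance needs local session memory, which is what lets you turn sticky sessions off.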

About $500 a month. Still cheap.

5 · Stage 3 — 100K users

An auto-scaling group, ~6 EC2 instances at peak. Background jobs move into SQS plus a worker pool so the web tier only does fast, synchronous work. Drop ElastiCache (Redis) in front of Postgres for the hot stuff — sessions, rate-limit counters, cached queries. Get credentials out of your AMI and into AWS Secrets Manager.
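As a rough picture of the rate-limit counters mentioned above, here is a fixed-window limiter. A plain dictionary stands in for Redis, so this only works inside one process; with ElastiCache the same idea would typically map to an INCR on a per-window key plus an expiry. The class name and parameters are assumptions for illustration.

```python
import time


class FixedWindowLimiter:
    """Allow at most `limit` requests per client per `window_s` seconds."""

    def __init__(self, limit, window_s=60, clock=time.time):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock      # injectable so tests can control time
        self.counters = {}      # stand-in for Redis: (client, window) -> count

    def allow(self, client_id):
        window = int(self.clock() // self.window_s)
        key = (client_id, window)
        count = self.counters.get(key, 0) + 1   # in Redis: INCR on the key
        self.counters[key] = count
        # Old windows are never pruned here; in Redis a key expiry would do that.
        return count <= self.limit
```

Fixed windows allow a short burst at a window boundary. That is usually acceptable at this stage, and a sliding window or token bucket is the next step if it is not.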

Observability grows up. Centralised logs (CloudWatch or Datadog), real dashboards. You'll get paged for the first time. The right move is to write a runbook, not silence the alert.
The database starts to hurt. Slow queries on the big table. Add the indexes. Add another read replica if you read a lot more than you write. This is the last stage where "buy a bigger RDS box" is the right answer.
What breaks first. A bad deploy takes every instance in the auto-scaling group down at once because they all share the same broken AMI. After the second time it happens, you wire up blue/green deploys.

About $3K a month.

6 · Stage 4 — 1M users

This is the shape you'll recognise: ALB → ASG → RDS → ElastiCache → SQS → S3. CloudFront in front. WAF + Shield for DDoS. Three availability zones for the web tier and the database. You can sit here for a long time. Stack Overflow ran on something close to this shape for years, albeit on its own hardware rather than on AWS.

The database is the bottleneck. Replicas eat most reads; the primary is now write-bound. The honest move is to denormalise before you shard — caches, materialised views, secondary indexes. Sharding is expensive; put it off.
Async work is real now. Email send, image processing, report generation — all on the worker tier. Producers post to SQS, workers consume. Back-pressure is built in.
Deploys get serious. Blue/green on the ASG. Feature flags (LaunchDarkly, or a DB-table you wrote yourself). The on-call rotation is a real thing now.
What breaks first. One bad SQL query takes down the primary. A noisy service eats the whole DB connection pool. Request traces on the web tier show 80% of the time spent waiting on the database.
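The producer/worker split described above can be sketched in one process, with queue.Queue standing in for SQS and threads standing in for the worker tier. Everything here (start_workers, stop_workers, the dead_letters list) is a hypothetical illustration of the pattern, not SQS's API.

```python
import queue
import threading


def start_workers(jobs, handler, dead_letters, n_workers=4):
    """Start threads that drain `jobs` until each sees a None sentinel."""
    def run():
        while True:
            job = jobs.get()              # blocks until a job is available
            try:
                if job is None:           # sentinel: shut this worker down
                    return
                handler(job)
            except Exception:
                dead_letters.append(job)  # rough stand-in for a dead-letter queue
            finally:
                jobs.task_done()

    threads = [threading.Thread(target=run, daemon=True) for _ in range(n_workers)]
    for t in threads:
        t.start()
    return threads


def stop_workers(jobs, threads):
    """Send one sentinel per worker and wait for all of them to exit."""
    for _ in threads:
        jobs.put(None)
    for t in threads:
        t.join()
```

The web tier only pays the cost of jobs.put(...). The queue absorbs bursts while workers drain it at their own pace, which is the sense in which back-pressure comes for free.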

About $25K a month.

7 · Stage 5 — 10M users

The monolith starts to split. The first thing you peel off is whatever you change most often, or whatever scales differently from the rest — usually the data-heavy one (search) or the high-write one (notifications, analytics). The move is the same every time: find a bounded context, give it its own database, route to it from the ALB by path prefix. The first split is the hardest. The rest become muscle memory.

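Sharded database, at last. One service outgrows what a single RDS primary handles comfortably (call it ~5 TB and ~50K IOPS as a rule of thumb; the hard service limits are higher, but backups, restores and schema changes hurt well before you reach them). Pick a shard key (user_id is usually safest), build a shard router, double-write during the cutover, then start reading from the new shards. This is a multi-quarter project; don't pretend it isn't.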
Search gets its own service. Elasticsearch or OpenSearch. An indexer reads from SQS or Kinesis to keep results fresh.
Caching grows up. Cache-aside in the app. ElastiCache in cluster mode. Real care around stampedes — probabilistic early expiration, single-flight.
Observability stops being optional. Traces (X-Ray, OpenTelemetry), metrics (Prometheus or CloudWatch), structured logs (Loki, Splunk, Datadog), SLOs with error budgets.
Multi-AZ everywhere — but not multi-region yet. Whole-region failures are rare enough that paying for active-active across regions isn't worth it at this point.
What breaks first. The newly-sharded service has a hot shard. Or the analytics pipeline quietly drops 5% of events for two weeks because of a partition leader change nobody noticed.
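The sharding cutover above (shard key, router, double-write, then flip reads) can be sketched in a few lines. Dictionaries stand in for databases, and shard_for and CutoverRouter are hypothetical names. The hash-modulo placement is an assumption chosen for brevity; it moves most keys if the shard count changes, so real routers often use a lookup table or consistent hashing instead.

```python
import hashlib


def shard_for(user_id, n_shards):
    """Map a user_id to a shard index with a hash that is stable across processes."""
    digest = hashlib.sha256(str(user_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_shards


class CutoverRouter:
    """Double-write to the legacy DB and the new shards; flip reads with one flag."""

    def __init__(self, n_shards):
        self.legacy = {}                              # stand-in for the old database
        self.shards = [{} for _ in range(n_shards)]   # stand-ins for the new shards
        self.read_from_new = False

    def write(self, user_id, row):
        self.legacy[user_id] = row
        self.shards[shard_for(user_id, len(self.shards))][user_id] = row

    def read(self, user_id):
        if self.read_from_new:
            return self.shards[shard_for(user_id, len(self.shards))].get(user_id)
        return self.legacy.get(user_id)
```

The sketch skips the slow parts: backfilling historical rows into the shards, and comparing reads from both sides before trusting the flag.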

About $200K a month.
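For the stampede protection mentioned under caching, here is a sketch of probabilistic early expiration on top of cache-aside. Each read may refresh the entry slightly before its TTL, and the chance grows as expiry approaches and as the recompute gets slower, so callers do not all miss at the same instant. The class name, the beta parameter, and the in-memory dictionary (standing in for ElastiCache) are assumptions for illustration.

```python
import math
import random
import time


class EarlyExpiryCache:
    """Cache-aside with probabilistic early recomputation."""

    def __init__(self, ttl_s, beta=1.0, clock=time.time, rng=random.random):
        self.ttl_s = ttl_s
        self.beta = beta      # above 1 refreshes earlier, below 1 later
        self.clock = clock    # injectable for tests
        self.rng = rng        # must return a float in [0, 1)
        self.store = {}       # key -> (value, recompute_seconds, expiry)

    def get(self, key, recompute):
        entry = self.store.get(key)
        if entry is not None:
            value, delta, expiry = entry
            # log of a number in (0, 1] is <= 0, so this adds a random head start.
            head_start = -delta * self.beta * math.log(1.0 - self.rng())
            if self.clock() + head_start < expiry:
                return value
        start = self.clock()
        value = recompute()
        delta = self.clock() - start   # how long the recompute took
        self.store[key] = (value, delta, self.clock() + self.ttl_s)
        return value
```

Single-flight is the complementary technique: when a miss does happen, only one caller recomputes and the rest wait for its result.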

8 · Stage 6 — 100M users

Everything is a microservice. Everything is sharded. Everything has its own SLO. A platform team exists to run the shared bits — Kafka, Kubernetes, observability, CI/CD. Product teams own their own services end to end.

Kafka is the spine. Most service-to-service async traffic goes through it. The "event bus" replaces SQS for the long-running pipelines; SQS still works fine for short job queues.
Multi-region active-active. Reads from the closest region. Writes routed by user location. Async replication across regions. The conflict-resolution story (CRDTs, last-write-wins on certain tables) is now a real one.
Capacity planning is a discipline. Every service has a load test, a known per-instance throughput, a documented scaling story. The big incidents now are about how services interact — service A's retry storm taking down service B.
Cost engineering is a real job. A FinOps team. Reserved instances, spot for batch, savings plans, AZ-aware traffic to dodge cross-AZ data charges (which can be the second-biggest line item on the bill).
Chaos engineering. Game days. Region failover drills. Database failover drills. The first time you do them you find a dozen surprises. By the third time, far fewer.
What breaks first. An AWS region has a bad day (us-east-1 has had several famous ones). Or someone pushes a global config change that silently breaks a downstream service nobody tested.
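The two conflict-resolution options named above can each be shown in miniature. lww_merge is last-write-wins on a (timestamp, region, value) tuple, and GCounter is one of the simplest CRDTs, a grow-only counter. Both are hypothetical sketches with region names as placeholders. Last-write-wins also assumes reasonably synchronised clocks, which is its main weakness.

```python
def lww_merge(a, b):
    """Last-write-wins: keep the later (timestamp, region, value); region breaks ties."""
    return max(a, b, key=lambda v: (v[0], v[1]))


class GCounter:
    """Grow-only counter CRDT: each region increments only its own slot."""

    def __init__(self, region):
        self.region = region
        self.counts = {}

    def increment(self, n=1):
        self.counts[self.region] = self.counts.get(self.region, 0) + n

    def merge(self, other):
        # Per-region max is commutative, associative and idempotent, so
        # replicas converge regardless of the order or repetition of merges.
        for region, count in other.counts.items():
            self.counts[region] = max(self.counts.get(region, 0), count)

    def value(self):
        return sum(self.counts.values())
```

The counter never loses an increment, whereas last-write-wins silently drops the older write. That is why the choice is made table by table.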

About $3M a month.
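The capacity-planning arithmetic is simple once a load test has given you a per-instance throughput. A sketch, where the target utilisation and the minimum instance count are assumed knobs rather than figures from this guide:

```python
import math


def instances_needed(peak_qps, per_instance_qps, target_utilisation=0.6, min_instances=3):
    """Instances required to serve peak_qps while keeping headroom on each one."""
    if per_instance_qps <= 0 or not 0 < target_utilisation <= 1:
        raise ValueError("per_instance_qps must be > 0 and utilisation in (0, 1]")
    usable_qps = per_instance_qps * target_utilisation   # leave room for spikes
    return max(min_instances, math.ceil(peak_qps / usable_qps))
```

For example, a service that peaks at 12,000 QPS on instances load-tested to 500 QPS each, run at 50% utilisation, needs 48 instances. The floor of three keeps one instance per availability zone even for tiny services.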

9 · Stage 7 — Multi-region with full DR

Three or more regions active-active for the main workload. A DR site — sometimes a fourth region, sometimes a different cloud entirely — that can absorb all of production within minutes if it has to. Quarterly tests prove it can. The architecture isn't really changing any more. The story now is operations and resilience.

Latency-routed Route 53 sends users to the closest healthy region.
Aurora Global Database (or DynamoDB Global Tables) for the database. Cross-region replication in seconds, with documented RPO and RTO.
S3 cross-region replication for media. CloudFront in front so the user never knows what region they hit.
Configuration as code, globally distributed. Region failover is a CI/CD pipeline, not a runbook someone reads at 3 AM.
Compliance. SOC 2 Type II, ISO 27001, plus the region-specific stuff (GDPR, CCPA, India DPDP). Data residency is wired into the routing, not bolted on later.
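The routing decision behind "closest healthy region" reduces to a filter and a minimum. This sketch models only that decision; Route 53 makes it at the DNS layer using its own latency data and health checks, and pick_region and the sample numbers below are hypothetical.

```python
from typing import Optional


def pick_region(latencies_ms: dict, healthy: set) -> Optional[str]:
    """Return the lowest-latency region that is passing health checks, if any."""
    candidates = {r: ms for r, ms in latencies_ms.items() if r in healthy}
    if not candidates:
        return None   # nothing healthy: this is where the DR site comes in
    return min(candidates, key=candidates.get)
```

A user who normally lands in us-east-1 at 20 ms is sent to the next-closest region when us-east-1 fails its health check, which is the behaviour the failover drills are meant to prove.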

10 · What candidates get wrong

Mistake	Reality
Reaching for microservices at Stage 2	Too early. A well-factored monolith will carry you to a million users.
Sharding the DB before squeezing caches and read replicas	Sharding is a multi-quarter project. Try denormalisation and cache first — most of the time you'll buy yourself another year.
"Just use Lambda for everything"	Cold starts, weird cost curves, painful local dev. Lambda is great for event glue. It's not great as your main web tier past a few thousand QPS unless you've signed up for those trade-offs on purpose.
Skipping observability until things break	You'll be debugging blind. Spend a week on dashboards and traces at Stage 3, not at Stage 5 when it's already on fire.
Multi-region too early	2× the cost, 3× the operational burden. Most products under 100M users don't need it. Multi-AZ covers 99% of what actually goes wrong.
Pretending you'd nail it on the first try	Every real system has scar tissue. The interview is asking "knowing what breaks next, what would you do at each step" — be honest about what you'd push to later.

11 · Cost & SLOs (the through-line)

Stage	Users	Cost / month	Tolerable downtime / year
0	1	~$30	"As long as I notice"
1	100	~$120	~26 h (99.7%)
2	10K	~$500	~9 h (99.9%)
3	100K	~$3K	~4.4 h (99.95%)
4	1M	~$25K	~53 min (99.99%) for paid plans
5	10M	~$200K	~53 min (99.99%)
6	100M	~$3M	~5 min (99.999%) for hot-path operations
7	100M+ multi-region	~$5M	Single-region failure invisible to users
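The downtime column follows directly from the availability target. A small helper makes the conversion explicit; the function names are illustrative, and the table's figures are rounded.

```python
HOURS_PER_YEAR = 365 * 24   # 8,760, ignoring leap years


def downtime_hours_per_year(availability_pct):
    """Hours of downtime per year allowed by an availability target."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR


def downtime_minutes_per_year(availability_pct):
    """The same budget expressed in minutes."""
    return downtime_hours_per_year(availability_pct) * 60
```

Each extra nine cuts the budget by a factor of ten: 99.9% allows about 8.8 hours a year, 99.99% about 53 minutes, and 99.999% a little over 5 minutes.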

Every stage adds infrastructure cost in exchange for removing a specific bottleneck or failure mode. The way to defend each step is to compare the cost of building it now against the cost of running without it. Do the math. Most of the time the two numbers land within an order of magnitude of each other, so either choice is defensible as long as you can show the reasoning.