Cloud bills are made of three line items hiding behind dozens. Compute, storage, and network egress make up roughly 80% of most cloud bills. The other 20% (managed services, support, reserved-instance amortisation) confuses the picture but rarely changes the conclusion. The skill is being able to estimate the bill from the architecture diagram, before you sign anything.
Every system-design exercise should end with a cost estimate. Not because the exact number matters — it'll be wrong — but because the order of magnitude tells you whether the design is sane. A page that says "we'll cache aggressively" is more credible when you can also say "that saves $40k/month at our scale."
The three things that cost money
- Compute.
- vCPUs and RAM, hour by hour. AWS m6i.large (2 vCPU, 8 GB) is roughly $0.10/hr on-demand, $0.06/hr with a 1-year reserved instance. Prices scale roughly linearly with size within a family, which gives a rough mental model: about $73/month for a small instance, about $290/month for a mid-sized one (8 vCPU/32 GB), about $2,300/month for a beefy one (64 vCPU/256 GB). Multiply by N for your replica count. Cross-cloud comparison is within 20%: GCP and Azure price similarly.
- Storage.
- $0.10/GB-month for standard SSD block storage, $0.023/GB-month for standard object storage (S3, GCS). IOPS are a separate charge above a baseline. A 1 TB database costs $100/month in raw disk; add backups (×2) and replicas (×3) and the multipliers stack to roughly $600–700/month. Snapshots are cheap; lifecycle policies stop them piling up.
- Egress.
- The line item that breaks budgets. $0.09/GB egress from AWS / GCP to the public internet, dropping at scale to about $0.05/GB. Inter-region transfer is $0.02/GB. Traffic within a single AZ is free; cross-AZ traffic in the same region is about $0.01/GB each way. A service that ships 100 TB/month to users costs $9k/month in egress alone. CDNs (CloudFront, Cloudflare) cut this by roughly 5–10× and are usually the first thing to do.
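The three unit prices above are enough to turn a resource count into a monthly figure. A minimal sketch, assuming the illustrative list prices quoted in this section (the function and constant names are hypothetical, and real prices vary by region and over time):

```python
HOURS_PER_MONTH = 730

# Illustrative list prices from the text, not a live price list.
SSD_PER_GB_MONTH = 0.10
EGRESS_PER_GB = 0.09


def compute_monthly(hourly_rate: float, replicas: int = 1) -> float:
    """Monthly cost of `replicas` identical instances running all month."""
    return hourly_rate * HOURS_PER_MONTH * replicas


def storage_monthly(gb: float, price_per_gb: float = SSD_PER_GB_MONTH) -> float:
    """Monthly cost of `gb` gigabytes at a flat per-GB-month price."""
    return gb * price_per_gb


def egress_monthly(gb_out: float, price_per_gb: float = EGRESS_PER_GB) -> float:
    """Monthly cost of `gb_out` gigabytes sent to the public internet."""
    return gb_out * price_per_gb


if __name__ == "__main__":
    print(f"small instance: ${compute_monthly(0.10):,.0f}/month")
    print(f"1 TB of SSD:    ${storage_monthly(1_000):,.0f}/month")
    print(f"100 TB egress:  ${egress_monthly(100_000):,.0f}/month")
```

Decimal units (1 TB = 1,000 GB) are used throughout, which matches how providers meter storage and transfer closely enough for an order-of-magnitude estimate.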
The back-of-envelope formula
Take the workload spec and turn it into dollars. Five lines.
- Compute. N hosts × instance hourly cost × 730 hours/month.
- Storage. Total GB (data + indexes + WAL + backups × retention) × $0.10/GB-month.
- Egress. Average response size × monthly request count → GB/month × $0.09/GB. Subtract whatever the CDN caches.
- Managed services. RDS roughly 1.5× raw EC2. ElastiCache roughly 1.3×. SQS at $0.40 per million requests is cheap until you cross 100M/day (about $1,200/month). DynamoDB charges per RCU/WCU and Spanner per provisioned node or processing unit, so actually do the multiplication.
- Padding. Add 20% for things you forgot (load balancers, NAT gateways, KMS, CloudWatch, support tier, public IPv4 addresses you didn't notice were chargeable).
That's it. The result is within 25% of the real bill for most architectures. If your design comes out at $50k/month, the real bill will be somewhere between roughly $38k and $62k, which is a useful number, not a precise one.
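The five lines above can be sketched as a single function. This is a minimal illustration, assuming the flat $0.10/GB-month and $0.09/GB rates from the text; the function name, parameter names, and the sample workload are all hypothetical.

```python
HOURS_PER_MONTH = 730


def estimate_monthly_bill(
    hosts: int,
    hourly_rate: float,
    storage_gb: float,
    avg_response_kb: float,
    monthly_requests: float,
    cdn_hit_rate: float = 0.0,
    managed_services: float = 0.0,
    padding: float = 0.20,
) -> dict:
    """Back-of-envelope bill: compute, storage, egress, managed, padding."""
    compute = hosts * hourly_rate * HOURS_PER_MONTH
    storage = storage_gb * 0.10                            # $/GB-month, SSD
    egress_gb = avg_response_kb * monthly_requests / 1e6   # KB -> GB
    egress = egress_gb * (1 - cdn_hit_rate) * 0.09         # CDN hits never leave origin
    subtotal = compute + storage + egress + managed_services
    pad = subtotal * padding
    return {
        "compute": compute,
        "storage": storage,
        "egress": egress,
        "managed": managed_services,
        "padding": pad,
        "total": subtotal + pad,
    }


if __name__ == "__main__":
    # Hypothetical workload: 4 hosts, 500 GB stored, 1 billion 10 KB responses.
    bill = estimate_monthly_bill(
        hosts=4, hourly_rate=0.20, storage_gb=500,
        avg_response_kb=10, monthly_requests=1e9,
        cdn_hit_rate=0.5, managed_services=300,
    )
    for name, dollars in bill.items():
        print(f"{name:>8}: ${dollars:,.0f}")
```

Managed services are passed in as a dollar figure rather than derived, since their pricing models differ too much to fold into one formula.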
A worked example
The search-suggest API from the capacity-planning chapter: 50k req/s peak, 10k average, p99 30 ms target, single AZ tolerance.
- Compute.
- 6 pods on 3 nodes (per the capacity plan), m6i.xlarge = 4 vCPU / 16 GB at ~$0.20/hr. 3 nodes × $0.20 × 730h = $438/month.
- Redis.
- cache.m6g.large (2 nodes for HA) at ~$0.16/hr each. 2 × $0.16 × 730h ≈ $234/month.
- Postgres.
- db.m6g.large + 1 read replica, 100 GB SSD. ~$0.34/hr × 2 × 730h + $10 storage ≈ $506/month.
- Egress.
- Average response 4 KB × 10k rps average × 86 400 sec/day × 30 days ≈ 100 TB/month. After CDN (70% hit rate), ~30 TB egress × $0.09 = $2,700/month. Pre-CDN it would be $9,000.
- Load balancer + misc.
- ALB ~$25/month, NAT gateway ~$45/month, CloudWatch ~$30, KMS ~$5. ~$105/month.
- Padding (20%).
- Add ~$800.
- Total.
- ~$4,800/month.
If your boss/CFO/finance team expected $1k or $100k, your architecture isn't matching their mental model. Catching that mismatch before launch is the whole point of doing this exercise.
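The arithmetic in the worked example can be re-run as a short script. This sketch uses the rates and counts stated above; the variable names are hypothetical. It keeps the egress volume unrounded (about 103.7 TB rather than 100 TB), so the total lands near $4,900 instead of the rounded $4,800.

```python
HOURS = 730

# Egress: 4 KB responses at 10k req/s average, over a 30-day month.
requests_per_month = 10_000 * 86_400 * 30
egress_gb = 4 * requests_per_month / 1e6      # KB -> GB
cdn_hit_rate = 0.70

line_items = {
    "compute": 3 * 0.20 * HOURS,                         # 3 m6i.xlarge nodes
    "redis": 2 * 0.16 * HOURS,                           # 2 cache nodes for HA
    "postgres": 2 * 0.34 * HOURS + 10,                   # primary + replica + 100 GB SSD
    "egress": egress_gb * (1 - cdn_hit_rate) * 0.09,     # only CDN misses leave origin
    "lb_misc": 25 + 45 + 30 + 5,                         # ALB, NAT, CloudWatch, KMS
}

subtotal = sum(line_items.values())
padding = subtotal * 0.20
total = subtotal + padding

if __name__ == "__main__":
    for name, dollars in line_items.items():
        print(f"{name:>9}: ${dollars:,.0f}")
    print(f"{'padding':>9}: ${padding:,.0f}")
    print(f"{'total':>9}: ${total:,.0f}")
```

Egress is about two thirds of the subtotal here, which is why the CDN hit rate is the first number worth challenging in this estimate.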
The patterns that change the bill
- Reserved instances / Savings Plans.
- 1-year commitment cuts compute by roughly 30–40%, 3-year by 50% or more. Free money for steady-state workloads. Don't apply to autoscaled burst capacity; keep that on-demand.
- Spot / Preemptible.
- 60–90% discount on compute for fault-tolerant workloads (batch, ETL, stateless replicas). The risk is interruption; the design has to tolerate it. CI runners, ML training, stateless web tiers are all good candidates.
- CDN.
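- Cuts egress by 5–10×, plus drops latency. Almost always the first thing to do once egress matters. CDN traffic is not free, though: Cloudflare's self-serve plans are flat-rate rather than metered per GB, while CloudFront list pricing starts near the $0.09/GB origin rate and falls to around $0.02–0.05/GB only at high volume or with committed-use pricing.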
- Lifecycle policies on S3 / GCS.
- Move data older than 30 days to Glacier / Coldline. From $0.023/GB-month to $0.004/GB-month, roughly 5–6× cheaper. The rule itself costs nothing to configure; transitions and retrievals are billed per request, so it pays off best for data you rarely read back.
- Data compression.
- Zstd typically shrinks text-like data (logs, JSON, SQL dumps) by about 3× at negligible CPU cost. Applied to S3 logs, database backups, and Kafka topics it cuts storage and egress both. Most engineers don't think to enable it.
- Right-sizing.
- CloudWatch / Datadog will tell you which instances are running at 5% CPU. They should be smaller. Right-sizing alone routinely cuts cloud bills by 20–30% for teams that never look.
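To see how these levers combine, the sketch below applies them to a made-up bill. It assumes a hypothetical $10,000/month on-demand compute spend split into steady, batch, and burst portions, and a hypothetical 50 TB bucket; the discount rates are the approximate figures from the list above (a 70% spot discount is assumed from the 60–90% range).

```python
# Hypothetical on-demand compute spend, split by workload shape.
spend = {"steady": 6_000, "batch": 3_000, "burst": 1_000}

# Approximate discounts from the text: 1-year reserved for the steady
# baseline, spot for fault-tolerant batch, nothing for autoscaled burst.
discount = {"steady": 0.30, "batch": 0.70, "burst": 0.0}

compute_after = {k: spend[k] * (1 - discount[k]) for k in spend}
compute_total = sum(compute_after.values())


def lifecycle_storage(total_gb: float, cold_fraction: float,
                      hot_price: float = 0.023, cold_price: float = 0.004) -> float:
    """Monthly object-storage cost when `cold_fraction` sits in the cold tier."""
    hot_gb = total_gb * (1 - cold_fraction)
    cold_gb = total_gb * cold_fraction
    return hot_gb * hot_price + cold_gb * cold_price


if __name__ == "__main__":
    print(f"compute: ${sum(spend.values()):,.0f} -> ${compute_total:,.0f}")
    before = lifecycle_storage(50_000, 0.0)
    after = lifecycle_storage(50_000, 0.8)
    print(f"storage: ${before:,.0f} -> ${after:,.0f}")
```

Per-request transition and retrieval fees are left out of `lifecycle_storage`, so treat its output as an upper bound on the saving.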
The patterns that quietly blow it up
- Cross-region traffic.
- Replication to a DR region. ETL pipelines that pull from prod in one region and load in another. $0.02/GB adds up — replicating 10 TB/day across regions is $6k/month. Look at your ETL job DAG before signing the MSA.
- NAT gateway.
- $0.045/hr fixed plus $0.045/GB processed. A NAT gateway in front of a private subnet that pulls package updates can cost more than the instances behind it. Gateway endpoints for S3 and DynamoDB, plus interface endpoints for other AWS services, keep that traffic off the NAT gateway.
- Logs.
- CloudWatch Logs at $0.50/GB ingested. A noisy service emitting 10 GB/day burns $150/month just on ingest. Move debug logs out of CloudWatch, or downsample.
- Idle resources.
- Unattached EBS volumes. Stopped EC2 instances whose attached volumes are still billed. Old snapshots. NAT gateways in a VPC nobody uses anymore. The cloud bill grows from neglect, not just usage.
- Data transfer between AZs.
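- Cross-AZ traffic in the same region is $0.01/GB each way. A multi-AZ Kafka cluster with 100 MB/s of sustained replication moves about 260 TB/month, which is roughly $2,600/month per AZ pair in each direction. Worth knowing; it only becomes worth optimising when throughput is sustained in this range or above.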
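Each of the traffic-driven items above reduces to a rate multiplied by a volume. A minimal sketch, assuming the per-GB and per-hour prices quoted in this list and a 30-day month (the function names are hypothetical):

```python
HOURS_PER_MONTH = 730
DAYS_PER_MONTH = 30
SECONDS_PER_MONTH = 86_400 * DAYS_PER_MONTH


def nat_gateway_monthly(gb_processed: float) -> float:
    """Fixed hourly charge plus per-GB processing charge."""
    return 0.045 * HOURS_PER_MONTH + 0.045 * gb_processed


def cross_region_monthly(gb_per_day: float) -> float:
    """Inter-region transfer at $0.02/GB."""
    return gb_per_day * DAYS_PER_MONTH * 0.02


def log_ingest_monthly(gb_per_day: float) -> float:
    """CloudWatch Logs ingestion at $0.50/GB."""
    return gb_per_day * DAYS_PER_MONTH * 0.50


def cross_az_monthly(mb_per_sec: float) -> float:
    """Sustained cross-AZ traffic at $0.01/GB, one direction only."""
    gb_per_month = mb_per_sec * SECONDS_PER_MONTH / 1_000
    return gb_per_month * 0.01


if __name__ == "__main__":
    print(f"NAT, 1 TB processed:     ${nat_gateway_monthly(1_000):,.2f}")
    print(f"cross-region, 10 TB/day: ${cross_region_monthly(10_000):,.0f}")
    print(f"logs, 10 GB/day:         ${log_ingest_monthly(10):,.0f}")
    print(f"cross-AZ, 100 MB/s:      ${cross_az_monthly(100):,.0f}")
```

Running these against traffic figures from a flow log or a service mesh dashboard is usually enough to tell which of the four deserves attention first.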
What a defensible cost estimate looks like
Six lines that turn an architecture into a number anyone can sanity-check.
- Workload. "X req/s peak, Y average, Z GB stored, average response size N KB."
- Compute. "Pod count, instance type, monthly subtotal."
- Storage. "DB + cache + object storage, monthly subtotal."
- Egress. "TB/month → GB out → after CDN, monthly subtotal."
- Managed. "RDS, ElastiCache, queue, monitoring — monthly subtotal."
- Total + padding. "Sum × 1.2 = estimate. Range ±25%."
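One way to keep the six lines together is a small record type that computes the total and range from the subtotals. This is only a sketch: the `CostEstimate` class, its field names, and the sample figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CostEstimate:
    workload: str          # free-text workload summary
    compute: float         # monthly subtotals in dollars
    storage: float
    egress: float
    managed: float
    padding: float = 0.20
    tolerance: float = 0.25

    @property
    def subtotal(self) -> float:
        return self.compute + self.storage + self.egress + self.managed

    @property
    def total(self) -> float:
        return self.subtotal * (1 + self.padding)

    @property
    def range(self) -> tuple:
        return (self.total * (1 - self.tolerance),
                self.total * (1 + self.tolerance))

    def render(self) -> str:
        """Return the six-line summary as plain text."""
        low, high = self.range
        return "\n".join([
            f"Workload: {self.workload}",
            f"Compute:  ${self.compute:,.0f}/month",
            f"Storage:  ${self.storage:,.0f}/month",
            f"Egress:   ${self.egress:,.0f}/month",
            f"Managed:  ${self.managed:,.0f}/month",
            f"Total:    ${self.total:,.0f}/month "
            f"(range ${low:,.0f} to ${high:,.0f})",
        ])


if __name__ == "__main__":
    est = CostEstimate(
        workload="2k req/s peak, 500 avg, 200 GB stored, 8 KB responses",
        compute=1_000, storage=200, egress=2_000, managed=800,
    )
    print(est.render())
```

Keeping the subtotals as inputs rather than deriving them inside the class means each line can be challenged and replaced independently during review.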
Common mistakes
- Counting one of everything.
- One instance, one database, one load balancer. Real systems have N replicas, ≥2 AZs, primary + standby + read replicas, dev / staging / prod. Multiply early.
- Forgetting egress.
- The line item people don't think about, and the one that dominates at scale. Always estimate it. Put a CDN in front if it's significant.
- Mixing on-demand and reserved without tracking which.
- You'll discover six months in that your "saved" 40% only applies to the steady baseline, not the autoscaled fleet. Track the reserved coverage as a percentage of compute hours.
- Pricing managed services as raw IaaS.
- RDS isn't EC2 + EBS. DynamoDB isn't pay-per-instance. Look up the actual pricing model; don't approximate from compute.
- Ignoring support tier.
- Business Support starts at 10% of monthly spend and tiers down as spend grows. Enterprise runs 3–10% with a monthly minimum. Real money on a $1M/yr bill.
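The reserved-versus-on-demand mistake is easiest to see as arithmetic: the discount applies only to the hours a reservation covers. A minimal sketch, assuming a hypothetical fleet of 10 reserved instances plus an autoscaled group that averages 15 more, and the 40% reserved discount mentioned above:

```python
HOURS_PER_MONTH = 730


def reserved_coverage(reserved_hours: float, total_hours: float) -> float:
    """Fraction of all compute hours that a reservation covers."""
    return reserved_hours / total_hours


def effective_saving(coverage: float, reserved_discount: float) -> float:
    """Saving on the whole compute bill, not just the reserved part."""
    return coverage * reserved_discount


# Hypothetical fleet: 10 reserved instances run all month, and the
# autoscaled group averages 15 additional on-demand instances.
reserved_hours = 10 * HOURS_PER_MONTH
total_hours = (10 + 15) * HOURS_PER_MONTH

coverage = reserved_coverage(reserved_hours, total_hours)
saving = effective_saving(coverage, reserved_discount=0.40)

if __name__ == "__main__":
    print(f"coverage: {coverage:.0%}, effective saving: {saving:.0%}")
```

This assumes every instance has the same hourly rate; with mixed instance types, weight the hours by on-demand price before taking the ratio.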
What to read next
- Capacity planning · primer
- The exercise that produces the inputs to this one. Capacity tells you how many of what; cost tells you what it bills.
- Scaling out · primer
- The horizontal-growth patterns and how each affects the cost line.
- CDN — anatomy · guide
- The single biggest egress reducer. Worth understanding the cache key, the TTL, and the invalidation cost.
- Queueing theory · learn path
- The math behind sizing for utilisation, which is what you bill against.