How to estimate cost — System Design Handbook

Cloud bills are made of three line items hiding behind dozens. Compute, storage, and network egress make up roughly 80% of most cloud bills. The other 20% (managed-service premiums, support, reserved-instance amortisation, and miscellaneous fees) confuses the picture but rarely changes the conclusion. The skill is being able to estimate the bill from the architecture diagram, before you sign anything.

Every system-design exercise should end with a cost estimate. Not because the exact number matters — it'll be wrong — but because the order of magnitude tells you whether the design is sane. A page that says "we'll cache aggressively" is more credible when you can also say "that saves $40k/month at our scale."

The three things that cost money

Compute. vCPUs and RAM, hour by hour. AWS m6i.large (2 vCPU, 8 GB) is roughly $0.10/hr on-demand, $0.06/hr with a 1-year reserved instance. Price scales roughly linearly with size, so a rough mental model is $72/month for a small instance, about $290/month for a medium-large one (8 vCPU/32 GB), and about $2,300/month for a beefy one (64 vCPU/256 GB). Multiply by N for your replica count. GCP and Azure price similarly, within about 20%.
Storage. $0.10/GB-month for standard SSD block storage, $0.023/GB-month for standard object storage (S3, GCS). IOPS are a separate charge above a baseline. A 1 TB database costs $100/month in raw disk; add two backup copies (+$200) and three replicas (+$300) and you're at about $600/month. Snapshots are cheap; lifecycle policies stop them piling up.
Egress. The line item that breaks budgets. $0.09/GB egress from AWS / GCP to the public internet, dropping at scale to about $0.05/GB. Inter-region transfer is $0.02/GB. Traffic within a single AZ is free; cross-AZ traffic in the same region is about $0.01/GB each way. A service that ships 100 TB/month to users costs $9k/month in egress alone. CDNs (CloudFront, Cloudflare) cut this by roughly 5–10× and are usually the first thing to do.
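The unit prices above turn into monthly figures with plain multiplication. A minimal sketch, assuming the approximate list prices quoted in this section (they vary by region and change over time); the function names are hypothetical and exist only for illustration:

```python
# Approximate unit prices quoted in this section. Treat them as assumptions
# and check the current price list for your region before relying on them.
HOURS_PER_MONTH = 730
SSD_PER_GB_MONTH = 0.10   # standard SSD block storage, $/GB-month
EGRESS_PER_GB = 0.09      # egress to the public internet, $/GB


def compute_monthly(hourly_rate: float, hosts: int = 1) -> float:
    """Monthly compute cost for `hosts` identical instances."""
    return hourly_rate * HOURS_PER_MONTH * hosts


def storage_monthly(gb: float, price_per_gb: float = SSD_PER_GB_MONTH) -> float:
    """Monthly cost of `gb` of provisioned storage."""
    return gb * price_per_gb


def egress_monthly(gb_out: float, price_per_gb: float = EGRESS_PER_GB) -> float:
    """Monthly cost of `gb_out` sent to the public internet."""
    return gb_out * price_per_gb


print(f"m6i.large on-demand: ${compute_monthly(0.10):,.0f}/month")
print(f"1 TB of SSD:         ${storage_monthly(1_000):,.0f}/month")
print(f"100 TB of egress:    ${egress_monthly(100_000):,.0f}/month")
```

Keeping the prices as named constants makes it easy to swap in another provider's or region's numbers without touching the arithmetic.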

The back-of-envelope formula

Take the workload spec and turn it into dollars. Five lines.

Compute. N hosts × instance hourly cost × 730 hours/month.
Storage. Total GB (data + indexes + WAL + backups × retention) × $0.10/GB-month.
Egress. Average response size × monthly request count → GB/month × $0.09/GB. Subtract whatever the CDN caches.
Managed services. RDS roughly 1.5× raw EC2. ElastiCache roughly 1.3×. SQS at $0.40 per million requests is cheap until you cross 100M/day (about $1,200/month). DynamoDB charges per RCU/WCU and Spanner per node or processing unit, so actually do the multiplication.
Padding. Add 20% for things you forgot (load balancers, NAT gateways, KMS, CloudWatch, support tier, and anything else you didn't notice was chargeable).

That's it. The result is within 25% of the real bill for most architectures. If your design comes out at $50k/month, the real bill will be roughly $38–63k, which is a useful number, not a precise one.
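The five lines can be sketched as a single function. This is an illustration under stated assumptions, not a pricing tool: `estimate_monthly` and its parameters are hypothetical names, the default prices are the approximate figures used in this chapter, and managed services are passed in as a dollar figure you have already looked up.

```python
HOURS_PER_MONTH = 730


def estimate_monthly(
    hosts: int,
    hourly_rate: float,
    storage_gb: float,
    egress_gb: float,
    cdn_hit_rate: float = 0.0,
    managed: float = 0.0,
    padding: float = 0.20,
    storage_price: float = 0.10,   # $/GB-month, assumed SSD price
    egress_price: float = 0.09,    # $/GB, assumed internet egress price
) -> dict:
    """Return each line of the estimate plus the padded total, in $/month."""
    compute = hosts * hourly_rate * HOURS_PER_MONTH
    storage = storage_gb * storage_price
    # Subtract whatever the CDN caches: only the uncached share is billed here.
    egress = egress_gb * (1.0 - cdn_hit_rate) * egress_price
    subtotal = compute + storage + egress + managed
    pad = subtotal * padding
    return {
        "compute": compute,
        "storage": storage,
        "egress": egress,
        "managed": managed,
        "padding": pad,
        "total": subtotal + pad,
    }


# Hypothetical workload: 10 hosts at $0.20/hr, 2 TB stored, 50 TB/month served
# with a 50% CDN hit rate, and $500/month of managed services.
estimate = estimate_monthly(10, 0.20, 2_000, 50_000, cdn_hit_rate=0.5, managed=500)
for line, dollars in estimate.items():
    print(f"{line:<8} ${dollars:>9,.0f}")
```

Returning the individual lines rather than only the total keeps the estimate auditable: a reviewer can challenge one input without redoing the rest.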

A worked example

The search-suggest API from the capacity-planning chapter: 50k req/s peak, 10k average, p99 30 ms target, single AZ tolerance.

Compute. 6 pods on 3 nodes (per the capacity plan), m6i.xlarge = 4 vCPU / 16 GB at ~$0.20/hr. 3 nodes × $0.20 × 730h = $438/month.
Redis. cache.m6g.large (2 nodes for HA) at ~$0.16/hr each. 2 × $0.16 × 730h ≈ $234/month.
Postgres. db.m6g.large + 1 read replica, 100 GB SSD. ~$0.34/hr × 2 × 730h + $10 storage ≈ $506/month.
Egress. Average response 4 KB × 10k rps average × 86 400 sec/day × 30 days ≈ 100 TB/month. After CDN (70% hit rate), ~30 TB egress × $0.09 = $2,700/month. Pre-CDN it would be $9,000.
Load balancer + misc. ALB ~$25/month, NAT gateway ~$45/month, CloudWatch ~$30, KMS ~$5. ~$105/month.
Padding (20%). 20% of the ~$3,980 subtotal: add ~$800.
Total. ~$4,800/month.
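The arithmetic of the worked example can be replayed in a few lines, which makes it easy to change one input and see the total move. A sketch using only the figures listed above; the variable names are made up for illustration, and the egress volume is rounded to 100 TB as in the text.

```python
HOURS_PER_MONTH = 730

compute = 3 * 0.20 * HOURS_PER_MONTH                 # 3 m6i.xlarge nodes
redis = 2 * 0.16 * HOURS_PER_MONTH                   # 2 cache nodes for HA
postgres = 2 * 0.34 * HOURS_PER_MONTH + 100 * 0.10   # primary + replica, 100 GB SSD

# 4 KB responses at 10k req/s average over a 30-day month.
exact_gb_out = 4_000 * 10_000 * 86_400 * 30 / 1e9    # about 103,680 GB
gb_out = 100_000                                     # rounded to 100 TB, as in the text
cdn_hit_rate = 0.70
egress = gb_out * (1 - cdn_hit_rate) * 0.09          # only cache misses are billed

misc = 25 + 45 + 30 + 5                              # ALB, NAT gateway, CloudWatch, KMS

subtotal = compute + redis + postgres + egress + misc
padding = 0.20 * subtotal
total = subtotal + padding

print(f"exact egress volume: {exact_gb_out:,.0f} GB/month")
print(f"subtotal ${subtotal:,.0f}, padding ${padding:,.0f}, total ${total:,.0f}")
```

Egress is well over half of the subtotal here, so the CDN hit rate is the input most worth double-checking.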

If your boss/CFO/finance team expected $1k or $100k, your architecture isn't matching their mental model. Catching that mismatch before launch is the whole point of doing this exercise.

The patterns that change the bill

Reserved instances / Savings Plans. A 1-year commitment cuts compute by roughly 30–40%, a 3-year commitment by 50% or more. Free money for steady-state workloads. Don't apply it to autoscaled burst capacity; keep that on-demand.
Spot / Preemptible. 60–90% discount on compute for fault-tolerant workloads (batch, ETL, stateless replicas). The risk is interruption; the design has to tolerate it. CI runners, ML training, and stateless web tiers are all good candidates.
CDN. Cuts egress by 5–10×, plus drops latency. Almost always the first thing to do once egress matters. Cloudflare's per-GB pricing is around $0.01/GB for the standard plans; CloudFront sits around $0.02–0.05/GB at most volumes.
Lifecycle policies on S3 / GCS. Move data older than 30 days to Glacier / Coldline. From $0.023/GB-month to $0.004/GB-month, more than 5× cheaper. No migration is needed if the data is already in the bucket; just configure the rule.
Data compression. Zstd compression on stored data gives roughly a 3× shrink for effectively free CPU. Applied to S3 logs, database backups, and Kafka topics it cuts both storage and egress. Most engineers don't think to enable it.
Right-sizing. CloudWatch / Datadog will tell you which instances are running at 5% CPU. They should be smaller. Right-sizing alone routinely cuts cloud bills by 20–30% for teams that never look.
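Two of these patterns are easy to put numbers on: reserving only the steady baseline, and tiering plus compressing stored data. A rough sketch, assuming a 30% reserved discount, the object-storage prices quoted above, and a 3× compression ratio; all function names and the example workload are hypothetical.

```python
HOURS_PER_MONTH = 730


def blended_compute(baseline_hosts: int, burst_hosts: int, hourly_rate: float,
                    burst_hours: float, reserved_discount: float = 0.30) -> float:
    """Reserved pricing for the steady baseline, on-demand for burst capacity."""
    reserved = baseline_hosts * hourly_rate * (1 - reserved_discount) * HOURS_PER_MONTH
    on_demand = burst_hosts * hourly_rate * burst_hours
    return reserved + on_demand


def tiered_storage(total_gb: float, cold_fraction: float,
                   compression_ratio: float = 1.0,
                   hot_price: float = 0.023, cold_price: float = 0.004) -> float:
    """Object storage cost after compression and a lifecycle move to a cold tier."""
    stored_gb = total_gb / compression_ratio
    hot = stored_gb * (1 - cold_fraction) * hot_price
    cold = stored_gb * cold_fraction * cold_price
    return hot + cold


# Hypothetical fleet: 10 steady hosts plus 4 burst hosts running 200 h/month.
all_on_demand = blended_compute(10, 4, 0.20, burst_hours=200, reserved_discount=0.0)
with_reserved = blended_compute(10, 4, 0.20, burst_hours=200)
print(f"compute: ${all_on_demand:,.0f} -> ${with_reserved:,.0f}/month")

# Hypothetical 50 TB of logs, 80% older than 30 days, compressed 3x.
before = tiered_storage(50_000, cold_fraction=0.0)
after = tiered_storage(50_000, cold_fraction=0.8, compression_ratio=3.0)
print(f"storage: ${before:,.0f} -> ${after:,.0f}/month")
```

The discount applies only to the baseline term, which is why the overall compute saving is smaller than the headline reserved percentage.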

The patterns that quietly blow it up

Cross-region traffic. Replication to a DR region. ETL pipelines that pull from prod in one region and load in another. $0.02/GB adds up: replicating 10 TB/day across regions is $6k/month. Look at your ETL job DAG before signing the MSA.
NAT gateway. $0.045/hr fixed plus $0.045/GB processed. A NAT gateway in front of a private subnet that pulls package updates can cost more than the instances behind it. Gateway VPC endpoints for S3 and DynamoDB, and interface endpoints for other AWS services, fix this for traffic bound for AWS services.
Logs. CloudWatch Logs at $0.50/GB ingested. A noisy service emitting 10 GB/day burns $150/month just on ingest. Move debug logs out of CloudWatch, or downsample.
Idle resources. Unattached EBS volumes. Stopped EC2 instances that still pay for storage. Old snapshots. NAT gateways in a VPC nobody uses anymore. The cloud bill grows from neglect, not just usage.
Data transfer between AZs. Cross-AZ traffic in the same region is $0.01/GB each way. A multi-AZ Kafka cluster with 100 MB/s of replication moves about 260 TB/month, which is roughly $2,600/month per AZ pair at $0.01/GB, and twice that when both directions are billed. Worth knowing; rarely worth optimising unless throughput is enormous.
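The quiet line items are all volume times a small unit price, so they are quick to check before they show up on an invoice. A minimal sketch using the per-GB and per-hour prices quoted above as assumed defaults; the function names and the 5 TB NAT example are hypothetical.

```python
def cross_region_monthly(gb_per_day: float, price_per_gb: float = 0.02,
                         days: int = 30) -> float:
    """Inter-region replication or ETL transfer cost per month."""
    return gb_per_day * days * price_per_gb


def nat_gateway_monthly(gb_processed: float, hourly: float = 0.045,
                        per_gb: float = 0.045, hours: int = 730) -> float:
    """Fixed hourly charge plus the per-GB processing charge for one NAT gateway."""
    return hourly * hours + gb_processed * per_gb


def log_ingest_monthly(gb_per_day: float, price_per_gb: float = 0.50,
                       days: int = 30) -> float:
    """Log ingestion cost per month, excluding storage and queries."""
    return gb_per_day * days * price_per_gb


print(f"10 TB/day cross-region: ${cross_region_monthly(10_000):,.0f}/month")
print(f"NAT gateway, 5 TB/mo:   ${nat_gateway_monthly(5_000):,.2f}/month")
print(f"10 GB/day of logs:      ${log_ingest_monthly(10):,.0f}/month")
```

In the NAT example the per-GB processing charge is several times the fixed hourly fee, which is why routing bulk traffic around the gateway matters more than removing the gateway itself.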

What a defensible cost estimate looks like

Six lines that turn an architecture into a number anyone can sanity-check.

Workload. "X req/s peak, Y average, Z GB stored, average response size N KB."
Compute. "Pod count, instance type, monthly subtotal."
Storage. "DB + cache + object storage, monthly subtotal."
Egress. "TB/month → GB out → after CDN, monthly subtotal."
Managed. "RDS, ElastiCache, queue, monitoring — monthly subtotal."
Total + padding. "Sum × 1.2 = estimate. Range ±25%."
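One way to keep those six lines together is a small record type that renders them in order. This is a sketch, not a prescribed format: `CostEstimate` and its fields are hypothetical, and the example figures are invented to show the output shape.

```python
from dataclasses import dataclass


@dataclass
class CostEstimate:
    workload: str
    compute: float            # monthly subtotals, in dollars
    storage: float
    egress: float
    managed: float
    padding: float = 0.20     # "Sum x 1.2"
    tolerance: float = 0.25   # "Range +/-25%"

    @property
    def total(self) -> float:
        """Padded monthly estimate."""
        subtotal = self.compute + self.storage + self.egress + self.managed
        return subtotal * (1 + self.padding)

    def render(self) -> str:
        """Return the six-line summary as plain text."""
        low = self.total * (1 - self.tolerance)
        high = self.total * (1 + self.tolerance)
        return "\n".join([
            f"Workload. {self.workload}",
            f"Compute. ${self.compute:,.0f}/month",
            f"Storage. ${self.storage:,.0f}/month",
            f"Egress. ${self.egress:,.0f}/month",
            f"Managed. ${self.managed:,.0f}/month",
            f"Total + padding. ${self.total:,.0f}/month "
            f"(range ${low:,.0f} to ${high:,.0f})",
        ])


# Invented figures, purely to show the shape of the output.
estimate = CostEstimate(
    workload="5k req/s peak, 1k average, 500 GB stored, 8 KB responses",
    compute=1_000, storage=200, egress=2_500, managed=300,
)
print(estimate.render())
```

Printing the range next to the point estimate keeps readers from treating the total as more precise than it is.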

Common mistakes

Counting one of everything. One instance, one database, one load balancer. Real systems have N replicas, ≥2 AZs, primary + standby + read replicas, dev / staging / prod. Multiply early.
Forgetting egress. The line item people don't think about, and the one that dominates at scale. Always estimate it, and put a CDN in front if it's significant.
Mixing on-demand and reserved without tracking which. You'll discover six months in that your "saved" 40% only applies to the steady baseline, not the autoscaled fleet. Track the reserved coverage as a percentage of compute hours.
Pricing managed services as raw IaaS. RDS isn't EC2 + EBS. DynamoDB isn't pay-per-instance. Look up the actual pricing model; don't approximate from compute.
Ignoring support tier. AWS Business Support starts at 10% of monthly spend and steps down as spend grows. Enterprise runs 3–10%. Real money on a $1M/yr bill.
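Two of these mistakes come down to a multiplication that is easy to skip. A short sketch with hypothetical helper names; it assumes every environment is a full-size copy of production, which overstates the count if dev and staging are smaller.

```python
def total_instances(per_az: int, azs: int = 2, environments: int = 3) -> int:
    """Instances across AZs and environments, assuming full-size copies."""
    return per_az * azs * environments


def reserved_coverage(reserved_hours: float, total_hours: float) -> float:
    """Share of compute hours that actually received reserved pricing."""
    return reserved_hours / total_hours


def effective_discount(coverage: float, reserved_discount: float) -> float:
    """Discount on the whole compute bill, not just on the reserved part."""
    return coverage * reserved_discount


# 3 instances per AZ becomes 18 once AZs and dev / staging / prod are counted.
print(total_instances(per_az=3))

# Hypothetical month: 7,300 reserved hours out of 12,000 total compute hours.
coverage = reserved_coverage(7_300, 12_000)
print(f"coverage {coverage:.0%}, effective discount "
      f"{effective_discount(coverage, 0.40):.0%}")
```

Tracking coverage alongside the headline discount shows how much of the fleet the commitment actually reaches.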
