Handbook · Vol. IV · 2026 Track III · Going horizontal · piece 5 of 5 Primer

How to estimate cost.

Three line items make up 80% of every cloud bill: compute, storage, egress. The back-of-envelope formula turns an architecture diagram into a defensible monthly dollar number. With a worked example, the patterns that cut the bill (reserved, spot, CDN), and the patterns that quietly blow it up (cross-region, NAT gateways, log ingest).

Cloud bills are made of three line items hiding behind dozens. Compute, storage, and network egress make up roughly 80% of a typical cloud bill. The other 20% (managed services, support, reserved-instance amortisation) confuses the picture but rarely changes the conclusion. The skill is being able to estimate the bill from the architecture diagram, before you sign anything.

Every system-design exercise should end with a cost estimate. Not because the exact number matters — it'll be wrong — but because the order of magnitude tells you whether the design is sane. A page that says "we'll cache aggressively" is more credible when you can also say "that saves $40k/month at our scale."

The three things that cost money

Compute.
vCPUs and RAM, billed by the hour. AWS m6i.large (2 vCPU, 8 GB) is roughly $0.10/hr on-demand, $0.06/hr with a 1-year reserved instance. Prices scale close to linearly with size within an instance family, so a rough mental model is $72/month for a small instance, about $290/month for a medium-large one (8 vCPU/32 GB), and about $2,300/month for a beefy one (64 vCPU/256 GB). Multiply by N for your replica count. GCP and Azure price comparable instances within about 20%.
Storage.
$0.10/GB-month for standard SSD block storage, $0.023/GB-month for object storage (S3, GCS). IOPS are a separate charge above a baseline. A 1 TB database costs $100/month in raw disk; multiply by 2 for backups and by 3 for replicas and you're at about $600/month. Snapshots are cheap; lifecycle policies stop them piling up.
Egress.
The line item that breaks budgets. $0.09/GB egress from AWS / GCP to the public internet, dropping at scale to about $0.05/GB. Inter-region transfer is $0.02/GB. Traffic within a single availability zone is free; cross-AZ traffic is not. A service that ships 100 TB/month to users costs $9k/month in egress alone. CDNs (CloudFront, Cloudflare) cut this by roughly 5–10× and are usually the first thing to do.

The back-of-envelope formula

Take the workload spec and turn it into dollars. Five lines.

  1. Compute. N hosts × instance hourly cost × 730 hours/month.
  2. Storage. Total GB (data + indexes + WAL + backups × retention) × $0.10/GB-month.
  3. Egress. Average response size × monthly request count → GB/month × $0.09/GB. Subtract whatever the CDN caches.
  4. Managed services. RDS roughly 1.5× raw EC2. ElastiCache roughly 1.3×. SQS at $0.40 per million requests is cheap until you cross 100M/day. DynamoDB charges per RCU/WCU and Spanner per node or processing unit, so actually do the multiplication.
  5. Padding. Add 20% for things you forgot (load balancers, NAT gateways, KMS, CloudWatch, support tier, public IPv4 addresses you didn't notice were chargeable).

That's it. The result is within 25% of the real bill for most architectures. If your design comes out at $50k/month, the real bill will be roughly $38–62k, which is a useful number, not a precise one.
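
The five lines above can be sketched as a small Python function. This is a minimal sketch rather than a pricing tool: the function name `monthly_estimate`, its parameters, and the sample workload are hypothetical, and the default rates are the rough figures quoted in this primer.

```python
HOURS_PER_MONTH = 730  # average hours in a month, as used in the formula


def monthly_estimate(hosts, hourly_rate, storage_gb, egress_gb,
                     managed_usd=0.0, cdn_hit_rate=0.0,
                     storage_rate=0.10, egress_rate=0.09, padding=0.20):
    """Return each line of the back-of-envelope formula in USD per month."""
    compute = hosts * hourly_rate * HOURS_PER_MONTH
    storage = storage_gb * storage_rate
    # Only the share of traffic the CDN does not serve is billed as origin egress.
    egress = egress_gb * (1.0 - cdn_hit_rate) * egress_rate
    subtotal = compute + storage + egress + managed_usd
    pad = subtotal * padding
    return {
        "compute": compute,
        "storage": storage,
        "egress": egress,
        "managed": managed_usd,
        "padding": pad,
        "total": subtotal + pad,
    }


# Hypothetical workload: 10 hosts at $0.20/hr, 2 TB stored, 50 TB/month served
# with half of it absorbed by a CDN, and $500/month of managed services.
estimate = monthly_estimate(hosts=10, hourly_rate=0.20, storage_gb=2_000,
                            egress_gb=50_000, managed_usd=500, cdn_hit_rate=0.5)
for item, usd in estimate.items():
    print(f"{item:>8}: ${usd:,.0f}")
```

With these made-up inputs the total comes to about $5,300/month, of which egress is the largest single line even after the CDN.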

A worked example

The search-suggest API from the capacity-planning chapter: 50k req/s peak, 10k average, p99 target of 30 ms, sized to tolerate the loss of a single AZ.

Compute.
6 pods on 3 nodes (per the capacity plan), m6i.xlarge = 4 vCPU / 16 GB at ~$0.20/hr. 3 nodes × $0.20 × 730h = $438/month.
Redis.
cache.m6g.large (2 nodes for HA) at ~$0.16/hr each. 2 × $0.16 × 730h ≈ $234/month.
Postgres.
db.m6g.large + 1 read replica, 100 GB SSD. ~$0.34/hr × 2 × 730h + $10 storage ≈ $506/month.
Egress.
Average response 4 KB × 10k rps average × 86 400 sec/day × 30 days ≈ 100 TB/month. After CDN (70% hit rate), ~30 TB egress × $0.09 = $2,700/month. Pre-CDN it would be $9,000.
Load balancer + misc.
ALB ~$25/month, NAT gateway ~$45/month, CloudWatch ~$30, KMS ~$5. ~$110/month.
Padding (20%).
Add ~$800.
Total.
~$4,800/month.
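
The arithmetic in the worked example can be checked line by line. This is a minimal sketch that uses only the figures quoted above; the variable names are illustrative, and egress uses the text's rounded 100 TB/month rather than the unrounded product.

```python
HOURS = 730
GB_PER_TB = 1_000  # decimal units, matching per-GB billing

# Unrounded traffic: 4 KB x 10k req/s x 86,400 s/day x 30 days, converted KB -> TB.
raw_egress_tb = 4 * 10_000 * 86_400 * 30 / 1_000**3
egress_tb = 100        # the text rounds the raw figure to 100 TB/month
cdn_hit_rate = 0.70    # share of traffic served from the CDN cache

line_items = {
    "compute": 3 * 0.20 * HOURS,            # 3 nodes, m6i.xlarge
    "redis": 2 * 0.16 * HOURS,              # 2 cache nodes for HA
    "postgres": 2 * 0.34 * HOURS + 10,      # primary + replica, plus storage
    "egress": egress_tb * GB_PER_TB * (1 - cdn_hit_rate) * 0.09,
    "lb_misc": 110,                         # ALB, NAT gateway, CloudWatch, KMS
}

subtotal = sum(line_items.values())
padding = 0.20 * subtotal
total = subtotal + padding

print(f"raw egress: {raw_egress_tb:.2f} TB/month")
for name, usd in line_items.items():
    print(f"{name:>9}: ${usd:,.0f}")
print(f"  padding: ${padding:,.0f}")
print(f"    total: ${total:,.0f}")
```

The subtotal is about $3,990, the padding about $800, and the total about $4,790, which the text rounds to ~$4,800.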

If your boss/CFO/finance team expected $1k or $100k, your architecture isn't matching their mental model. Catching that mismatch before launch is the whole point of doing this exercise.

The patterns that change the bill

Reserved instances / Savings Plans.
1-year commitment cuts compute by ~30%, 3-year by ~50%. Free money for steady-state workloads. Don't apply to autoscaled burst capacity — keep that on-demand.
Spot / Preemptible.
60–90% discount on compute for fault-tolerant workloads (batch, ETL, stateless replicas). The risk is interruption; the design has to tolerate it. CI runners, ML training, stateless web tiers are all good candidates.
CDN.
Cuts egress by 5–10×, plus drops latency. Almost always the first thing to do once egress matters. Cloudflare's per-GB pricing is around $0.01/GB for the standard plans; CloudFront sits around $0.02–0.05/GB at most volumes.
Lifecycle policies on S3 / GCS.
Move data older than 30 days to Glacier / Coldline. From $0.023/GB-month to $0.004/GB-month, nearly 6× cheaper. The rule costs nothing to configure, though retrievals from the cold tier are billed, so apply it to data you rarely read.
Data compression.
Zstd compression on stored data is ~3× shrink and effectively free CPU. Applied to S3 logs, database backups, and Kafka topics it cuts storage and egress both. Most engineers don't think to enable it.
Right-sizing.
CloudWatch / Datadog will tell you which instances are running at 5% CPU. They should be smaller. Right-sizing alone routinely cuts cloud bills by 20–30% for teams that never look.
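
To see how the reserved and spot discounts combine on a real fleet, the blended compute cost can be sketched as below. This is a rough illustration under stated assumptions: the function `blended_compute` and the sample fleet are hypothetical, the 30% reserved discount is the 1-year figure quoted above, and burst capacity is priced on-demand unless a spot discount is passed in.

```python
HOURS_PER_MONTH = 730


def blended_compute(on_demand_rate, baseline_hosts, burst_host_hours,
                    reserved_discount=0.30, spot_discount=0.0):
    """Monthly compute cost with a reserved baseline and separately priced burst.

    Assumes at least one host-hour in total, otherwise the ratios are undefined.
    """
    baseline_hours = baseline_hosts * HOURS_PER_MONTH
    total_hours = baseline_hours + burst_host_hours
    reserved_cost = baseline_hours * on_demand_rate * (1 - reserved_discount)
    burst_cost = burst_host_hours * on_demand_rate * (1 - spot_discount)
    all_on_demand = total_hours * on_demand_rate
    total = reserved_cost + burst_cost
    return {
        "total": total,
        "all_on_demand": all_on_demand,
        "saving_pct": 100 * (1 - total / all_on_demand),
        "reserved_coverage_pct": 100 * baseline_hours / total_hours,
    }


# Hypothetical fleet: 10 always-on hosts at $0.20/hr plus 2,700 host-hours
# of autoscaled burst capacity left on-demand.
result = blended_compute(on_demand_rate=0.20, baseline_hosts=10,
                         burst_host_hours=2_700)
for key, value in result.items():
    print(f"{key}: {value:,.1f}")
```

In this made-up fleet the reservation covers 73% of compute hours, so the 30% discount works out to about a 22% saving on the whole compute line.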

The patterns that quietly blow it up

Cross-region traffic.
Replication to a DR region. ETL pipelines that pull from prod in one region and load in another. $0.02/GB adds up — replicating 10 TB/day across regions is $6k/month. Look at your ETL job DAG before signing the MSA.
NAT gateway.
$0.045/hr fixed plus $0.045/GB processed. A NAT gateway in front of a private subnet that pulls package updates can cost more than the instances behind it. Gateway VPC endpoints for S3 and DynamoDB, and interface endpoints for most other services, fix this.
Logs.
CloudWatch Logs at $0.50/GB ingested. A noisy service emitting 10 GB/day burns $150/month just on ingest. Move debug logs out of CloudWatch, or downsample.
Idle resources.
Unattached EBS volumes. Stopped EC2 instances that still pay for storage. Old snapshots. NAT gateways in a VPC nobody uses anymore. The cloud bill grows from neglect, not just usage.
Data transfer between AZs.
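Cross-AZ traffic in the same region is $0.01/GB each way. A multi-AZ Kafka cluster with 100 MB/s replication moves about 260 TB/month, roughly $2,600/month per AZ pair at $0.01/GB, and double that once both the sending and receiving sides are billed. Worth knowing; rarely worth optimising unless throughput is enormous.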
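
Three of the items above (cross-region transfer, NAT gateways, and log ingest) reduce to one-line calculations. The sketch below uses the rates quoted in this section and a 30-day month; the function names and the 5 TB/month NAT volume are hypothetical.

```python
DAYS_PER_MONTH = 30
HOURS_PER_MONTH = 730


def cross_region_cost(gb_per_day, rate_per_gb=0.02):
    """Monthly inter-region transfer cost in USD."""
    return gb_per_day * DAYS_PER_MONTH * rate_per_gb


def nat_gateway_cost(gb_processed, hourly=0.045, per_gb=0.045):
    """Monthly NAT gateway cost: fixed hourly charge plus per-GB processing."""
    return hourly * HOURS_PER_MONTH + gb_processed * per_gb


def log_ingest_cost(gb_per_day, rate_per_gb=0.50):
    """Monthly log ingest cost in USD."""
    return gb_per_day * DAYS_PER_MONTH * rate_per_gb


print(f"cross-region, 10 TB/day: ${cross_region_cost(10_000):,.0f}")
print(f"NAT gateway, 5 TB/month: ${nat_gateway_cost(5_000):,.2f}")
print(f"log ingest, 10 GB/day:   ${log_ingest_cost(10):,.0f}")
```

The first and third calls reproduce the $6k/month and $150/month figures from the text; the NAT gateway call, with its assumed 5 TB/month of traffic, comes to about $258/month, most of it the per-GB charge.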

What a defensible cost estimate looks like

Six lines that turn an architecture into a number anyone can sanity-check.

  1. Workload. "X req/s peak, Y average, Z GB stored, average response size N KB."
  2. Compute. "Pod count, instance type, monthly subtotal."
  3. Storage. "DB + cache + object storage, monthly subtotal."
  4. Egress. "TB/month → GB out → after CDN, monthly subtotal."
  5. Managed. "RDS, ElastiCache, queue, monitoring — monthly subtotal."
  6. Total + padding. "Sum × 1.2 = estimate. Range ±25%."
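
One way to keep estimates in this six-line shape is a small record type that renders them. This is a sketch under assumptions: the `Estimate` class, its field names, and the sample subtotals are hypothetical, and the padding and range defaults are the 20% and ±25% given above.

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    workload: str        # free-text workload description (line 1)
    compute: float       # monthly subtotals in USD (lines 2-5)
    storage: float
    egress: float
    managed: float
    padding: float = 0.20
    error: float = 0.25

    @property
    def total(self) -> float:
        """Sum of the four subtotals, padded."""
        base = self.compute + self.storage + self.egress + self.managed
        return base * (1 + self.padding)

    def lines(self) -> list[str]:
        """Render the six-line summary."""
        low = self.total * (1 - self.error)
        high = self.total * (1 + self.error)
        return [
            f"1. Workload: {self.workload}",
            f"2. Compute: ${self.compute:,.0f}/month",
            f"3. Storage: ${self.storage:,.0f}/month",
            f"4. Egress: ${self.egress:,.0f}/month",
            f"5. Managed: ${self.managed:,.0f}/month",
            f"6. Total + padding: ${self.total:,.0f}/month "
            f"(range ${low:,.0f} to ${high:,.0f})",
        ]


# Hypothetical subtotals, chosen only to show the output format.
est = Estimate(workload="5k req/s peak, 1k average, 500 GB stored, 8 KB responses",
               compute=1_000, storage=200, egress=2_000, managed=800)
print("\n".join(est.lines()))
```

Keeping the subtotals as separate fields means a reviewer can challenge one line without redoing the rest.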

Common mistakes

Counting one of everything.
One instance, one database, one load balancer. Real systems have N replicas, ≥2 AZs, primary + standby + read replicas, dev / staging / prod. Multiply early.
Forgetting egress.
The line item people don't think about, and the one that dominates at scale. Always estimate it. Put a CDN in front if it's significant.
Mixing on-demand and reserved without tracking which.
You'll discover six months in that your "saved" 40% only applies to the steady baseline, not the autoscaled fleet. Track the reserved coverage as a percentage of compute hours.
Pricing managed services as raw IaaS.
RDS isn't EC2 + EBS. DynamoDB isn't pay-per-instance. Look up the actual pricing model; don't approximate from compute.
Ignoring support tier.
Business Support starts at 10% of monthly spend and tapers as spend grows. Enterprise is 3–10%. Real money on a $1M/yr bill.
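
The "multiply early" advice from the first mistake can be made concrete with a short sketch. The function `multiplied_cost`, the replica and AZ counts, and the environment scale factors are all hypothetical; the unit price is one instance at $0.20/hr.

```python
HOURS_PER_MONTH = 730


def multiplied_cost(unit_monthly_usd, replicas_per_az, azs, env_scale):
    """Monthly cost per environment.

    env_scale maps an environment name to its size relative to prod (prod = 1.0).
    """
    prod_cost = unit_monthly_usd * replicas_per_az * azs
    return {env: prod_cost * scale for env, scale in env_scale.items()}


unit = 0.20 * HOURS_PER_MONTH  # one instance, about $146/month
costs = multiplied_cost(unit, replicas_per_az=2, azs=2,
                        env_scale={"prod": 1.0, "staging": 0.5, "dev": 0.25})
fleet_total = sum(costs.values())

for env, usd in costs.items():
    print(f"{env:>8}: ${usd:,.0f}")
print(f"   total: ${fleet_total:,.0f} ({fleet_total / unit:.0f}x one instance)")
```

Under these assumptions the "one instance" estimate is off by a factor of seven before any database or load balancer is counted.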

What to read next

Capacity planning · primer
The exercise that produces the inputs to this one. Capacity tells you how many of what; cost tells you what it bills.
Scaling out · primer
The horizontal-growth patterns and how each affects the cost line.
CDN — anatomy · guide
The single biggest egress reducer. Worth understanding the cache key, the TTL, and the invalidation cost.
Queueing theory · learn path
The math behind sizing for utilisation, which is what you bill against.