Cloud Codex / 08

Cost engineering.

The cloud bill is the second-largest line item on most engineering orgs' P&L, behind salaries and ahead of office space. It also grows in ways that surprise people: a typo in an SDK config, an unbounded log loop, a cross-region replication you forgot you had turned on. Cost engineering is the discipline of catching those before the quarterly review does. The habits are simple. Doing them consistently is the hard part.


1 · Where the money goes

A typical bill at a mid-sized SaaS, in rough order of size:

| Line | Share | Watch for |
| --- | --- | --- |
| Compute (EC2 / Fargate / Lambda) | 30–50% | Idle instances, over-provisioned sizes, on-demand for steady-state |
| Database (RDS / Aurora / DynamoDB) | 15–25% | Unreserved capacity, on-demand DynamoDB on predictable workloads |
| Data transfer (egress + cross-AZ + cross-region) | 10–20% | The silent bill. CDN-able egress not behind a CDN; NAT-bound traffic that should hit a VPC endpoint |
| Storage (S3 + EBS + snapshots) | 5–15% | Stale snapshots, S3 Standard data that should be on IA/Glacier, EBS volumes from terminated instances |
| Observability (logs + metrics + traces) | 5–10% | Log volume spikes, high-cardinality custom metrics, full-sample traces in production |
| Managed services + add-ons (KMS, Secrets Manager, X-Ray, etc.) | 2–5% | Rarely the problem; mention if a single service stands out |

Run the same audit against your own bill. If any line's share is twice the typical figure or more, that's where to dig first.
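
The audit can be sketched in a few lines of Python. The share ranges come from the table above; the bill figures and the names `TYPICAL_SHARE` and `flag_outliers` are made up for illustration, and "typical" is taken here to mean the midpoint of each range, which is an assumption rather than a rule.

```python
# Typical share of the bill per line, as (low, high) fractions from the table above.
TYPICAL_SHARE = {
    "compute": (0.30, 0.50),
    "database": (0.15, 0.25),
    "data_transfer": (0.10, 0.20),
    "storage": (0.05, 0.15),
    "observability": (0.05, 0.10),
    "managed_services": (0.02, 0.05),
}


def flag_outliers(bill, factor=2.0):
    """Return lines whose share is at least `factor` times the typical midpoint."""
    total = sum(bill.values())
    flagged = {}
    for line, cost in bill.items():
        low, high = TYPICAL_SHARE[line]
        midpoint = (low + high) / 2  # assumption: "typical" = middle of the range
        share = cost / total
        if share >= factor * midpoint:
            flagged[line] = round(share, 3)
    return flagged


# Hypothetical monthly bill in USD (total: 100,000).
example_bill = {
    "compute": 38_000,
    "database": 15_000,
    "data_transfer": 9_000,
    "storage": 6_000,
    "observability": 30_000,
    "managed_services": 2_000,
}

print(flag_outliers(example_bill))  # {'observability': 0.3}
```

In this made-up bill, observability takes 30% of spend against a typical 5–10%, so it is the only line flagged.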

2 · The compute discipline

  • Reservations and Savings Plans. Steady-state workloads should be 60–80% covered by Savings Plans or Reserved Instances. Discount is 30–60% versus on-demand. Three-year all-upfront saves the most; one-year no-upfront is the safer commitment for a growing company.
  • Spot for stateless and batch. Same hardware, 70–90% cheaper. Vanishes with 2 minutes' notice. Perfect for batch jobs, CI runners, async workers, and stateless web tiers behind an ALB that can shed an instance gracefully.
  • Right-sizing. Average steady-state CPU below 30% on a fleet means you can drop one or two sizes. Compute Optimizer flags these. Most teams over-provision by 2–3× because the cost of slowness is more visible than the cost of waste.
  • Auto-shutdown on dev/staging. Anything not in production that runs 24/7 is pure waste. Cron-stop overnight; cron-start in the morning. Schedules cut dev/staging costs by ~70%.
  • Graviton (or Arm equivalents). ~20% cheaper for the same performance on most workloads. Migration is mostly recompiling; check your dependencies first.
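
Two of the numbers above are easy to check by hand. The sketch below works through the schedule saving and the effect of commitment coverage on the average hourly rate. The function names, the 10-hours-on-weekdays schedule, and the $1.00/hour rate are illustrative assumptions, not figures from any price list.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def schedule_saving(hours_on_per_day, days_on_per_week):
    """Fraction of an always-on compute bill saved by running only on a schedule."""
    hours_on = hours_on_per_day * days_on_per_week
    return 1 - hours_on / HOURS_PER_WEEK


def blended_hourly_cost(on_demand_rate, coverage, discount):
    """Average hourly rate when `coverage` of usage gets `discount` off on-demand."""
    committed = coverage * on_demand_rate * (1 - discount)
    uncovered = (1 - coverage) * on_demand_rate
    return committed + uncovered


# Dev/staging running 10 hours a day, weekdays only: 50 of 168 hours.
print(f"{schedule_saving(10, 5):.0%}")  # 70%

# Hypothetical $1.00/hour on-demand rate, 70% coverage, 40% discount.
print(f"${blended_hourly_cost(1.00, 0.70, 0.40):.2f}/hour")  # $0.72/hour
```

The schedule saving applies to instance hours only; attached EBS volumes keep billing while an instance is stopped.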

3 · The data-transfer surprises

Cloud providers charge very little to put data in, a moderate amount to move it around, and a surprising amount to take it out. Four traps:

  • Cross-AZ traffic. $0.01/GB in each direction within a region. A microservices mesh that doesn't keep its calls inside one AZ can run up thousands of dollars a month in cross-AZ charges alone. Most service meshes (Istio, Linkerd) support topology-aware routing; turn it on.
  • NAT Gateway processing. $0.045/GB through a NAT Gateway. Calling S3 from a private subnet over NAT can cost more than the actual S3 calls. VPC Gateway Endpoints for S3 and DynamoDB are free; use them.
  • Cross-region replication. $0.02/GB. Aurora Global, DynamoDB Global Tables, S3 CRR — all line items. Replicate only what needs replicating; lifecycle-tier the cold stuff.
  • Internet egress. $0.05–$0.09/GB depending on region. A CDN in front of any external read traffic kills 80–95% of this. The CDN bill is smaller; the egress saving is larger.
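
To see how these rates add up, here is a rough estimator using the per-GB figures quoted in the list above. The traffic volumes and the `monthly_transfer_cost` helper are hypothetical, real rates vary by region and tier, and the CDN's own bill is deliberately left out.

```python
# Per-GB rates quoted in the list above (USD); actual rates vary by region.
CROSS_AZ_PER_GB_EACH_WAY = 0.01
NAT_PROCESSING_PER_GB = 0.045
CROSS_REGION_PER_GB = 0.02
INTERNET_EGRESS_PER_GB = 0.09  # upper end of the quoted range


def monthly_transfer_cost(cross_az_gb, nat_gb, cross_region_gb, egress_gb,
                          cdn_offload=0.0):
    """Estimate monthly data-transfer charges in USD from GB moved on each path.

    cdn_offload is the fraction of internet egress served by a CDN instead of
    the origin. The CDN's own charges are not included.
    """
    return {
        # Cross-AZ traffic is charged in each direction, hence the factor of 2.
        "cross_az": cross_az_gb * CROSS_AZ_PER_GB_EACH_WAY * 2,
        "nat": nat_gb * NAT_PROCESSING_PER_GB,
        "cross_region": cross_region_gb * CROSS_REGION_PER_GB,
        "egress": egress_gb * (1 - cdn_offload) * INTERNET_EGRESS_PER_GB,
    }


# Hypothetical month: 100 TB cross-AZ, 20 TB via NAT, 10 TB replicated, 50 TB egress.
without_cdn = monthly_transfer_cost(100_000, 20_000, 10_000, 50_000)
with_cdn = monthly_transfer_cost(100_000, 20_000, 10_000, 50_000, cdn_offload=0.9)

for path, cost in without_cdn.items():
    print(f"{path:13s} ${cost:>9,.2f}  (with CDN: ${with_cdn[path]:,.2f})")
```

Moving the NAT-bound S3 traffic to a Gateway Endpoint would take the `nat` line toward zero, since Gateway Endpoints carry no per-GB charge.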

4 · The storage discipline

  • S3 lifecycle policies. Auto-tier objects from Standard → IA → Glacier as they age. Set them once per bucket; that can save five figures a year on a moderately full bucket. The fix nobody regrets.
  • EBS snapshot lifecycle. Automated daily snapshots accumulate. AWS Data Lifecycle Manager prunes them on a retention schedule. Don't keep more than a quarter's worth of dailies (about 90) unless compliance requires it.
  • Orphaned EBS volumes. Terminated instances leave their attached volumes behind unless DeleteOnTermination was set on them. Audit monthly; you are still being charged for those volumes.
  • Unused snapshots from old AMIs. Build pipelines create new AMIs; old ones accumulate. Snapshot count is the right metric to alarm on.
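
A minimal sketch of two of these checks, kept free of AWS calls so it runs anywhere. The 30- and 90-day thresholds, the rule ID, the helper names, and the sample volume IDs are all illustrative assumptions. The dictionary follows the shape that boto3's `put_bucket_lifecycle_configuration` accepts, and the volume records follow the shape of the `Volumes` entries returned by EC2 `describe_volumes`.

```python
def lifecycle_configuration(ia_after_days=30, glacier_after_days=90):
    """Build an S3 lifecycle configuration that tiers every object by age."""
    if glacier_after_days <= ia_after_days:
        raise ValueError("Glacier transition must come after the IA transition")
    return {
        "Rules": [
            {
                "ID": "tier-by-age",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix: the whole bucket
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


def unattached_volumes(volumes):
    """Return IDs of volumes in the 'available' state, i.e. attached to nothing."""
    return [v["VolumeId"] for v in volumes if v["State"] == "available"]


# Made-up records in the shape of describe_volumes()["Volumes"].
sample_volumes = [
    {"VolumeId": "vol-0example1", "State": "in-use"},
    {"VolumeId": "vol-0example2", "State": "available"},
]

print(lifecycle_configuration())
print(unattached_volumes(sample_volumes))  # ['vol-0example2']
```

To apply the policy, the dictionary would be passed as `LifecycleConfiguration` to `put_bucket_lifecycle_configuration` on a boto3 S3 client, along with the bucket name. An "available" volume is not necessarily safe to delete; treat the list as a review queue.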

5 · The FinOps practices

  • Tag everything. Cost-allocation tags by team, product, environment, owner. Without them, "why did the bill spike?" is unanswerable.
  • Per-team dashboards. Each team sees its slice of the bill. Visible ownership shifts behaviour faster than any centralised pressure.
  • Monthly review. 30 minutes. Cost Explorer's "biggest movers." Identify the top three increases, the top three decreases. Pattern-match against deploys and feature launches.
  • Anomaly alerts. AWS Cost Anomaly Detection, GCP Recommender, Azure Cost Alerts. Catch the bad-config-that-2x'd-the-bill within hours, not at month-end.
  • Reservation coverage as a metric. Reported alongside availability and latency. Below target, action item next sprint.
  • Cost in the design review. When a new service is proposed, "estimated monthly run cost" is one of the line items in the design doc. Catches expensive choices before they ship.
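
The "biggest movers" step of the monthly review is a simple diff between two months of tagged spend. A sketch, assuming spend has already been exported as a mapping from team tag to monthly cost; the team names, the figures, and the `biggest_movers` helper are invented for illustration.

```python
def biggest_movers(previous, current, top_n=3):
    """Return (increases, decreases): the top_n largest month-over-month changes."""
    keys = set(previous) | set(current)
    # A tag missing from one month counts as zero spend in that month.
    deltas = {k: current.get(k, 0) - previous.get(k, 0) for k in keys}
    increases = sorted(
        ((k, d) for k, d in deltas.items() if d > 0),
        key=lambda kd: (-kd[1], kd[0]),  # largest rise first, name breaks ties
    )[:top_n]
    decreases = sorted(
        ((k, d) for k, d in deltas.items() if d < 0),
        key=lambda kd: (kd[1], kd[0]),  # largest drop first, name breaks ties
    )[:top_n]
    return increases, decreases


# Hypothetical spend by team tag, in USD.
last_month = {"checkout": 12_000, "search": 8_000, "platform": 20_000,
              "ml": 5_000, "data": 9_000}
this_month = {"checkout": 12_500, "search": 16_000, "platform": 18_000,
              "ml": 9_000, "data": 8_500, "growth": 1_200}

ups, downs = biggest_movers(last_month, this_month)
print(ups)    # [('search', 8000), ('ml', 4000), ('growth', 1200)]
print(downs)  # [('platform', -2000), ('data', -500)]
```

Untagged spend never shows up in a diff like this, which is one more reason tagging comes first in the list.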

6 · Tooling worth knowing

| Tool | What it does | Notes |
| --- | --- | --- |
| AWS Cost Explorer | Built-in. Time-series of spend by service, tag, account. | Free. Use it first. |
| AWS Compute Optimizer | Right-sizing recommendations. | Free. Catches obvious over-provisioning. |
| AWS Cost Anomaly Detection | Alerts on unusual spend movement. | Free. Set thresholds per team. |
| Vantage / CloudZero / Cloudability | Multi-account / multi-cloud cost analytics. | Paid. Sweet spot for orgs with 10+ accounts. |
| Spotinst (now Spot.io) | Spot orchestration with fallback to on-demand. | Paid. Saves the most when Spot is hard to manage in-house. |
| kubecost | K8s-aware cost attribution by namespace/deployment. | OSS + paid. Required if you run a busy K8s cluster. |
| FinOut | Custom unit-economics dashboards (cost per customer, per request). | Paid. Useful for showing engineering work in revenue terms. |

Further reading

  • "Cloud FinOps" (O'Reilly). The book the FinOps Foundation grew out of. Cultural and operational rather than tooling-heavy.
  • FinOps Foundation framework. Free. The capability model + principles most teams reference.
  • Corey Quinn's "Last Week in AWS" newsletter. Weekly snark and signal on cloud cost moves. Surprisingly educational.
  • "The Cost of Cloud, a Trillion Dollar Paradox" — Sarah Wang, Martin Casado (a16z). The case for repatriation at scale. Read it once; argue with it constructively.
  • Adjacent: Compute. The largest line item, broken down.
  • Adjacent: Networking. The cross-AZ and NAT traps in detail.