Cloud Codex / 08
Cost engineering.
The cloud bill is the second-largest line item on most engineering orgs' P&L, behind salaries and ahead of office space. It also grows in ways that surprise people: a typo in an SDK config, an unbounded log loop, a cross-region replication you forgot you turned on. Cost engineering is the discipline of catching those before the quarterly review does. The habits are simple. Doing them consistently is the hard part.
1 · Where the money goes
A typical bill at a mid-sized SaaS, in rough order of size:
| Line | Share | Watch for |
|---|---|---|
| Compute (EC2 / Fargate / Lambda) | 30–50% | Idle instances, over-provisioned sizes, on-demand for steady-state |
| Database (RDS / Aurora / DynamoDB) | 15–25% | Unreserved capacity, on-demand DynamoDB on predictable workloads |
| Data transfer (egress + cross-AZ + cross-region) | 10–20% | The silent bill. CDN-able egress not behind a CDN; NAT-bound traffic that should hit a VPC endpoint |
| Storage (S3 + EBS + snapshots) | 5–15% | Stale snapshots, S3 Standard data that should be on IA/Glacier, EBS volumes from terminated instances |
| Observability (logs + metrics + traces) | 5–10% | Log volume spikes, high-cardinality custom metrics, full-sample traces in production |
| Managed services + add-ons (KMS, Secrets Manager, X-Ray, etc.) | 2–5% | Rarely the problem; investigate only if a single service stands out |
Run the same audit against your own bill. If any line's share is more than twice the typical figure, that's where to dig first.
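A minimal sketch of that audit, assuming "typical" means the midpoint of each range in the table. The function name `flag_outliers`, the category keys, and the dollar figures in `example_bill` are all invented for illustration; substitute the totals from your own billing export.

```python
# Typical share ranges (percent of total bill), copied from the table above.
TYPICAL_SHARE = {
    "compute": (30, 50),
    "database": (15, 25),
    "data_transfer": (10, 20),
    "storage": (5, 15),
    "observability": (5, 10),
    "managed_services": (2, 5),
}


def flag_outliers(bill, factor=2.0):
    """Return (line, share_pct) pairs whose share exceeds factor x the typical midpoint.

    Assumption: "typical" is the midpoint of the range; the source does not
    define it more precisely.
    """
    total = sum(bill.values())
    flagged = []
    for line, spend in bill.items():
        low, high = TYPICAL_SHARE[line]
        midpoint = (low + high) / 2
        share = 100 * spend / total
        if share > factor * midpoint:
            flagged.append((line, round(share, 1)))
    # Largest share first, so the worst offender is at the top.
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical monthly bill in dollars, invented for illustration.
example_bill = {
    "compute": 40_000,
    "database": 18_000,
    "data_transfer": 12_000,
    "storage": 8_000,
    "observability": 20_000,
    "managed_services": 2_000,
}

print(flag_outliers(example_bill))  # [('observability', 20.0)]
```

In this made-up bill, observability sits at 20% of spend against a typical midpoint of 7.5%, so it is the one line flagged.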
2 · The compute discipline
- Reservations and Savings Plans. Steady-state workloads should be 60–80% covered by Savings Plans or Reserved Instances. Discount is 30–60% versus on-demand. Three-year all-upfront saves the most; one-year no-upfront is the safer commitment for a growing company.
- Spot for stateless and batch. Same hardware, 70–90% cheaper. Vanishes with 2 minutes' notice. Perfect for batch jobs, CI runners, async workers, and stateless web tiers behind an ALB that can shed an instance gracefully.
- Right-sizing. Average steady-state CPU below 30% on a fleet means you can drop one or two sizes. Compute Optimizer flags these. Most teams over-provision by 2–3× because the cost of slowness is more visible than the cost of waste.
- Auto-shutdown on dev/staging. Anything not in production that runs 24/7 is pure waste. Cron-stop overnight; cron-start in the morning. Schedules cut dev/staging costs by ~70%.
- Graviton (or Arm equivalents). ~20% cheaper for the same performance on most workloads. Migration is mostly recompiling; check your dependencies first.
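The arithmetic behind two of the bullets above can be sketched as follows. The function names and the example inputs (a 10-hour weekday schedule, 70% coverage at a 40% discount) are assumptions chosen for illustration, not figures from any provider's price list.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def schedule_savings(hours_per_day, days_per_week):
    """Fraction of always-on compute hours avoided by running only on a schedule.

    Covers instance-hour charges only; attached storage is still billed
    while an instance is stopped.
    """
    running_hours = hours_per_day * days_per_week
    return 1 - running_hours / HOURS_PER_WEEK


def blended_rate(on_demand_rate, coverage, discount):
    """Effective hourly rate when `coverage` of usage gets `discount` off on-demand."""
    return on_demand_rate * (1 - coverage * discount)


# Dev environment up 10 hours a day, Monday to Friday.
print(f"{schedule_savings(10, 5):.0%}")  # 70%

# 70% of steady-state usage covered at a 40% discount, on-demand rate normalised to 1.0.
print(round(blended_rate(1.0, 0.7, 0.4), 2))  # 0.72
```

A 12-hour weekday schedule gives about 64% by the same formula, so the ~70% figure assumes a fairly tight working window.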
3 · The data-transfer surprises
Cloud providers charge very little to put data in, a moderate amount to move it around, and a surprising amount to take it out. Four traps:
- Cross-AZ traffic. $0.01/GB each direction within a region. A microservices mesh that doesn't AZ-pin its calls can run up thousands of dollars a month in cross-AZ charges alone. Most service meshes (Istio, Linkerd) support topology-aware routing; turn it on.
- NAT Gateway processing. $0.045/GB through a NAT Gateway. Calling S3 from a private subnet over NAT can cost more than the actual S3 calls. VPC Gateway Endpoints for S3 and DynamoDB are free; use them.
- Cross-region replication. $0.02/GB. Aurora Global, DynamoDB Global Tables, S3 CRR — all line items. Replicate only what needs replicating; lifecycle-tier the cold stuff.
- Internet egress. $0.05–$0.09/GB depending on region. A CDN in front of any external read traffic kills 80–95% of this. The CDN bill is smaller; the egress saving is larger.
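To get a feel for the magnitudes, here is a rough estimator using the per-GB rates quoted in the list above. The path names, the function `monthly_transfer_cost`, and the traffic volumes are hypothetical; the internet egress rate is set to the top of the quoted range as an assumption, and real rates vary by region and volume tier.

```python
# Dollars per GB, taken from the list above.
RATE_PER_GB = {
    "cross_az": 0.01 * 2,   # charged in each direction, so a GB moved costs 2 x $0.01
    "nat_gateway": 0.045,   # NAT Gateway processing charge
    "cross_region": 0.02,
    "internet_egress": 0.09,  # assumption: top of the $0.05-$0.09 range
}


def monthly_transfer_cost(gb_by_path):
    """Map each traffic path to its estimated monthly cost in dollars."""
    return {
        path: round(gb * RATE_PER_GB[path], 2)
        for path, gb in gb_by_path.items()
    }


# Hypothetical month: 50 TB of chatter between AZs, 20 TB of S3 reads via NAT.
costs = monthly_transfer_cost({"cross_az": 50_000, "nat_gateway": 20_000})
print(costs)  # {'cross_az': 1000.0, 'nat_gateway': 900.0}
```

In this example the $900 NAT line is the amount a gateway endpoint would remove, since S3 traffic routed through the endpoint no longer passes through the NAT Gateway.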
4 · The storage discipline
- S3 lifecycle policies. Auto-tier objects from Standard → IA → Glacier as they age. Set them once per bucket; saves five figures a year on a moderately full bucket. The fix nobody regrets.
- EBS snapshot lifecycle. Automated daily snapshots accumulate. AWS Data Lifecycle Manager prunes them on a retention schedule. Don't keep more than a quarter's worth of dailies (about 90) unless compliance requires it.
- Orphaned EBS volumes. Terminated instances leave their attached volumes behind if you didn't mark them DeleteOnTermination. Audit monthly; you are still being billed for them.
- Unused snapshots from old AMIs. Build pipelines create new AMIs; old ones accumulate. Snapshot count is the right metric to alarm on.
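Two of the items above can be sketched in a few lines. The helpers `s3_tiering_rule` and `orphaned_volumes`, the rule ID, the 30/90-day thresholds, and the sample volume IDs are all made up for illustration. The dictionaries follow the shapes used by the S3 lifecycle configuration and by the EC2 DescribeVolumes response, but the code itself makes no AWS calls.

```python
def s3_tiering_rule(ia_after_days=30, glacier_after_days=90, prefix=""):
    """Build one lifecycle rule: Standard -> Standard-IA -> Glacier as objects age.

    The day thresholds are illustrative defaults, not a recommendation.
    """
    return {
        "ID": "tier-by-age",  # hypothetical rule name
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},  # empty prefix applies to the whole bucket
        "Transitions": [
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
        ],
    }


def orphaned_volumes(volumes):
    """Return IDs of volumes in the 'available' state, i.e. attached to nothing."""
    return [v["VolumeId"] for v in volumes if v.get("State") == "available"]


# Hypothetical records, shaped like entries in a DescribeVolumes "Volumes" list.
sample_volumes = [
    {"VolumeId": "vol-0aaa", "State": "available", "Attachments": []},
    {"VolumeId": "vol-0bbb", "State": "in-use",
     "Attachments": [{"InstanceId": "i-0123"}]},
]

print(orphaned_volumes(sample_volumes))  # ['vol-0aaa']
print(s3_tiering_rule()["Transitions"][0])
```

With boto3, the rule would typically be passed as `LifecycleConfiguration={"Rules": [rule]}` to `put_bucket_lifecycle_configuration`, and the volume list would come from `describe_volumes`; both are left out here so the sketch runs without credentials.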
5 · The FinOps practices
- Tag everything. Cost-allocation tags by team, product, environment, owner. Without them, "why did the bill spike?" is unanswerable.
- Per-team dashboards. Each team sees its slice of the bill. Visible ownership shifts behaviour faster than any centralised pressure.
- Monthly review. 30 minutes. Cost Explorer's "biggest movers." Identify the top three increases, the top three decreases. Pattern-match against deploys and feature launches.
- Anomaly alerts. AWS Cost Anomaly Detection, GCP Recommender, Azure Cost Alerts. Catch the bad-config-that-2x'd-the-bill within hours, not at month-end.
- Reservation coverage as a metric. Reported alongside availability and latency. Below target, action item next sprint.
- Cost in the design review. When a new service is proposed, "estimated monthly run cost" is one of the line items in the design doc. Catches expensive choices before they ship.
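The "biggest movers" step of the monthly review is simple enough to reproduce from any billing export. This is a sketch, not how Cost Explorer computes it; the function name `biggest_movers` and the spend figures are invented for illustration.

```python
def biggest_movers(previous, current, top_n=3):
    """Return (top increases, top decreases) in spend between two months.

    Each result is a list of (service, delta_dollars) pairs. Services that
    appear in only one month are treated as zero spend in the other.
    """
    services = set(previous) | set(current)
    deltas = {s: current.get(s, 0) - previous.get(s, 0) for s in services}
    # Sort by size of change, then by name so ties are deterministic.
    increases = sorted(
        (item for item in deltas.items() if item[1] > 0),
        key=lambda item: (-item[1], item[0]),
    )[:top_n]
    decreases = sorted(
        (item for item in deltas.items() if item[1] < 0),
        key=lambda item: (item[1], item[0]),
    )[:top_n]
    return increases, decreases


# Hypothetical spend by service, in dollars.
last_month = {"EC2": 40_000, "RDS": 18_000, "S3": 8_000, "CloudWatch": 6_000, "NAT": 3_000}
this_month = {"EC2": 43_000, "RDS": 17_000, "S3": 8_000, "CloudWatch": 12_000,
              "NAT": 2_500, "Lambda": 500}

ups, downs = biggest_movers(last_month, this_month)
print(ups)    # [('CloudWatch', 6000), ('EC2', 3000), ('Lambda', 500)]
print(downs)  # [('RDS', -1000), ('NAT', -500)]
```

Grouping by cost-allocation tag instead of service gives the per-team view described above, provided the tags exist.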
6 · Tooling worth knowing
| Tool | What it does | Notes |
|---|---|---|
| AWS Cost Explorer | Built-in. Time-series of spend by service, tag, account. | Free. Use it first. |
| AWS Compute Optimizer | Right-sizing recommendations. | Free. Catches obvious over-provisioning. |
| AWS Cost Anomaly Detection | Alerts on unusual spend movement. | Free. Set thresholds per team. |
| Vantage / CloudZero / Cloudability | Multi-account / multi-cloud cost analytics. | Paid. Sweet spot for orgs with 10+ accounts. |
| Spotinst (now Spot.io) | Spot orchestration with fallback to on-demand. | Paid. Saves the most when Spot is hard to manage in-house. |
| Kubecost | K8s-aware cost attribution by namespace/deployment. | OSS + paid. Required if you run a busy K8s cluster. |
| FinOut | Custom unit-economics dashboards (cost per customer, per request). | Paid. Useful for showing engineering work in revenue terms. |
Further reading
- "Cloud FinOps" (O'Reilly). The book the FinOps Foundation grew out of. Cultural and operational rather than tooling-heavy.
- FinOps Foundation framework. Free. The capability model + principles most teams reference.
- Corey Quinn's "Last Week in AWS" newsletter. Weekly snark and signal on cloud cost moves. Surprisingly educational.
- "The Cost of Cloud, a Trillion Dollar Paradox" — Sarah Wang, Martin Casado (a16z). The case for repatriation at scale. Read it once; argue with it constructively.
- Adjacent: Compute. The largest line item, broken down.
- Adjacent: Networking. The cross-AZ and NAT traps in detail.