Cost engineering.

The cloud bill is the second-largest line item on most engineering orgs' P&L, behind salaries and ahead of office space. It also grows in ways that surprise people: a typo in an SDK config, an unbounded log loop, a cross-region replication you forgot you turned on. Cost engineering is the discipline of catching those before the quarterly review does. The habits are simple. Doing them consistently is the hard part.

1 · Where the money goes

A typical bill at a mid-sized SaaS, in rough order of size:

Line	Share	Watch for
Compute (EC2 / Fargate / Lambda)	30–50%	Idle instances, over-provisioned sizes, on-demand for steady-state
Database (RDS / Aurora / DynamoDB)	15–25%	Unreserved capacity, on-demand DynamoDB on predictable workloads
Data transfer (egress + cross-AZ + cross-region)	10–20%	The silent bill. CDN-able egress not behind a CDN; NAT-bound traffic that should hit a VPC endpoint
Storage (S3 + EBS + snapshots)	5–15%	Stale snapshots, S3 Standard data that should be on IA/Glacier, EBS volumes from terminated instances
Observability (logs + metrics + traces)	5–10%	Log volume spikes, high-cardinality custom metrics, full-sample traces in production
Managed services + add-ons (KMS, Secrets Manager, X-Ray, etc.)	2–5%	Rarely the problem; investigate only if a single service stands out

Run the same audit against your own bill. If any line's share of the total is around 2× its typical share, that's where to dig first.
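
A minimal sketch of that audit in Python, assuming "typical" means the midpoint of each range in the table above. The category keys, the `flag_outliers` helper, and the example bill are hypothetical and only illustrate the comparison.

```python
# Typical share ranges (percent of total bill), copied from the table above.
TYPICAL_SHARE = {
    "compute": (30, 50),
    "database": (15, 25),
    "data_transfer": (10, 20),
    "storage": (5, 15),
    "observability": (5, 10),
    "managed_services": (2, 5),
}


def flag_outliers(bill, factor=2.0):
    """Return (category, share_percent) for lines at or above factor x the typical midpoint."""
    total = sum(bill.values())
    flagged = []
    for category, spend in bill.items():
        low, high = TYPICAL_SHARE[category]
        midpoint = (low + high) / 2  # assumption: "typical" = middle of the range
        share = 100 * spend / total
        if share >= factor * midpoint:
            flagged.append((category, round(share, 1)))
    return flagged


# Hypothetical monthly bill in dollars.
example_bill = {
    "compute": 40_000,
    "database": 18_000,
    "data_transfer": 12_000,
    "storage": 6_000,
    "observability": 22_000,
    "managed_services": 2_000,
}

print(flag_outliers(example_bill))  # [('observability', 22.0)]
```

In this made-up bill, observability sits at 22% against a typical 5–10%, so it is the only line flagged.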

2 · The compute discipline

Reservations and Savings Plans. Steady-state workloads should be 60–80% covered by Savings Plans or Reserved Instances. Discount is 30–60% versus on-demand. Three-year all-upfront saves the most; one-year no-upfront is the safer commitment for a growing company.
Spot for stateless and batch. Same hardware, 70–90% cheaper. Vanishes with 2 minutes' notice. Perfect for batch jobs, CI runners, async workers, and stateless web tiers behind an ALB that can shed an instance gracefully.
Right-sizing. Average steady-state CPU below 30% on a fleet means you can drop one or two sizes. Compute Optimizer flags these. Most teams over-provision by 2–3× because the cost of slowness is more visible than the cost of waste.
Auto-shutdown on dev/staging. Anything not in production that runs 24/7 is pure waste. Cron-stop overnight; cron-start in the morning. Schedules cut dev/staging costs by ~70%.
Graviton (or Arm equivalents). ~20% cheaper for the same performance on most workloads. Migration is mostly recompiling; check your dependencies first.
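
The ~70% figure for auto-shutdown follows from simple arithmetic on running hours. The sketch below assumes a schedule of 10 hours a day on weekdays; the `schedule_savings` helper is hypothetical, and it counts compute hours only.

```python
HOURS_PER_WEEK = 7 * 24  # 168 hours of always-on running


def schedule_savings(hours_per_day, days_per_week):
    """Fraction of always-on compute hours avoided by running only on a schedule."""
    running_hours = hours_per_day * days_per_week
    return 1 - running_hours / HOURS_PER_WEEK


# Assumed schedule: start at 08:00, stop at 18:00, Monday to Friday.
saving = schedule_savings(hours_per_day=10, days_per_week=5)
print(f"{saving:.0%}")  # 70%
```

Stopped instances still pay for their attached EBS volumes, so the saving on the whole dev/staging bill is usually somewhat lower than the saving on compute hours.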

3 · The data-transfer surprises

Cloud providers charge very little to put data in, a moderate amount to move it around, and a surprising amount to take it out. Four traps:

Cross-AZ traffic. $0.01/GB each direction within a region, so effectively $0.02 for every GB that crosses an AZ boundary. A microservices mesh that doesn't AZ-pin its calls can run thousands of dollars a month in cross-AZ alone. Most service meshes (Istio, Linkerd) support topology-aware routing; turn it on.
NAT Gateway processing. $0.045/GB through a NAT Gateway. Calling S3 from a private subnet over NAT can cost more than the actual S3 calls. VPC Gateway Endpoints for S3 and DynamoDB are free; use them.
Cross-region replication. $0.02/GB. Aurora Global, DynamoDB Global Tables, S3 CRR — all line items. Replicate only what needs replicating; lifecycle-tier the cold stuff.
Internet egress. $0.05–$0.09/GB depending on region. A CDN in front of any external read traffic kills 80–95% of this. The CDN bill is smaller; the egress saving is larger.
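
To see how the four rates add up, here is a rough estimator using the per-GB prices quoted above. It is a sketch under stated assumptions: the traffic volumes are invented, the `monthly_transfer_cost` helper is hypothetical, egress uses the top of the quoted range, and the CDN's own bill is not modelled.

```python
# Per-GB rates quoted in the text; real prices vary by region and change over time.
CROSS_AZ_PER_GB_EACH_WAY = 0.01
NAT_PROCESSING_PER_GB = 0.045
CROSS_REGION_PER_GB = 0.02
INTERNET_EGRESS_PER_GB = 0.09  # upper end of the quoted $0.05-$0.09 range


def monthly_transfer_cost(cross_az_gb, nat_gb, cross_region_gb, egress_gb, cdn_offload=0.0):
    """Estimate monthly data-transfer spend in dollars, per category.

    cdn_offload is the fraction of external reads served by a CDN instead of the origin.
    """
    return {
        # Cross-AZ is billed on both the sending and the receiving side.
        "cross_az": round(cross_az_gb * CROSS_AZ_PER_GB_EACH_WAY * 2, 2),
        "nat": round(nat_gb * NAT_PROCESSING_PER_GB, 2),
        "cross_region": round(cross_region_gb * CROSS_REGION_PER_GB, 2),
        "egress": round(egress_gb * (1 - cdn_offload) * INTERNET_EGRESS_PER_GB, 2),
    }


# Hypothetical monthly volumes in GB.
without_cdn = monthly_transfer_cost(100_000, 50_000, 20_000, 30_000)
with_cdn = monthly_transfer_cost(100_000, 50_000, 20_000, 30_000, cdn_offload=0.9)

print(without_cdn)
print(with_cdn)
```

With these invented volumes, cross-AZ and NAT processing each cost more than cross-region replication, which matches the point that the internal traffic is the part people overlook.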

4 · The storage discipline

S3 lifecycle policies. Auto-tier objects from Standard → IA → Glacier as they age. Set them once per bucket; on a large, mostly cold bucket this can save five figures a year. The fix nobody regrets.
EBS snapshot lifecycle. Automated daily snapshots accumulate. AWS Data Lifecycle Manager prunes them on a retention schedule. Don't keep more than a quarter's worth of dailies (about 90) unless compliance requires it.
Orphaned EBS volumes. Terminated instances leave their attached volumes behind if you didn't mark them DeleteOnTermination. Audit monthly; the volumes are still being charged.
Unused snapshots from old AMIs. Build pipelines create new AMIs; old ones accumulate, and deregistering an AMI does not by default delete the snapshots behind it. Snapshot count is the right metric to alarm on.
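
The monthly orphaned-volume audit can be sketched as a filter over volume records. The example below assumes records shaped like the "Volumes" list in the EC2 DescribeVolumes response (for instance, as returned by boto3's `describe_volumes`); the `orphaned_volumes` helper, the sample data, and the price per GB-month are assumptions for illustration.

```python
# Assumed storage price in dollars per GiB-month; check current pricing
# for your region and volume type before relying on the estimate.
ASSUMED_PRICE_PER_GB_MONTH = 0.08


def orphaned_volumes(volumes):
    """Return (volume_id, size_gib, est_monthly_cost) for unattached volumes.

    A volume in the "available" state is not attached to any instance,
    but it is still billed for its provisioned size.
    """
    report = []
    for vol in volumes:
        if vol["State"] == "available" and not vol.get("Attachments"):
            cost = round(vol["Size"] * ASSUMED_PRICE_PER_GB_MONTH, 2)
            report.append((vol["VolumeId"], vol["Size"], cost))
    # Most expensive first, so the audit starts where the money is.
    return sorted(report, key=lambda row: row[2], reverse=True)


# Hypothetical records in the DescribeVolumes response shape.
sample = [
    {"VolumeId": "vol-aaa", "State": "in-use", "Size": 100,
     "Attachments": [{"InstanceId": "i-123"}]},
    {"VolumeId": "vol-bbb", "State": "available", "Size": 50, "Attachments": []},
    {"VolumeId": "vol-ccc", "State": "available", "Size": 500, "Attachments": []},
]

for volume_id, size_gib, cost in orphaned_volumes(sample):
    print(f"{volume_id}: {size_gib} GiB, about ${cost:.2f}/month")
```

Reporting rather than deleting is deliberate: an unattached volume may be a deliberate backup, so a human should confirm before anything is removed.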

5 · The FinOps practices

Tag everything. Cost-allocation tags by team, product, environment, owner. Without them, "why did the bill spike?" is unanswerable.
Per-team dashboards. Each team sees its slice of the bill. Visible ownership shifts behaviour faster than any centralised pressure.
Monthly review. 30 minutes. Cost Explorer's "biggest movers." Identify the top three increases, the top three decreases. Pattern-match against deploys and feature launches.
Anomaly alerts. AWS Cost Anomaly Detection, GCP Recommender, Azure Cost Alerts. Catch the bad-config-that-2x'd-the-bill within hours, not at month-end.
Reservation coverage as a metric. Reported alongside availability and latency. Below target, action item next sprint.
Cost in the design review. When a new service is proposed, "estimated monthly run cost" is one of the line items in the design doc. Catches expensive choices before they ship.
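
The "top three increases, top three decreases" step of the monthly review is easy to script once you have per-service totals for two months. This is a sketch, not Cost Explorer's own logic: the `biggest_movers` helper and both months of spend data are hypothetical.

```python
def biggest_movers(previous, current, top_n=3):
    """Return (increases, decreases): the top_n largest changes in each direction.

    previous and current map a service name to its monthly spend in dollars.
    A service missing from one month is treated as zero spend in that month.
    """
    services = set(previous) | set(current)
    deltas = {s: current.get(s, 0) - previous.get(s, 0) for s in services}
    increases = sorted(
        ((s, d) for s, d in deltas.items() if d > 0),
        key=lambda item: (-item[1], item[0]),
    )[:top_n]
    decreases = sorted(
        ((s, d) for s, d in deltas.items() if d < 0),
        key=lambda item: (item[1], item[0]),
    )[:top_n]
    return increases, decreases


# Hypothetical per-service spend for two consecutive months.
last_month = {"EC2": 40_000, "RDS": 18_000, "S3": 6_000, "CloudWatch": 7_000, "NAT": 3_000}
this_month = {"EC2": 38_000, "RDS": 18_500, "S3": 6_200, "CloudWatch": 15_000,
              "NAT": 2_500, "Lambda": 900}

ups, downs = biggest_movers(last_month, this_month)
print("Increases:", ups)
print("Decreases:", downs)
```

Here the CloudWatch jump of $8,000 is the line to pattern-match against recent deploys first.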

6 · Tooling worth knowing

Tool	What it does	Notes
AWS Cost Explorer	Built-in. Time-series of spend by service, tag, account.	Free. Use it first.
AWS Compute Optimizer	Right-sizing recommendations.	Free. Catches obvious over-provisioning.
AWS Cost Anomaly Detection	Alerts on unusual spend movement.	Free. Set thresholds per team.
Vantage / CloudZero / Cloudability	Multi-account / multi-cloud cost analytics.	Paid. Sweet spot for orgs with 10+ accounts.
Spotinst (now Spot.io)	Spot orchestration with fallback to on-demand.	Paid. Saves the most when Spot is hard to manage in-house.
kubecost	K8s-aware cost attribution by namespace/deployment.	OSS + paid. Close to essential if you run a busy K8s cluster.
FinOut	Custom unit-economics dashboards (cost per customer, per request).	Paid. Useful for showing engineering work in revenue terms.