Observability.
Three pillars, and they only work together. Logs tell you what happened. Metrics tell you how often and how bad. Traces tell you which service did the bad thing. The cloud-native version of "observability" is shipping all three from every service to a central place, with enough structure that you can ask a question you didn't anticipate when you wrote the code. The hard parts are cost and discipline, not technology.
1 · The three pillars
- Logs. Structured event records. "Request X returned 500 because Y." Best as JSON; correlatable by trace ID. Highest-cardinality of the three.
- Metrics. Numeric time-series. Request rate, error rate, latency percentiles, queue depth. Cheap to store, fast to query, the foundation of dashboards and alerts.
- Traces. Per-request causal chains across services. "This 800ms request spent 600ms in the DB and 200ms in cache lookup." OpenTelemetry is the open standard; almost everyone speaks it now.
One bonus pillar: profiling (continuous CPU/heap sampling). Pyroscope, Datadog Profiler, AWS CodeGuru. Catches the hot function nobody noticed during code review.
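To make the logs bullet concrete, here is a minimal sketch of a structured JSON log event that carries a trace ID, using only the Python standard library. The function name `make_log_record` and the field names (`ts`, `level`, `trace_id`, `route`, and so on) are illustrative assumptions, not a standard schema, and the trace ID is generated locally as a stand-in for one propagated by a tracing library.

```python
import json
import time
import uuid


def make_log_record(level, message, trace_id, **fields):
    """Build one structured log event as a JSON string (hypothetical schema)."""
    record = {
        "ts": time.time(),      # epoch seconds
        "level": level,
        "message": message,
        "trace_id": trace_id,   # the key used to join this log line to a trace
    }
    # High-cardinality context (user IDs, request IDs) is fine on logs.
    record.update(fields)
    return json.dumps(record, sort_keys=True)


# Stand-in for an ID that a tracing library would normally propagate.
trace_id = uuid.uuid4().hex

line = make_log_record(
    "error",
    "request failed",
    trace_id,
    status=500,
    route="/checkout",
    user_id="u-12345",
)
print(line)
```

Because every line is a single JSON object with a `trace_id` field, a log store can filter by that field and pull up every event for one request, regardless of which service emitted it.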
2 · The AWS canonical version
| Pillar | AWS service | Notes |
|---|---|---|
| Logs | CloudWatch Logs | Default destination for most AWS services. Pay per ingest GB plus per-GB-month storage. Logs Insights for ad hoc queries in its own pipe-style query language. |
| Metrics | CloudWatch Metrics | Built-in for every AWS service. Custom metrics via PutMetricData or EMF (embedded metric format in your logs). |
| Traces | X-Ray | AWS-native distributed tracing. Adequate; not as nice as Datadog/Honeycomb but tightly integrated. |
| OpenTelemetry collector | ADOT (AWS Distro for OpenTelemetry) | If you want to ship traces/metrics to multiple destinations (X-Ray + Datadog), ADOT is the conduit. |
| Dashboards | CloudWatch Dashboards / Amazon Managed Grafana (AMG) | CW is fine for AWS-native dashboards. Grafana wins when you mix sources. |
| Alerting | CloudWatch Alarms + SNS / PagerDuty | The on-call paging chain. |
| Synthetic checks | CloudWatch Synthetics | Browser/canary scripts that ping your endpoints. The first signal something's broken. |
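The EMF route in the Metrics row can be sketched as follows: the service writes a JSON log line with an `_aws` metadata block, and CloudWatch extracts the metric from it. This is a minimal sketch built by hand rather than with an AWS library; the helper name `emf_record`, the namespace `Shop/Checkout`, and the dimension values are assumptions for illustration.

```python
import json
import time


def emf_record(namespace, dimensions, metric_name, value, unit="Milliseconds"):
    """Return one log line in CloudWatch embedded metric format (EMF)."""
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # epoch milliseconds
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    # One dimension set, listing the dimension *names*.
                    "Dimensions": [sorted(dimensions)],
                    "Metrics": [{"Name": metric_name, "Unit": unit}],
                }
            ],
        },
        # The metric value lives at the root, keyed by the metric name.
        metric_name: value,
    }
    # Dimension values also live at the root, keyed by dimension name.
    record.update(dimensions)
    return json.dumps(record)


# Hypothetical service and route; keep dimensions low-cardinality.
emf_line = emf_record(
    "Shop/Checkout",
    {"Service": "checkout", "Route": "/pay"},
    "Latency",
    123.0,
)
print(emf_line)
```

The appeal of this design is that the metric and the log context travel in the same record, so there is no separate PutMetricData call on the request path.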
3 · GCP, Azure, and third-party
| Pillar | AWS | GCP | Azure | Vendor-neutral |
|---|---|---|---|---|
| Logs | CloudWatch Logs | Cloud Logging | Log Analytics (Azure Monitor) | Loki / Elastic / Datadog / Splunk |
| Metrics | CloudWatch Metrics | Cloud Monitoring | Azure Monitor Metrics | Prometheus / Datadog / Mimir |
| Traces | X-Ray | Cloud Trace | Application Insights | Jaeger / Tempo / Datadog / Honeycomb |
| Dashboards | CloudWatch / AMG | Cloud Monitoring / Looker Studio | Azure Workbooks | Grafana (everywhere) |
| SLO platform | — | — | — | Nobl9 / Datadog SLOs / Honeycomb SLOs |
4 · The SLO mental model
Observability without targets is just expensive plumbing. The SLO model gives you targets:
- SLI — Service Level Indicator. A measurable signal. Request success rate. P99 latency. Time to first byte.
- SLO — Service Level Objective. A target on an SLI. "99.95% of requests succeed over a rolling 30-day window."
- Error budget. The complement of the SLO. 99.95% leaves 0.05% of requests (or, for a time-based SLI, of minutes) allowed to fail: about 21.6 minutes in a 30-day window. While budget remains, you can ship features. When you've spent it, freeze risky deploys until the rolling window earns it back.
- SLA — Service Level Agreement. The external commitment you sign with customers, usually 1–2 nines weaker than the SLO so there's room to miss without breaching the contract.
Google's SRE book is the canonical reference. Availability patterns covers the math behind nines.
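The error-budget arithmetic is short enough to write down. This is a minimal sketch; the function names and the example request counts are assumptions for illustration, and it treats the budget as either minutes of a window or a share of requests.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of allowed failure in the window for a time-based SLI."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def budget_remaining(slo, total_requests, failed_requests):
    """Fraction of a request-based error budget still unspent.

    1.0 means untouched, 0.0 means fully spent, negative means overspent.
    """
    allowed_failures = (1.0 - slo) * total_requests
    if allowed_failures <= 0:
        raise ValueError("no error budget: SLO is 100% or there is no traffic")
    return 1.0 - failed_requests / allowed_failures


# 99.95% over a rolling 30-day window.
print(round(error_budget_minutes(0.9995), 1))

# Hypothetical month: 10M requests, 2,000 failures against a 99.95% SLO.
print(round(budget_remaining(0.9995, 10_000_000, 2_000), 2))
```

With these inputs the first call gives roughly 21.6 minutes and the second shows about 60% of the budget left, since 2,000 failures have been spent out of 5,000 allowed.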
5 · What breaks
- Cardinality explosion. A metric tagged with user_id (millions of values) creates millions of time-series. Metric stores choke; bills go vertical. Strict rule: low-cardinality tags on metrics; high-cardinality dimensions go on logs/traces instead.
- Log volume. A debug log loop in one service can ship 10 TB of CloudWatch Logs in a day. Ingest cost: ~$5K. Sampling, log levels, and budgets per service are non-negotiable.
- Trace sampling. Keeping 100% of traces is too expensive at any meaningful scale. Head sampling (decide at request start) is cheap but blind to the outcome, so it drops most of the rare errors and slow requests you actually care about; tail sampling (decide at request end, after the trace shape is known) fixes that but needs a collector tier that buffers whole traces.
- Alert fatigue. Alerting on the wrong thing (CPU at 80% on one box) trains the on-call to ignore alerts. Alerts should be on user-visible symptoms (SLO breaches), not on intermediate causes.
- Dashboard drift. Dashboards built for last year's architecture; nobody updates them; on-call reads them anyway. Treat dashboards as code (for example, with the Grafana Terraform provider), reviewed at the same cadence as the services they cover.
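The tail-sampling decision described in the list above can be sketched as a small policy function. This is an illustration under assumptions, not the API of any collector: the `Span` type, the `keep_trace` name, the 500 ms threshold, and the 1% baseline rate are all hypothetical, and the longest span duration stands in for total trace latency.

```python
import random
from dataclasses import dataclass


@dataclass
class Span:
    service: str
    duration_ms: float
    error: bool = False


def keep_trace(spans, latency_threshold_ms=500.0, baseline_rate=0.01, rng=random):
    """Decide whether to keep a finished trace (tail sampling).

    Runs only after all spans have arrived, so it can look at the outcome.
    """
    if not spans:
        return False
    # Always keep traces that contain an error.
    if any(span.error for span in spans):
        return True
    # Always keep slow traces; the longest span approximates trace latency.
    if max(span.duration_ms for span in spans) >= latency_threshold_ms:
        return True
    # Keep a small random share of healthy traces as a baseline.
    return rng.random() < baseline_rate


slow = [Span("api", 800.0), Span("db", 600.0), Span("cache", 200.0)]
failed = [Span("api", 40.0), Span("db", 12.0, error=True)]
print(keep_trace(slow), keep_trace(failed))
```

A head sampler would have had to make the same call before the database span existed, which is why it cannot favor the failed or slow traces. The price of the tail approach is that something has to hold every span of a trace in memory until the trace completes.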
6 · Cost note
Observability bills surprise people. Three places to look:
- Log ingest. The single largest line item at most companies. CloudWatch ingest is ~$0.50/GB. Datadog log ingest is ~$0.10/GB, with indexing billed separately per million events according to retention. A 100 TB/month log volume is ~$50K on CW or ~$10K of ingest on DD, but DD's indexing fees come on top and can exceed the ingest charge.
- Custom metrics. CloudWatch custom metrics start at $0.30 per metric per month, with cheaper tiers at volume. A service emitting 1,000 metrics × 100 instances = 100K metrics, or up to $30K/month at the first-tier rate. The fix is to aggregate across instances (histograms instead of one series per instance) and aggressive sampling.
- Trace volume. Per-span pricing on most vendors. At $0.50 per million spans, 1 billion spans/month is $500/month for a small service; a busy one emitting 100 billion spans pays $50K. Tail-sample to keep this in check.
Rough budget: observability should land at 5–10% of infrastructure spend. Below that, you're probably flying blind; above it, you're paying tax without proportional value.
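The three line items above are easy to turn into a back-of-envelope calculator. This is a rough sketch using the approximate list prices quoted in this section as flat rates; real bills have volume tiers, retention charges, and negotiated discounts, and the function and constant names are assumptions for illustration.

```python
# Approximate flat rates taken from the text above; not a price sheet.
CW_LOG_INGEST_PER_GB = 0.50
CW_CUSTOM_METRIC_PER_MONTH = 0.30
SPAN_PRICE_PER_MILLION = 0.50


def log_ingest_cost(gb_per_month, price_per_gb=CW_LOG_INGEST_PER_GB):
    """Monthly log ingest cost in dollars."""
    return gb_per_month * price_per_gb


def custom_metric_cost(metrics_per_instance, instances,
                       price_per_metric=CW_CUSTOM_METRIC_PER_MONTH):
    """Monthly cost when every instance emits its own copy of each metric."""
    return metrics_per_instance * instances * price_per_metric


def span_cost(spans_per_month, price_per_million=SPAN_PRICE_PER_MILLION):
    """Monthly tracing cost under per-span pricing."""
    return spans_per_month / 1_000_000 * price_per_million


logs = log_ingest_cost(100_000)          # 100 TB/month expressed in GB
metrics = custom_metric_cost(1_000, 100)  # 1,000 metrics on 100 instances
spans = span_cost(1_000_000_000)          # 1 billion spans/month

print(f"logs ${logs:,.0f}  metrics ${metrics:,.0f}  spans ${spans:,.0f}")
```

Dropping the per-instance dimension changes the metrics input from 1,000 × 100 to 1,000 × 1, which is the whole argument for aggregating before you emit.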
Further reading
- "Site Reliability Engineering" — Google (free online). The book that introduced SLO/error-budget thinking to the wider industry.
- "Observability Engineering" — Charity Majors, Liz Fong-Jones, George Miranda. The Honeycomb-shaped view of the world; high-cardinality structured events as the foundation.
- OpenTelemetry documentation. The standard everyone's converging on. Worth a few hours with the spec.
- Adjacent: Performance methods — USE, RED, queueing theory. The questions observability is supposed to answer.
- Adjacent: Availability patterns. The math behind the nines you're setting SLOs around.