Observability.

Three pillars, and they only work together. Logs tell you what happened. Metrics tell you how often and how bad. Traces tell you which service did the bad thing. The cloud-native version of "observability" is shipping all three from every service to a central place, with enough structure that you can ask a question you didn't anticipate when you wrote the code. The hard parts are cost and discipline, not technology.

1 · The three pillars

Logs. Structured event records. "Request X returned 500 because Y." Best as JSON; correlatable by trace ID. Highest-cardinality of the three.
Metrics. Numeric time-series. Request rate, error rate, latency percentiles, queue depth. Cheap to store, fast to query, the foundation of dashboards and alerts.
Traces. Per-request causal chains across services. "This 800ms request spent 600ms in the DB and 200ms in cache lookup." OpenTelemetry is the open standard; almost everyone speaks it now.
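To make "best as JSON; correlatable by trace ID" concrete, here is a minimal sketch using only Python's standard `logging` module. The logger name `checkout` and the fields `trace_id` and `status` are illustrative assumptions, not names from any particular platform. In a real service the trace ID would come from your tracing library rather than a literal.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Hypothetical fields, supplied by the caller through `extra=`.
            "trace_id": getattr(record, "trace_id", None),
            "status": getattr(record, "status", None),
        }
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("checkout")  # assumed service name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
log.error(
    "upstream timeout",
    extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "status": 500},
)
print(buffer.getvalue().strip())
```

Because every line is a self-describing object, a log query can filter on `trace_id` and join the result against the matching trace without regex parsing.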

One bonus pillar: profiling (continuous CPU/heap sampling). Pyroscope, Datadog Profiler, AWS CodeGuru Profiler. Catches the hot function nobody noticed during code review.

2 · The AWS canonical version

Pillar	AWS service	Notes
Logs	CloudWatch Logs	Default destination for most AWS services. Pay per GB ingested plus per-GB-month storage. CloudWatch Logs Insights for SQL-ish queries.
Metrics	CloudWatch Metrics	Built-in for every AWS service. Custom metrics via PutMetricData or EMF (embedded metric format in your logs).
Traces	X-Ray	AWS-native distributed tracing. Adequate; not as nice as Datadog/Honeycomb but tightly integrated.
OpenTelemetry collector	ADOT (AWS Distro for OpenTelemetry)	If you want to ship traces/metrics to multiple destinations (X-Ray + Datadog), ADOT is the conduit.
Dashboards	CloudWatch Dashboards / Grafana on AMG	CW is fine for AWS-native dashboards. Grafana wins when you mix sources.
Alerting	CloudWatch Alarms + SNS / PagerDuty	The on-call paging chain.
Synthetic checks	CloudWatch Synthetics	Browser/canary scripts that ping your endpoints. The first signal something's broken.
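The EMF route mentioned in the table can be sketched as follows: the service writes a JSON log line whose `_aws` block tells CloudWatch which fields to extract as metrics. This follows the embedded metric format as I understand it, so check the field names against the AWS specification before relying on it. The namespace `Shop/Checkout`, the `Service` dimension, and the `Latency` metric are made-up examples.

```python
import json
import time


def emf_record(namespace: str, service: str, metric_name: str,
               value: float, unit: str = "Milliseconds") -> dict:
    """Build one embedded-metric-format log event as a plain dict."""
    return {
        "_aws": {
            # Milliseconds since the Unix epoch.
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    # One dimension set containing a single dimension key.
                    "Dimensions": [["Service"]],
                    "Metrics": [{"Name": metric_name, "Unit": unit}],
                }
            ],
        },
        # The dimension value and the metric value live at the top level.
        "Service": service,
        metric_name: value,
    }


record = emf_record("Shop/Checkout", "checkout", "Latency", 123.0)
# Printing to stdout is enough where stdout is shipped to CloudWatch Logs.
print(json.dumps(record))
```

The appeal of this approach is that the same line serves as a log event and a metric data point, so there is no separate PutMetricData call on the request path.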

3 · GCP, Azure, and third-party

Pillar	AWS	GCP	Azure	Vendor-neutral
Logs	CloudWatch Logs	Cloud Logging	Log Analytics (Azure Monitor)	Loki / Elastic / Datadog / Splunk
Metrics	CloudWatch Metrics	Cloud Monitoring	Azure Monitor Metrics	Prometheus / Datadog / Mimir
Traces	X-Ray	Cloud Trace	Application Insights	Jaeger / Tempo / Datadog / Honeycomb
Dashboards	CloudWatch / AMG	Cloud Monitoring / Looker Studio	Azure Workbooks	Grafana (everywhere)
SLO platform	—	—	—	Nobl9 / Datadog SLOs / Honeycomb SLOs

The real choice is "cloud-native or third-party." CloudWatch / Cloud Monitoring / Azure Monitor are integrated and cheap until they aren't. Datadog / Honeycomb / New Relic are more expensive per GB but the UX gap is real. Most serious teams end up running a hybrid: cloud-native for infrastructure metrics and logs, a third-party for traces and APM where the analytical UX matters. OpenTelemetry as the wire format keeps your options open.

4 · The SLO mental model

Observability without targets is just expensive plumbing. The SLO model gives you targets:

SLI — Service Level Indicator. A measurable signal. Request success rate. P99 latency. Time to first byte.
SLO — Service Level Objective. A target on an SLI. "99.95% of requests succeed over a rolling 30-day window."
Error budget. The complement of the SLO. A 99.95% SLO leaves 0.05% of requests free to fail; expressed as time, that is about 21.6 minutes of full outage per 30-day window. While budget remains, you can ship features. When you've spent it, freeze deploys until you earn it back.
SLA — Service Level Agreement. The external commitment you sign with customers, set looser than the SLO (for example, a 99.9% SLA behind a 99.95% SLO) so there's room to miss the internal target without breaching the contract.
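The error-budget arithmetic is simple enough to check by hand, but a short sketch makes the two views (time-based and request-based) explicit. The function names and the request counts below are illustrative assumptions.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full outage a time-based SLO tolerates over the window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent (negative if overspent)."""
    allowed_failures = (1.0 - slo) * total_requests
    if allowed_failures == 0:
        return 0.0
    return 1.0 - failed_requests / allowed_failures


minutes = error_budget_minutes(0.9995)
print(f"99.95% over 30 days allows {minutes:.1f} minutes of downtime")

# Hypothetical month: 10M requests, 2,000 failures against 5,000 allowed.
remaining = budget_remaining(0.9995, 10_000_000, 2_000)
print(f"budget remaining: {remaining:.0%}")
```

A negative result from `budget_remaining` is the signal for the deploy freeze described above.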

Google's SRE book is the canonical reference. Availability patterns covers the math behind nines.

5 · What breaks

Cardinality explosion. A metric tagged with user_id (millions of values) creates millions of time-series. Metric stores choke; bills go vertical. Strict rule: low-cardinality tags on metrics; high-cardinality dimensions go on logs/traces instead.
Log volume. A debug log loop in one service can ship 10 TB of CloudWatch Logs in a day. Ingest cost: ~$5K. Sampling, log levels, and budgets per service are non-negotiable.
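Trace sampling. Sampling 100% of traces is too expensive at any meaningful scale. Head sampling (decide at request start) is cheap but blind: it decides before it knows whether the request failed or ran slow, so interesting traces are dropped at the same rate as boring ones. Tail sampling (decide at request end, after the trace shape is known) is better but needs a collector tier that buffers spans until the trace completes.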
Alert fatigue. Alerting on the wrong thing (CPU at 80% on one box) trains the on-call to ignore alerts. Alerts should be on user-visible symptoms (SLO breaches), not on intermediate causes.
Dashboard drift. Dashboards built for last year's architecture; nobody updates them; on-call reads them anyway. Treat dashboards as code (for example, the Grafana Terraform provider), reviewed at the same cadence as the services they cover.
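The tail-sampling idea from the list above can be illustrated with a toy decision function: once all spans of a trace have arrived, keep the trace if it contains an error or a slow span, and otherwise keep only a small random baseline. This is a sketch, not the behaviour of any specific collector; the `Span` type, the 500 ms threshold, and the 1% baseline rate are all assumptions chosen for illustration.

```python
import random
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Span:
    trace_id: str
    duration_ms: float
    is_error: bool


def keep_trace(
    spans: Sequence[Span],
    latency_threshold_ms: float = 500.0,
    baseline_rate: float = 0.01,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Decide, after the trace is complete, whether to retain it."""
    if not spans:
        return False
    # Always keep traces that contain a failure.
    if any(span.is_error for span in spans):
        return True
    # Always keep traces with at least one slow span.
    if max(span.duration_ms for span in spans) >= latency_threshold_ms:
        return True
    # Keep a small random share of healthy traces as a baseline.
    return rng() < baseline_rate


slow = [Span("t1", 620.0, False), Span("t1", 40.0, False)]
healthy = [Span("t2", 35.0, False)]
print(keep_trace(slow))                        # slow trace is always kept
print(keep_trace(healthy, rng=lambda: 0.5))    # 0.5 >= 0.01, so dropped
```

The cost of this approach is the buffering: something has to hold every span until the trace is complete, which is why tail sampling needs the collector tier.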

6 · Cost note

Observability bills surprise people. Three places to look:

Log ingest. The single largest line item at most companies. CloudWatch ingest is ~$0.50/GB. Datadog log ingest is ~$0.10/GB, with indexing billed separately per million events according to retention. A 100 TB/month log volume is $50K on CW or $10K on DD for ingest alone, but DD adds indexing charges and per-host monitoring fees on top.
Custom metrics. CloudWatch custom metrics start at $0.30 per metric per month, with lower rates in higher volume tiers. A service emitting 1000 metrics × 100 instances = 100K metrics, which is $30K/month at the headline rate. The fix is aggregating across instances (histograms, not per-instance metrics) and aggressive sampling.
Trace volume. Per-span pricing on most vendors. At $0.50 per million spans, 1 billion spans/month costs $500 for a small service; a busy one at 100 billion spans/month costs $50K. Tail-sample to keep this in check.
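The three estimates above reduce to multiplication, so a back-of-the-envelope calculator is a useful sanity check before a volume change ships. The sketch below uses flat per-unit rates equal to the approximate figures quoted in this section. Real price sheets are tiered, vary by region, and change over time, so treat every rate here as a placeholder.

```python
def log_ingest_cost(gb_per_month: float, usd_per_gb: float) -> float:
    """Monthly log ingest cost at a flat per-GB rate."""
    return gb_per_month * usd_per_gb


def custom_metric_cost(metrics_per_instance: int, instances: int,
                       usd_per_metric: float) -> float:
    """Monthly cost when every instance emits its own copy of each metric."""
    return metrics_per_instance * instances * usd_per_metric


def span_cost(spans_per_month: float, usd_per_million: float) -> float:
    """Monthly trace cost under per-span pricing."""
    return spans_per_month / 1_000_000 * usd_per_million


# Placeholder rates taken from the approximate figures in the text.
print(f"logs:    ${log_ingest_cost(100_000, 0.50):,.0f}")       # 100 TB/month
print(f"metrics: ${custom_metric_cost(1000, 100, 0.30):,.0f}")  # 100K series
print(f"spans:   ${span_cost(1_000_000_000, 0.50):,.0f}")       # 1B spans/month
```

The metric formula also shows why per-instance tagging is the expensive choice: the instance count is a direct multiplier on the bill.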

Rough budget: observability should land at 5–10% of infrastructure spend. Below that, you're probably flying blind; above it, you're paying a tax without proportional value.

Further reading

Cost engineering →