Observability — logs, metrics, traces, SLOs

The three pillars (logs, metrics, traces) tell you WHAT broke. SLIs/SLOs tell you if it MATTERS. You need both.

**Logs**: prefer structured JSON over raw text. Include request_id and trace_id in every entry so logs can be joined across services. Log levels: ERROR (actionable), WARN (possible issue), INFO (milestone), DEBUG (diagnosis). Production usually filters at INFO, with DEBUG enabled behind a flag. Sinks: Elasticsearch / OpenSearch, Loki (Grafana), Datadog, Splunk, CloudWatch Logs; Vector for processing before the sink.
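As a minimal sketch of the structured-logging advice above, the following uses only the Python standard library. The `JsonFormatter` class, the `checkout` logger name, and the ID values are hypothetical, chosen for illustration; a real service would usually pull the IDs from request context rather than pass them by hand.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Correlation IDs arrive through the `extra=` argument of the log call.
            "request_id": getattr(record, "request_id", None),
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)


def build_logger(stream, level=logging.INFO) -> logging.Logger:
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(level)  # production default: INFO and above
    logger.propagate = False
    return logger


buf = io.StringIO()  # stands in for stdout or a log shipper
log = build_logger(buf)
ids = {"request_id": "req-123", "trace_id": "abc123"}
log.info("order placed", extra=ids)
log.debug("cart contents dumped", extra=ids)  # dropped: below the INFO level
lines = buf.getvalue().splitlines()
print("\n".join(lines))
```

Only the INFO record reaches the stream; switching `level` to `logging.DEBUG` is the "flag" that lets the diagnostic line through.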

**Metric types** (Prometheus model): (i) **Counter**: monotonic, only increases (requests_total); (ii) **Gauge**: instantaneous value that can go up or down (active_connections); (iii) **Histogram**: pre-bucketed distribution (request_duration_seconds with buckets); (iv) **Summary**: quantiles computed client-side. Prefer histograms over summaries: bucket counts can be summed across instances, whereas client-side quantiles cannot be meaningfully aggregated.
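The aggregation argument can be illustrated with a small plain-Python sketch. It does not use a Prometheus client library. The bucket bounds and sample latencies are assumptions, and buckets here hold per-bucket counts rather than the cumulative `le` counts Prometheus exposes.

```python
from bisect import bisect_left

# Assumed bucket upper bounds in seconds; the last bucket catches everything else.
BUCKETS = [0.1, 0.25, 0.5, 1.0, float("inf")]


def observe(counts, value):
    """Increment the first bucket whose upper bound is >= value."""
    counts[bisect_left(BUCKETS, value)] += 1


def merge(*histograms):
    """Sum bucket counts element-wise; valid because all instances share BUCKETS."""
    return [sum(column) for column in zip(*histograms)]


def quantile_upper_bound(counts, q):
    """Return the upper bound of the bucket that contains the q-quantile."""
    target = q * sum(counts)
    running = 0
    for bound, count in zip(BUCKETS, counts):
        running += count
        if running >= target:
            return bound
    return BUCKETS[-1]


instance_a = [0] * len(BUCKETS)
instance_b = [0] * len(BUCKETS)
for latency in (0.05, 0.2, 0.2, 0.4):
    observe(instance_a, latency)
for latency in (0.3, 0.9, 2.0):
    observe(instance_b, latency)

merged = merge(instance_a, instance_b)
p50 = quantile_upper_bound(merged, 0.5)
print(merged, p50)
```

The merged histogram gives a fleet-wide median estimate. Two per-instance medians from summaries could not be combined this way, because a quantile of quantiles is not the quantile of the combined data.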

Four golden signals (Google SRE): Latency (time to serve a request — success vs failure separately), Traffic (demand on the system — req/s), Errors (rate of failed requests), Saturation (how 'full' the service is — CPU %, queue depth, thread pool). Track these four per service as the baseline.

RED method (for request-driven services): Rate (requests per second), Errors (failure rate), Duration (latency distribution). Simple and useful dashboard: 3 charts per service.
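A rough sketch of the three RED numbers computed from raw request records follows. The `Request` type, the 60-second window, and the rule that only 5xx responses count as errors are assumptions made for this example. In practice the values would come from counters and histograms rather than an in-memory list.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    status: int        # HTTP status code
    duration_s: float  # time to serve, in seconds


def red_summary(requests, window_s):
    """Compute Rate, Errors, and Duration for one observation window."""
    total = len(requests)
    # Assumption: only server-side failures (5xx) count against the service.
    errors = sum(1 for r in requests if r.status >= 500)
    durations = [r.duration_s for r in requests]
    return {
        "rate_rps": total / window_s,
        "error_ratio": errors / total if total else 0.0,
        "median_s": median(durations) if durations else 0.0,
        "max_s": max(durations) if durations else 0.0,
    }


window = [
    Request(200, 0.10), Request(200, 0.20), Request(500, 1.50),
    Request(200, 0.10), Request(404, 0.05), Request(503, 2.00),
]
summary = red_summary(window, window_s=60)
print(summary)
```

Each key maps onto one of the three charts: request rate, error ratio, and a latency view (a real dashboard would plot several percentiles rather than only the median and maximum).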

USE method (for resources): Utilization (% busy), Saturation (queue length or wait), Errors (hardware or saturation-induced). Useful for node-level, disk I/O, network.

**Tracing**: OpenTelemetry (OTel) is the cross-vendor standard. Instrument code to emit spans (each span is a unit of work); every span carries a parent_id, so the spans of one request form a tree. A span contains: name, start/end time, attributes (user_id, query, etc.), events, links. Backends: Jaeger (open source), Tempo (Grafana), Honeycomb, Datadog APM, New Relic.
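To make the span data model concrete, here is a plain-Python sketch of how spans linked by parent_id form a tree. It does not use any tracing SDK. The `Span` class, the span names, and the timings are invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the request
    name: str
    start_ms: int
    end_ms: int
    attributes: dict = field(default_factory=dict)


def render_tree(spans):
    """Group spans by parent_id and return an indented outline of the trace."""
    children = defaultdict(list)
    for span in spans:
        children[span.parent_id].append(span)

    lines = []

    def walk(parent_id, depth):
        for span in sorted(children[parent_id], key=lambda s: s.start_ms):
            duration = span.end_ms - span.start_ms
            lines.append(f"{'  ' * depth}{span.name} ({duration} ms)")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


trace = [
    Span("a", None, "GET /checkout", 0, 120, {"user_id": "u-42"}),
    Span("c", "a", "payment.charge", 25, 110),
    Span("b", "a", "auth.verify", 5, 20),
    Span("d", "c", "db.insert", 30, 60),
]
tree = render_tree(trace)
print("\n".join(tree))
```

Spans can arrive in any order; the backend reconstructs the hierarchy from the IDs alone, which is what lets each service report its spans independently.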

SLO engineering: pick SLIs representative of user experience (request success rate at the edge, not internal subsystem health). Define SLO target with a window (rolling 30 days). Error budget drives policy: (i) spent 100% → freeze feature work, focus on reliability; (ii) healthy budget → ship faster. Prevents both 'never ship' (too conservative) and 'break prod constantly' (no discipline).
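The error-budget arithmetic can be sketched as follows. The 99.9% target, the request counts, and the function name are assumptions for this example. The freeze rule mirrors the policy described above (budget fully spent means feature work stops).

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Return how much of the window's error budget has been consumed."""
    # The budget is the share of requests allowed to fail within the window.
    allowed_failures = (1 - slo_target) * total_requests
    if allowed_failures > 0:
        consumed = failed_requests / allowed_failures
    else:
        consumed = float("inf")
    return {
        "allowed_failures": allowed_failures,
        "budget_consumed": consumed,
        "freeze_features": consumed >= 1.0,
    }


# Assumed numbers: 99.9% SLO, one million requests in the rolling 30 days.
healthy = error_budget(0.999, 1_000_000, 400)
exhausted = error_budget(0.999, 1_000_000, 1_200)
print(healthy)
print(exhausted)
```

For a time-based SLI the same 99.9% target over 30 days allows 30 × 24 × 60 × 0.001 = 43.2 minutes of unavailability.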

**Alerting**: alert on SLO burn rate (how fast the error budget is being consumed, relative to the rate that would exactly exhaust it over the SLO window) rather than on raw thresholds. Multi-window burn-rate alerts fire only when both a short window and a long window are elevated, which reduces false positives. Every alert should include: (a) the SLO impacted, (b) a runbook link, (c) recent deploys / changes to correlate.
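A minimal sketch of a multi-window burn-rate check is shown below. The function names and the example error ratios are hypothetical. The 14.4 threshold is the paging example from the Google SRE Workbook: sustained for one hour, it consumes 14.4 / 720 = 2% of a 30-day budget. Treat the windows (for instance 1 hour and 5 minutes) as tunable assumptions.

```python
def burn_rate(error_ratio, slo_target):
    """Burn rate 1.0 means the budget lasts exactly one SLO window."""
    return error_ratio / (1 - slo_target)


def should_page(long_window_errors, short_window_errors,
                slo_target=0.999, threshold=14.4):
    """Page only if both the long and the short window burn too fast."""
    long_burn = burn_rate(long_window_errors, slo_target)
    short_burn = burn_rate(short_window_errors, slo_target)
    return long_burn >= threshold and short_burn >= threshold


# Error ratios measured over the long (e.g. 1 h) and short (e.g. 5 min) windows.
ongoing_incident = should_page(0.02, 0.03)     # both windows elevated
already_recovered = should_page(0.02, 0.001)   # long window still high, short is fine
brief_spike = should_page(0.001, 0.05)         # short blip, long window unaffected
print(ongoing_incident, already_recovered, brief_spike)
```

The long window shows the burn is significant and the short window shows it is still happening, so neither a brief spike nor an already-resolved incident pages anyone.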

Correlation across pillars: logs contain trace_id, metrics have exemplars linking to a trace, traces include span events that link to logs. Modern platforms (Datadog, Grafana stack, Honeycomb, Splunk O11y) unify the view.

Grounded on https://sre.google/sre-book/monitoring-distributed-systems/
