Observability — logs, metrics, traces, SLOs
The three pillars (logs, metrics, traces) tell you WHAT broke. SLIs/SLOs tell you if it MATTERS. You need both.
**Logs**: prefer structured JSON over raw text. Include request_id and trace_id in every record so logs can be joined across services. Log levels: ERROR (actionable), WARN (possible issue), INFO (milestone), DEBUG (diagnosis). Production usually filters at INFO, with DEBUG enabled behind a flag. Sinks: Elasticsearch / OpenSearch, Loki (Grafana), Datadog, Splunk, CloudWatch Logs, with Vector as a processing and routing layer in front of them.
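A minimal sketch of the structured-logging advice above, using only the Python standard library. The logger name `checkout`, the field names, and the ID values are hypothetical; a real service would take the IDs from the incoming request context.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Correlation IDs arrive through `extra=`; None when absent.
            "request_id": getattr(record, "request_id", None),
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)


def build_logger(stream, debug=False):
    """Return a logger that filters at INFO unless the debug flag is set."""
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG if debug else logging.INFO)
    return logger


stream = io.StringIO()  # stands in for stdout or a log shipper
log = build_logger(stream)
ids = {"request_id": "req-123", "trace_id": "abc123"}  # made-up IDs
log.info("order placed", extra=ids)
log.debug("cart contents dumped", extra=ids)  # dropped: below INFO
lines = stream.getvalue().splitlines()
```

Because every line is a self-describing JSON object, the sink can index `trace_id` directly instead of parsing it out of free text.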
**Metrics types** (Prometheus model): (i) **Counter** — monotonic, only increases (requests_total); (ii) **Gauge** — instantaneous value (active_connections); (iii) **Histogram** — pre-bucketed distribution (request_duration_seconds with buckets); (iv) **Summary** — quantiles computed client-side. Histograms over summaries for aggregatability across instances.
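To illustrate why histograms aggregate across instances, here is a simplified sketch in plain Python. It is not the Prometheus client API: the bucket bounds are made up, counts are stored per bucket rather than cumulatively, and the quantile estimate returns a bucket's upper bound instead of interpolating.

```python
from bisect import bisect_left

# Upper bounds in seconds (hypothetical); the last bucket catches everything.
BUCKETS = [0.1, 0.5, 1.0, float("inf")]


class Histogram:
    def __init__(self, counts=None):
        self.counts = [0] * len(BUCKETS) if counts is None else list(counts)

    def observe(self, seconds):
        # Pick the first bucket whose upper bound is >= the observation.
        self.counts[bisect_left(BUCKETS, seconds)] += 1

    def merge(self, other):
        # Bucket counts from different instances simply add up.
        return Histogram(a + b for a, b in zip(self.counts, other.counts))

    def quantile_upper_bound(self, q):
        """Upper bound of the bucket that contains the q-th quantile."""
        rank = q * sum(self.counts)
        cumulative = 0
        for bound, count in zip(BUCKETS, self.counts):
            cumulative += count
            if cumulative >= rank:
                return bound
        return BUCKETS[-1]


instance_a, instance_b = Histogram(), Histogram()
for s in (0.05, 0.2, 0.3):
    instance_a.observe(s)
for s in (0.7, 0.08, 2.0):
    instance_b.observe(s)

merged = instance_a.merge(instance_b)
```

Summaries have no equivalent of `merge`: a p99 computed on each instance cannot be combined into a fleet-wide p99, which is the aggregatability argument made above.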
Four golden signals (Google SRE): Latency (time to serve a request — success vs failure separately), Traffic (demand on the system — req/s), Errors (rate of failed requests), Saturation (how 'full' the service is — CPU %, queue depth, thread pool). Track these four per service as the baseline.
RED method (for request-driven services): Rate (requests per second), Errors (failure rate), Duration (latency distribution). Simple and useful dashboard: 3 charts per service.
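As a rough sketch, the three RED numbers can be derived from a window of request records. The `Request` type and the choice to count only 5xx statuses as failures are assumptions for illustration; a real dashboard would compute these from counters and histograms rather than raw records.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    duration_s: float
    status: int  # HTTP status code


def red_summary(requests, window_s):
    """Rate, Errors, Duration for one service over a window of window_s seconds."""
    n = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)  # assumption: 5xx = failure
    return {
        "rate_rps": n / window_s,
        "error_ratio": errors / n if n else 0.0,
        "median_s": median(r.duration_s for r in requests) if n else None,
    }


sample = [
    Request(0.1, 200),
    Request(0.2, 200),
    Request(0.3, 503),
    Request(0.9, 200),
]
summary = red_summary(sample, window_s=2)
```

Each key maps to one of the three charts: request rate, error ratio, and a latency statistic (a percentile such as p99 would usually sit alongside the median).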
USE method (for resources): Utilization (% busy), Saturation (queue length or wait), Errors (hardware or saturation-induced). Useful for node-level, disk I/O, network.
**Tracing**: OpenTelemetry (OTel) as the cross-vendor standard. Instrument code to emit spans (one span per unit of work); each span carries a parent_id, so the spans of a request form a tree. A span contains: name, start/end time, attributes (user_id, query, etc.), events, links. Backends: Jaeger (open source), Tempo (Grafana), Honeycomb, Datadog APM, New Relic.
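The span tree can be pictured with a small stand-alone sketch. This is not an SDK or exporter API; the `Span` type, the span names, and the timings are invented to show how parent_id links reconstruct one request.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    start_ms: int
    end_ms: int
    attributes: dict = field(default_factory=dict)


def render_tree(spans):
    """Return indented lines showing the parent/child structure of a trace."""
    children = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    lines = []

    def walk(parent_id, depth):
        for span in sorted(children.get(parent_id, []), key=lambda s: s.start_ms):
            duration = span.end_ms - span.start_ms
            lines.append(f"{'  ' * depth}{span.name} ({duration} ms)")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


trace = [
    Span("c", "a", "POST /payments", 45, 110),
    Span("a", None, "GET /checkout", 0, 120, {"user_id": "u-42"}),
    Span("d", "c", "charge card", 50, 100),
    Span("b", "a", "SELECT cart", 10, 40),
]
tree = render_tree(trace)
# GET /checkout (120 ms)
#   SELECT cart (30 ms)
#   POST /payments (65 ms)
#     charge card (50 ms)
```

Spans can arrive in any order from different services; the backend only needs span_id and parent_id to rebuild the tree.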
SLO engineering: pick SLIs representative of user experience (request success rate at the edge, not internal subsystem health). Define the SLO target over a window (e.g. rolling 30 days). The error budget is the allowed unreliability, 1 minus the SLO target, and it drives policy: (i) budget exhausted → freeze feature work, focus on reliability; (ii) healthy budget → ship faster. This prevents both 'never ship' (too conservative) and 'break prod constantly' (no discipline).
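The error-budget arithmetic is small enough to sketch directly. The 99.9% target, the request counts, and the two-state policy below are illustrative assumptions, not recommended values.

```python
def allowed_downtime_minutes(slo_target, window_days):
    """Minutes of full outage the budget permits over the window."""
    return (1 - slo_target) * window_days * 24 * 60


def budget_consumed(slo_target, total_requests, failed_requests):
    """Fraction of the error budget spent so far (1.0 means exhausted)."""
    allowed_failures = (1 - slo_target) * total_requests
    if allowed_failures == 0:
        return float("inf") if failed_requests else 0.0
    return failed_requests / allowed_failures


def policy(consumed):
    # Simplified two-state policy taken from the text above.
    return "freeze feature work" if consumed >= 1.0 else "keep shipping"


downtime = allowed_downtime_minutes(0.999, 30)     # about 43.2 minutes
consumed = budget_consumed(0.999, 1_000_000, 400)  # about 0.4, i.e. 40% spent
decision = policy(consumed)
```

With a 99.9% target, 1,000,000 requests allow roughly 1,000 failures, so 400 failures leave about 60% of the budget for further releases.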
**Alerting**: alert on SLO burn rate (fraction of error budget consumed per unit time) rather than raw thresholds. Multi-window burn-rate alerts (short window + long window both elevated) reduce false positives. Every alert should include: (a) the SLO impacted, (b) a runbook link, (c) recent deploys / changes to correlate.
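A minimal sketch of a multi-window burn-rate check. Burn rate is the observed error ratio divided by the budgeted error ratio (1 minus the SLO target), so a value of 1.0 spends the budget exactly over the SLO window. The 14.4 threshold is an illustrative choice: sustained for one hour, it consumes 2% of a 30-day budget. The window lengths in the comments are likewise assumptions.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1 - slo_target)


def should_page(long_window_ratio, short_window_ratio, slo_target, threshold=14.4):
    """Page only when both windows burn fast.

    The long window (e.g. 1 h) shows the problem is significant; the short
    window (e.g. 5 min) shows it is still happening right now.
    """
    return (
        burn_rate(long_window_ratio, slo_target) >= threshold
        and burn_rate(short_window_ratio, slo_target) >= threshold
    )


SLO = 0.999
ongoing_incident = should_page(0.02, 0.03, SLO)     # both windows elevated
already_recovered = should_page(0.02, 0.001, SLO)   # short window back to normal
brief_blip = should_page(0.0005, 0.05, SLO)         # long window barely moved
```

Requiring both windows is what suppresses the two common false positives: a short spike that never threatened the budget, and an incident that has already ended.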
Correlation across pillars: logs contain trace_id, metrics have exemplars linking to a trace, traces include span events that link to logs. Modern platforms (Datadog, Grafana stack, Honeycomb, Splunk O11y) unify the view.
Grounded on https://sre.google/sre-book/monitoring-distributed-systems/