Observability — logs, metrics, traces, SLOs

The three pillars (logs, metrics, traces) tell you WHAT broke. SLIs/SLOs tell you if it MATTERS. You need both.


**Observability** is your ability to **understand what's happening inside a system from its outputs**. Three pillars, each answering a different question:

**Logs** = text records of events. 'User 42 logged in at 15:03:22'. Great for post-mortems and for digging into a specific incident. Unstructured text → hard to aggregate. Structured logs (JSON) → queryable in log systems (Elastic, Datadog, Loki, CloudWatch).
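To make the structured-versus-unstructured point concrete, here is a minimal sketch using only Python's standard `logging` and `json` modules. The `JsonFormatter` class, the `build_logger` helper, and the `fields` key are hypothetical names chosen for illustration; real log pipelines often use a dedicated library instead.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Anything passed as extra={"fields": {...}} lands on the record
        # as the attribute `fields` (a name assumed for this sketch).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(name: str = "app") -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger()
    log.info("user logged in", extra={"fields": {"user_id": 42, "event": "login"}})
```

Because `user_id` and `event` are separate keys rather than words inside a sentence, a log system can filter or aggregate on them directly.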

**Metrics** = numeric time series. Requests per second, error rate, p99 latency, CPU %. Cheap to store, easy to alert on. Tools: Prometheus, Datadog, CloudWatch, Grafana Cloud. The primary signal for 'is everything OK right now?'.
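As a rough illustration of what two of these numbers mean, the sketch below computes an error rate and a p99 latency from a batch of request samples. The `Request` type and both function names are hypothetical, and the nearest-rank percentile used here is only one of several definitions; metrics backends typically estimate percentiles from histograms rather than raw samples.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Request:
    latency_ms: float
    status: int  # HTTP status code


def error_rate(requests: Sequence[Request]) -> float:
    """Fraction of requests that ended in a server error (5xx)."""
    if not requests:
        return 0.0
    errors = sum(1 for r in requests if r.status >= 500)
    return errors / len(requests)


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile of an empty sample is undefined")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


if __name__ == "__main__":
    sample = [Request(latency_ms=float(i), status=200) for i in range(1, 99)]
    sample += [Request(latency_ms=900.0, status=503), Request(latency_ms=1200.0, status=500)]
    print("error rate:", error_rate(sample))
    print("p99 latency (ms):", percentile([r.latency_ms for r in sample], 99))
```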

**Traces** = the path of a request through services. User click → API → auth service → DB → cache → response, with timing for each hop. Essential for diagnosing 'why is this specific request slow?' in microservices. Tools: OpenTelemetry, Jaeger, Datadog APM, Honeycomb.
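The idea of "timing for each hop" can be sketched with a toy in-process tracer. This is not the OpenTelemetry API; `Span` and `Tracer` are hypothetical classes written for illustration, they assume a single thread, and they skip the cross-service context propagation that real tracing needs.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class Span:
    trace_id: str             # shared by every span in one request
    span_id: str
    parent_id: Optional[str]  # None for the root span
    name: str
    start: float
    end: float = 0.0

    @property
    def duration_ms(self) -> float:
        return (self.end - self.start) * 1000


class Tracer:
    """Collects nested spans for a single request (single-threaded sketch)."""

    def __init__(self) -> None:
        self.trace_id = uuid.uuid4().hex
        self.finished: List[Span] = []
        self._stack: List[Span] = []

    @contextmanager
    def span(self, name: str) -> Iterator[Span]:
        parent = self._stack[-1].span_id if self._stack else None
        current = Span(self.trace_id, uuid.uuid4().hex[:16], parent, name, time.perf_counter())
        self._stack.append(current)
        try:
            yield current
        finally:
            current.end = time.perf_counter()
            self._stack.pop()
            self.finished.append(current)


if __name__ == "__main__":
    tracer = Tracer()
    with tracer.span("api"):
        with tracer.span("auth"):
            time.sleep(0.01)
        with tracer.span("db"):
            time.sleep(0.02)
    for s in tracer.finished:
        print(f"{s.name:5s} parent={s.parent_id} {s.duration_ms:.1f} ms")
```

The parent links are what let a trace viewer draw the request as a tree and show which hop consumed the time.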

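**SLI / SLO / SLA**: (i) **SLI** (Service Level Indicator) = a measurable metric (availability %, p99 latency); (ii) **SLO** (Service Level Objective) = the target for that metric ('99.9% of requests < 300ms'); (iii) **SLA** (Service Level Agreement) = the contract with customers, usually looser than the SLO, with penalties if it is breached. **Error budget** = 1 - SLO. If the SLO is 99.9%, the error budget is 0.1%: about 43 minutes over a 30-day window (~44 min for an average-length month).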
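The error-budget arithmetic can be checked with a few lines of Python. Both function names are hypothetical, and the 30-day default window is an assumption; teams pick different windows.

```python
def error_budget_minutes(slo: float, window_days: float = 30.0) -> float:
    """Minutes of full unavailability allowed by an availability SLO over the window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1, e.g. 0.999")
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent (negative means overspent)."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1.0 - slo) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    print(f"{error_budget_minutes(0.999):.1f} min per 30 days")
    print(f"{error_budget_minutes(0.999, 365.25 / 12):.1f} min per average month")
    print(f"{budget_remaining(0.999, 1_000_000, 250):.0%} of budget left")
```

With a 99.9% SLO this gives about 43.2 minutes for a 30-day window and about 43.8 minutes for an average-length month, which is where the rounded '~44 min/month' figure comes from.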

**Dashboards vs alerts**: dashboards = 'let me look at the health'. Alerts = 'tell me when something's wrong'. Too many alerts = alert fatigue → alerts get ignored. Alert only on symptoms worth waking someone up for (error-rate spike, SLO burn), not on every CPU spike.
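One way to express 'alert on SLO burn' is a burn-rate check: how many times faster than sustainable the error budget is being spent. The sketch below is an assumption-laden illustration, not a prescribed policy. The function names are hypothetical, and the default threshold of 14.4 corresponds to spending 2% of a 30-day budget in one hour (0.02 × 720 h), a commonly cited paging threshold that you would tune for your own service.

```python
def burn_rate(observed_error_rate: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return observed_error_rate / (1.0 - slo)


def should_page(
    long_window_error_rate: float,
    short_window_error_rate: float,
    slo: float,
    threshold: float = 14.4,  # assumed default; see the note above the code
) -> bool:
    """Page only if the burn is both sustained (long window) and still happening (short window)."""
    return (
        burn_rate(long_window_error_rate, slo) >= threshold
        and burn_rate(short_window_error_rate, slo) >= threshold
    )


if __name__ == "__main__":
    slo = 0.999
    # Sustained 2% errors over the last hour, 3% over the last five minutes.
    print(should_page(0.02, 0.03, slo))
    # The last hour looked bad, but the last five minutes have recovered.
    print(should_page(0.02, 0.0005, slo))
```

Requiring both windows to exceed the threshold is a design choice that keeps a brief blip, or an incident that has already recovered, from paging anyone.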

Grounded on https://sre.google/sre-book/monitoring-distributed-systems/
