AI ROI & Metrics — how to measure real value

Most AI projects fail on measurement, not tech. Here's what to track to know if it's actually working and to justify scaling the budget.

Rigorous AI ROI measurement requires three things: (i) a baseline (metrics for the pre-AI state), (ii) counterfactual discipline (A/B test, matched cohort, or staggered rollout), and (iii) holistic metrics (not just model quality).

**Quality metrics (task-dependent)**: classification — precision, recall, F1, AUC; generation — scores with rubrics (factuality, relevance, tone), human spot-audit on sample; retrieval — recall@k, MRR; structured extraction — field-level accuracy + schema-compliance rate. Always report on a HOLDOUT eval set, not the one you iterated on.
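
As a rough illustration of the classification and retrieval metrics above, here is a minimal standard-library sketch. The function names and input shapes (binary 0/1 labels, ranked lists of document IDs) are assumptions made for this example, not part of any particular eval tool.

```python
from __future__ import annotations

from typing import Sequence, Set


def precision_recall_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> tuple[float, float, float]:
    """Binary precision, recall, and F1 for labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)


def mean_reciprocal_rank(queries: Sequence[tuple[Sequence[str], Set[str]]]) -> float:
    """Mean of 1/rank of the first relevant result per query (0 when none is found)."""
    total = 0.0
    for ranked_ids, relevant_ids in queries:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant_ids:
                total += 1.0 / rank
                break
    return total / len(queries) if queries else 0.0
```

Run these only on the holdout set; scores computed on examples used while tuning prompts will look better than production.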

**Efficiency metrics**: time per task (baseline vs AI-augmented; measured from activity logs, not self-reported), cost per successful outcome (not per API call; include failures and rework), deflection rate (support), automation rate (document ops), throughput (tasks/hour per employee).
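
Cost per successful outcome is easy to get wrong, so a small sketch may help. The `TaskRecord` fields and the idea of pricing rework at a loaded hourly rate are assumptions for this example; adapt them to whatever your logs actually contain.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class TaskRecord:
    api_cost: float        # model/API spend for this attempt (hypothetical field)
    rework_minutes: float  # human time spent reviewing or fixing the output
    succeeded: bool        # did the task reach an accepted outcome?


def cost_per_successful_outcome(records: Iterable[TaskRecord], loaded_rate_per_hour: float) -> float:
    """Total spend (API + rework labor, failures included) divided by successes."""
    records = list(records)
    total_cost = sum(
        r.api_cost + (r.rework_minutes / 60.0) * loaded_rate_per_hour
        for r in records
    )
    successes = sum(1 for r in records if r.succeeded)
    if successes == 0:
        return float("inf")  # nothing succeeded, so the unit cost is unbounded
    return total_cost / successes
```

The failed attempt still contributes to the numerator, which is the point: dividing API spend by call count hides the cost of failures.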

**Adoption metrics**: DAU / WAU / MAU on the AI surface, % of eligible tasks using AI (the most revealing: of the tasks that could use AI, what fraction actually do), time-to-engagement (days from access to first real use), stickiness (DAU/MAU ratio), repeat usage (same user, multiple sessions).
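
Two of these adoption metrics can be sketched from raw event rows. This example assumes stickiness is defined as average DAU over a window divided by distinct users in that window (one common convention; others exist), and that each task row carries hypothetical `eligible` and `used_ai` flags.

```python
from datetime import date, timedelta
from typing import Iterable, Tuple


def stickiness(events: Iterable[Tuple[str, date]], as_of: date, window_days: int = 30) -> float:
    """Average daily active users divided by distinct users over the window."""
    start = as_of - timedelta(days=window_days - 1)
    users_by_day: dict = {}
    for user_id, day in events:
        if start <= day <= as_of:
            users_by_day.setdefault(day, set()).add(user_id)
    window_users = set().union(*users_by_day.values()) if users_by_day else set()
    if not window_users:
        return 0.0
    # Days with no activity count as zero DAU, so divide by the full window.
    avg_dau = sum(len(users) for users in users_by_day.values()) / window_days
    return avg_dau / len(window_users)


def eligible_task_share(tasks: Iterable[Tuple[bool, bool]]) -> float:
    """Share of eligible tasks that used AI; tasks are (eligible, used_ai) pairs."""
    used_flags = [used_ai for eligible, used_ai in tasks if eligible]
    return sum(used_flags) / len(used_flags) if used_flags else 0.0
```

Deciding which tasks count as eligible is the hard part and has to come from the business process, not from the analytics tool.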

**Risk + safety metrics**: **hallucination rate** (% of generations containing factual errors, measured by sampled audit), **escalation miss rate** (support: the AI handled a case it should have escalated), **complaint volume**, **refusal rate** (model declines user requests; too high = broken UX, too low = unsafe), **PII leak incidents**, ** attempt rate + block rate**.
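
Because hallucination rate comes from a sampled audit, it helps to report an interval rather than a bare percentage. The sketch below uses a Wilson score interval, which is one reasonable choice the text does not prescribe; the 95% level (z = 1.96) is likewise an assumption.

```python
import math
from typing import Tuple


def wilson_interval(errors: int, sample_size: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for an error rate observed in a random audit sample."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if not 0 <= errors <= sample_size:
        raise ValueError("errors must be between 0 and sample_size")
    p = errors / sample_size
    denom = 1 + z * z / sample_size
    center = (p + z * z / (2 * sample_size)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / sample_size + z * z / (4 * sample_size ** 2)) / denom
    )
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical audit: 12 generations with factual errors out of 200 sampled.
low, high = wilson_interval(errors=12, sample_size=200)
print(f"hallucination rate 6.0%, 95% interval {low:.1%} to {high:.1%}")
```

A small audit sample gives a wide interval, which is a useful prompt to audit more before claiming the rate has improved.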

**ROI formula (steelmanned)**: ROI = (value_delivered − total_cost) / total_cost. Value_delivered = sum(time_saved × loaded_labor_rate) + sum(errors_avoided × cost_per_error) + sum(revenue_lift × margin) + sum(retention_gain × LTV). Total_cost = API_cost + infra + engineering_build + engineering_maintenance + change_mgmt_effort + amortized_eval_infra. Most projects forget the last three cost terms, which understates total cost by 30-50%.
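
The formula translates directly into code. In this sketch the component names mirror the formula above, while every number is a made-up illustration, not data from any real project.

```python
from typing import Mapping


def roi(value_components: Mapping[str, float], cost_components: Mapping[str, float]) -> float:
    """ROI = (value_delivered - total_cost) / total_cost."""
    value_delivered = sum(value_components.values())
    total_cost = sum(cost_components.values())
    if total_cost <= 0:
        raise ValueError("total_cost must be positive")
    return (value_delivered - total_cost) / total_cost


# Hypothetical annual figures, for illustration only.
value = {
    "time_saved": 1200 * 50.0,     # hours saved x loaded labor rate
    "errors_avoided": 40 * 250.0,  # errors avoided x cost per error
}
full_cost = {
    "api_cost": 8000.0,
    "infra": 4000.0,
    "engineering_build": 20000.0,
    "engineering_maintenance": 6000.0,
    "change_mgmt_effort": 5000.0,
    "amortized_eval_infra": 2000.0,
}
# The same project with the three commonly forgotten terms left out.
partial_cost = {k: full_cost[k] for k in ("api_cost", "infra", "engineering_build")}

print(f"ROI with full costs:    {roi(value, full_cost):.2f}")
print(f"ROI with partial costs: {roi(value, partial_cost):.2f}")
```

With these invented numbers, leaving out maintenance, change management, and eval infrastructure roughly doubles the apparent ROI, which is why the full cost list matters.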

**Experimental design**: holdout A/B (some users get AI, some don't; clean causal estimate, but only works when you can split the population), staggered rollout (one team or region at a time; groups not yet rolled out act as a natural counterfactual), before/after with a regression-to-the-mean caveat (weakest; external factors confound), synthetic control (advanced; reconstruct a counterfactual cohort using weights fit on pre-period data).
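
One simple way to analyze a staggered rollout is a difference-in-differences estimate, a technique the text does not name but which fits the "natural counterfactual" idea. The sketch assumes the treated and not-yet-treated groups would have followed parallel trends without the AI; the sample values are hypothetical minutes per task.

```python
from statistics import mean
from typing import Sequence


def diff_in_diff(
    treated_pre: Sequence[float],
    treated_post: Sequence[float],
    control_pre: Sequence[float],
    control_post: Sequence[float],
) -> float:
    """Change in the treated group minus change in the control group."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical minutes per task: team A got the AI tool, team B has not yet.
effect = diff_in_diff(
    treated_pre=[30, 32, 28],
    treated_post=[22, 24, 20],
    control_pre=[31, 29, 30],
    control_post=[28, 27, 29],
)
print(f"estimated effect: {effect} minutes per task")
```

A naive before/after on the treated team alone would report an 8-minute improvement here; subtracting the 2-minute drift seen in the control team leaves 6 minutes attributable to the rollout, under the parallel-trends assumption.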

**Measurement stack**: product analytics (Amplitude, Mixpanel, PostHog) for adoption/behavior; observability (Langfuse, LangSmith, Braintrust) for LLM traces + eval scores; business outcome tracking (warehouse + BI, typically dbt + Looker/Metabase); incident tracking + complaint ingestion. Key every system on user_id + task_id + session_id so you can join a trace to an outcome.
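
The join itself is simple once the keys exist. This sketch works on plain dictionaries with hypothetical fields (`eval_score`, `resolved`); in practice the same join would usually run as SQL in the warehouse rather than in application code.

```python
from typing import Dict, Iterable, List, Tuple

JOIN_KEYS = ("user_id", "task_id", "session_id")


def _key(row: Dict) -> Tuple:
    return tuple(row[k] for k in JOIN_KEYS)


def join_traces_to_outcomes(
    traces: Iterable[Dict], outcomes: Iterable[Dict]
) -> Tuple[List[Dict], List[Dict]]:
    """Return (joined rows, traces with no matching business outcome)."""
    outcome_by_key = {_key(o): o for o in outcomes}
    joined, unmatched = [], []
    for trace in traces:
        outcome = outcome_by_key.get(_key(trace))
        if outcome is None:
            unmatched.append(trace)
        else:
            joined.append({**trace, **outcome})
    return joined, unmatched


# Hypothetical rows exported from an observability tool and a warehouse table.
traces = [
    {"user_id": "u1", "task_id": "t1", "session_id": "s1", "eval_score": 0.9},
    {"user_id": "u2", "task_id": "t7", "session_id": "s4", "eval_score": 0.4},
]
outcomes = [
    {"user_id": "u1", "task_id": "t1", "session_id": "s1", "resolved": True},
]
joined, unmatched = join_traces_to_outcomes(traces, outcomes)
```

Tracking the unmatched count is worthwhile on its own: a high share of traces with no outcome row usually means an ID is being dropped somewhere between systems.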

**Decision framework at 90 days**: (a) adoption > 40% AND quality ≥ baseline AND ROI positive → scale; (b) adoption < 20% → UX/change-mgmt rework before scaling; (c) quality below baseline → iterate on prompt//eval; (d) adoption OK but ROI near zero → feature is a habit, not value — deprioritize investment.
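
The rules above can be encoded so that every project gets the same treatment. Several details here are assumptions, since the text leaves them open: "adoption OK" is read as at least 20%, "ROI near zero" as an absolute value within a hypothetical `roi_near_zero` threshold, and cases the rules do not cover (for example adoption between 20% and 40% with healthy ROI) fall through to manual review.

```python
from typing import List

SCALE = "scale"
REWORK = "rework UX / change management before scaling"
ITERATE = "iterate on quality before scaling"
DEPRIORITIZE = "deprioritize: habit, not value"
REVIEW = "no rule matched: review manually"


def decide_at_90_days(
    adoption: float,          # share of eligible tasks or users, 0.0 to 1.0
    quality: float,           # current quality score
    baseline_quality: float,  # pre-AI quality on the same scale
    roi: float,               # (value - cost) / cost
    roi_near_zero: float = 0.05,  # hypothetical tolerance, not from the text
) -> List[str]:
    """Return every recommendation triggered by the 90-day rules."""
    findings = []
    if adoption < 0.20:
        findings.append(REWORK)
    if quality < baseline_quality:
        findings.append(ITERATE)
    if adoption >= 0.20 and abs(roi) <= roi_near_zero:
        findings.append(DEPRIORITIZE)
    if findings:
        return findings
    if adoption > 0.40 and roi > 0:
        return [SCALE]
    return [REVIEW]
```

Returning a list rather than a single verdict keeps overlapping problems visible, such as low adoption combined with below-baseline quality.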

**Communication upward**: one-page dashboard per project (quality, efficiency, adoption, risk, ROI). Monthly review with the sponsor. Reports should force a 'continue / iterate / kill' decision, not serve as a progress slideshow.

Grounded on https://www.anthropic.com/customers
