Dualo
AI in Practice

AI ROI & Metrics — how to measure real value

Most AI projects fail on measurement, not tech. Here's what to track to know if it's actually working and to justify scaling the budget.


Rigorous AI ROI measurement requires three things: (i) a baseline (metrics for the pre-AI state), (ii) counterfactual discipline (A/B test, matched cohort, or staggered rollout), and (iii) holistic metrics (not just model quality).

**Quality metrics (task-dependent)**: classification: precision, recall, F1, AUC; generation: rubric-based scores (factuality, relevance, tone) plus a human spot-audit on a sample; retrieval: recall@k, MRR; structured extraction: field-level accuracy + schema-compliance rate. Always report on a held-out eval set, not the one you iterated on.
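
As a rough illustration of the classification and retrieval metrics above, here is a minimal pure-Python sketch. The function names and input shapes (binary 0/1 labels, ranked lists of document IDs) are assumptions made for the example, not part of any particular eval library.

```python
from typing import Sequence, Set


def precision_recall_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> tuple[float, float, float]:
    """Binary precision, recall, and F1 for labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)


def mean_reciprocal_rank(queries: Sequence[tuple[Sequence[str], Set[str]]]) -> float:
    """Average of 1/rank of the first relevant result (0 when none is found)."""
    total = 0.0
    for ranked_ids, relevant_ids in queries:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant_ids:
                total += 1.0 / rank
                break
    return total / len(queries) if queries else 0.0
```

Run these on the held-out set only; scores computed on the set used for prompt iteration will look better than production behavior.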

**Efficiency metrics**: time per task (baseline vs. AI-augmented; measured from activity logs, not self-reported), cost per successful outcome (not per API call; include failures and rework), deflection rate (support), automation rate (document ops), throughput (tasks/hour per employee).
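
To show why cost per successful outcome differs from cost per API call, here is a small sketch. The parameter names and the simplification that every failure incurs one fixed rework cost are assumptions made for illustration.

```python
def cost_per_successful_outcome(
    attempts: int,
    successes: int,
    api_cost_per_attempt: float,
    rework_cost_per_failure: float,
) -> float:
    """Total spend (API calls plus rework on failures) divided by successes."""
    if successes <= 0:
        raise ValueError("no successful outcomes to divide by")
    if successes > attempts:
        raise ValueError("successes cannot exceed attempts")
    failures = attempts - successes
    total_cost = attempts * api_cost_per_attempt + failures * rework_cost_per_failure
    return total_cost / successes


# Illustrative numbers only: 1,000 attempts, 800 successes,
# $0.05 per API call, $4.00 of human rework per failure.
example = cost_per_successful_outcome(1000, 800, 0.05, 4.0)
```

With these made-up inputs the cost per call is $0.05, while the cost per successful outcome is roughly $1.06, because rework on the 200 failures dominates the bill.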

**Adoption metrics**: DAU / WAU / MAU on the AI surface, % of eligible tasks using AI (the most revealing: of the tasks that could use AI, what fraction actually do), time-to-engagement (days from access to first real use), stickiness (DAU/MAU ratio), repeat usage (same user, multiple sessions).
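
Two of the adoption metrics above can be sketched from raw event data as follows. The event shape `(user_id, day)`, the task shape `(is_eligible, used_ai)`, and the 30-day window standing in for "monthly" are all assumptions made for this example.

```python
from datetime import date, timedelta
from typing import Iterable


def active_users(events: Iterable[tuple[str, date]], start: date, end: date) -> set[str]:
    """Distinct users with at least one event in [start, end]."""
    return {user for user, day in events if start <= day <= end}


def stickiness(events: list[tuple[str, date]], day: date, window_days: int = 30) -> float:
    """DAU on `day` divided by MAU over the trailing window ending on `day`."""
    dau = active_users(events, day, day)
    mau = active_users(events, day - timedelta(days=window_days - 1), day)
    return len(dau) / len(mau) if mau else 0.0


def eligible_task_coverage(tasks: Iterable[tuple[bool, bool]]) -> float:
    """Share of eligible tasks on which AI was actually used."""
    used_flags = [used_ai for is_eligible, used_ai in tasks if is_eligible]
    return sum(used_flags) / len(used_flags) if used_flags else 0.0
```

Deciding what counts as an eligible task is the hard part in practice, and that definition should be fixed before the rollout so the denominator does not drift.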

**Risk + safety metrics**: **hallucination rate** (% of generations containing factual errors, estimated from a sampled audit), **escalation miss rate** (support: the AI handled a case it should have escalated), **complaint volume**, **refusal rate** (the model declines user requests; too high = broken UX, too low = unsafe), **PII leak incidents**, **attempt rate + block rate**.
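
Because the hallucination rate comes from a sampled audit, it helps to report an interval rather than a single number. The sketch below uses the Wilson score interval, which is one reasonable choice for a binomial proportion and not something the text above prescribes; the function name and the 95% default are assumptions.

```python
from math import sqrt


def wilson_interval(errors: int, sample_size: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for an error rate estimated from an audit sample.

    z=1.96 corresponds to roughly 95% confidence.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if not 0 <= errors <= sample_size:
        raise ValueError("errors must be between 0 and sample_size")
    p = errors / sample_size
    denom = 1 + z**2 / sample_size
    center = (p + z**2 / (2 * sample_size)) / denom
    half_width = z * sqrt(p * (1 - p) / sample_size + z**2 / (4 * sample_size**2)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Illustrative audit: 10 generations with factual errors out of 200 sampled.
low, high = wilson_interval(10, 200)
```

For this made-up audit the point estimate is 5%, but the interval runs from a bit under 3% to roughly 9%, which is worth knowing before comparing two months of audit results.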

**ROI formula (steelmanned)**: ROI = (value_delivered − total_cost) / total_cost. Value_delivered = sum(time_saved × loaded_labor_rate) + sum(errors_avoided × cost_per_error) + sum(revenue_lift × margin) + sum(retention_gain × LTV). Total_cost = API_cost + infra + engineering_build + engineering_maintenance + change_mgmt_effort + amortized_eval_infra. Most projects forget the last three cost items, which understates total cost by 30-50%.
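
The formula above can be sketched in code as follows. The class and field names are hypothetical, each sum is collapsed into a single aggregate per term for brevity, and every number in the example is made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Value:
    hours_saved: float
    loaded_labor_rate: float
    errors_avoided: float
    cost_per_error: float
    revenue_lift: float
    margin: float
    customers_retained: float
    ltv: float

    def total(self) -> float:
        return (
            self.hours_saved * self.loaded_labor_rate
            + self.errors_avoided * self.cost_per_error
            + self.revenue_lift * self.margin
            + self.customers_retained * self.ltv
        )


@dataclass
class Cost:
    api: float
    infra: float
    engineering_build: float
    engineering_maintenance: float
    change_mgmt: float
    amortized_eval_infra: float

    def total(self) -> float:
        return (
            self.api + self.infra + self.engineering_build
            + self.engineering_maintenance + self.change_mgmt
            + self.amortized_eval_infra
        )

    def without_last_three(self) -> float:
        """Cost as seen by a project that forgets the last three items."""
        return self.api + self.infra + self.engineering_build


def roi(value_delivered: float, total_cost: float) -> float:
    if total_cost <= 0:
        raise ValueError("total_cost must be positive")
    return (value_delivered - total_cost) / total_cost


# Illustrative figures only.
value = Value(1000, 60.0, 50, 200.0, 20_000, 0.3, 2, 2_000.0)
cost = Cost(5_000, 3_000, 20_000, 8_000, 10_000, 4_000)
full_roi = roi(value.total(), cost.total())
naive_roi = roi(value.total(), cost.without_last_three())
```

In this example the full-cost ROI is about 0.6, while dropping maintenance, change management, and eval infrastructure inflates it to about 1.9, which shows how much the omitted items can distort the headline number.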

**Experimental design**: holdout A/B (some users get AI, some don't; the cleanest causal read, but it only works when you can split the population), staggered rollout (one team/region at a time; groups not yet rolled out act as a natural counterfactual), before/after with a regression-to-the-mean caveat (weakest, since external factors confound it), synthetic control (advanced: reconstruct a counterfactual cohort using weights fit on the pre-period).
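
For the holdout A/B case, a minimal analysis sketch might look like this. It estimates the difference in mean time per task with a normal-approximation 95% interval using unpooled (Welch-style) standard errors; this is an assumed analysis choice, and with small samples a proper t-test from a statistics library would be more appropriate. The sample data are invented.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def ab_effect(
    control: Sequence[float], treatment: Sequence[float], z: float = 1.96
) -> tuple[float, tuple[float, float]]:
    """Difference in means (treatment minus control) with an approximate CI."""
    if len(control) < 2 or len(treatment) < 2:
        raise ValueError("need at least two observations per group")
    diff = mean(treatment) - mean(control)
    # Unpooled standard error: does not assume equal variances across groups.
    se = sqrt(variance(control) / len(control) + variance(treatment) / len(treatment))
    return diff, (diff - z * se, diff + z * se)


# Minutes per task, illustrative values only.
control_minutes = [10, 12, 11, 13, 12, 11, 10, 12]
treatment_minutes = [8, 9, 7, 9, 8, 10, 8, 9]
effect, (ci_low, ci_high) = ab_effect(control_minutes, treatment_minutes)
```

If the whole interval sits below zero, as it does for these made-up numbers, the time saving is unlikely to be noise; if it straddles zero, the honest report is "no measurable effect yet".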

**Measurement stack**: product analytics (Amplitude, Mixpanel, PostHog) for adoption/behavior; observability (Langfuse, LangSmith, Braintrust) for LLM traces + eval scores; business outcome tracking (warehouse + BI, typically dbt + Looker/Metabase); incident tracking + complaint ingestion. Join them on user_id + task_id + session_id so you can link a trace to an outcome.
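
The join on `user_id` + `task_id` + `session_id` would normally live in the warehouse, but the idea can be sketched in plain Python. The record shapes and the `eval_score` and `resolved` fields below are hypothetical; none of the tools named above is assumed to export data in this form.

```python
from typing import Any

Record = dict[str, Any]
JOIN_KEYS = ("user_id", "task_id", "session_id")


def join_key(record: Record) -> tuple:
    return tuple(record[k] for k in JOIN_KEYS)


def join_traces_to_outcomes(traces: list[Record], outcomes: list[Record]) -> list[Record]:
    """Inner join: keep only traces that have a matching business outcome."""
    outcome_by_key = {join_key(o): o for o in outcomes}
    joined = []
    for trace in traces:
        outcome = outcome_by_key.get(join_key(trace))
        if outcome is not None:
            joined.append({**trace, **outcome})
    return joined


# Hypothetical records from an observability tool and a warehouse table.
traces = [
    {"user_id": "u1", "task_id": "t1", "session_id": "s1", "eval_score": 0.9},
    {"user_id": "u2", "task_id": "t2", "session_id": "s2", "eval_score": 0.4},
]
outcomes = [
    {"user_id": "u1", "task_id": "t1", "session_id": "s1", "resolved": True},
]
rows = join_traces_to_outcomes(traces, outcomes)
```

Traces that drop out of the join are themselves a signal: a high unmatched share usually means the IDs are not being propagated consistently across systems.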

**Decision framework at 90 days**: (a) adoption > 40% AND quality ≥ baseline AND ROI positive → scale; (b) adoption < 20% → rework UX/change management before scaling; (c) quality below baseline → iterate on prompt/eval; (d) adoption OK but ROI near zero → the feature is a habit, not a source of value, so deprioritize investment.
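
The 90-day rules above can be encoded as a small function. The 40% and 20% thresholds come from the text; everything else is an assumption made for this sketch: the order in which rules are checked, the `roi_near_zero` tolerance of 0.05, treating "adoption OK" as 20% or more, and the `"review"` fallback for cases the rules do not cover.

```python
def decide_at_90_days(
    adoption: float,
    quality: float,
    baseline_quality: float,
    roi: float,
    roi_near_zero: float = 0.05,  # assumed tolerance; the text gives no number
) -> str:
    """Map the four headline metrics to a next step."""
    if adoption < 0.20:
        return "rework_ux_and_change_mgmt"  # rule (b)
    if quality < baseline_quality:
        return "iterate_on_quality"  # rule (c)
    if abs(roi) <= roi_near_zero:
        return "deprioritize"  # rule (d): a habit, not value
    if adoption > 0.40 and roi > 0:
        return "scale"  # rule (a)
    return "review"  # assumed fallback, e.g. 20-40% adoption or negative ROI
```

For example, `decide_at_90_days(0.5, 0.9, 0.85, 0.6)` returns `"scale"`, while the same project with 10% adoption returns `"rework_ux_and_change_mgmt"`. Writing the rules down as code forces the team to agree on thresholds before the review rather than during it.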

**Communication upward**: one-page dashboard per project (quality, efficiency, adoption, risk, ROI). Monthly review with the sponsor. Reports should force a 'continue / iterate / kill' decision rather than serve as a progress slideshow.

Grounded on https://www.anthropic.com/customers
