Dualo
AI in Practice

Prompting vs RAG vs Fine-tuning — how to choose

Which technique fits your problem? A decision framework by use case, with real-world trade-offs (cost, complexity, freshness, control).

2 min read

Three levers to adapt a frontier model: **prompting + in-context learning**, **retrieval-augmented generation (RAG)**, **fine-tuning** (or parameter-efficient fine-tuning, PEFT — LoRA, DoRA). Each shifts a different axis: prompting changes instructions (knowledge encoding stays in base weights); RAG injects external knowledge at query time; fine-tuning modifies weights for stable style/format/specialized vocab.

Prompting depth: baseline prompt → add few-shot (5-20 examples) → chain-of-thought → structured output (XML/JSON tags) → self-critique loops. Covers ~80% of typical needs. Eval discipline is the main differentiator between good prompting and random.

**RAG vs prompt-extending context**: if your reference corpus fits in the context window (< 200k tokens effectively), just put it in the prompt (with prompt caching for cost). RAG overhead is only worth it above that threshold or for freshness requirements.

Fine-tuning situations where it's genuinely the right answer: (i) consistent style/format — legal template generation, specific voice/tone for a brand at high volume; (ii) specialized vocabulary — medical coding, pharma drug names, proprietary acronyms that confuse frontier models; (iii) cost/latency at scale — replace Sonnet at $3/1M tokens with a fine-tuned Haiku or open-source at 10-100× lower cost, after volume justifies the effort; (iv) domain-specific reasoning — long but rare; usually RAG + good prompts beats fine-tune for pure knowledge.

**Fine-tuning technical reality**: need labeled dataset (typically 500-50k high-quality input/output pairs), compute (A100/H100 for small models; frontier base fine-tuning is only offered via API providers like OpenAI/Anthropic), inference hosting (you pay idle time unless using serverless inference). Parameter-efficient methods (LoRA) shrink adapter size dramatically, letting you stack multiple task-specific adapters on one base model.

Evaluation = the meta-question: whichever approach you pick, you must have a labeled eval set (≥100 examples, real traffic distribution) to measure improvements. Without evals, you can't compare prompt vs RAG vs fine-tune — you're gambling.

**Hybrid patterns (the norm in production)**: (i) **prompt + RAG** — most common; frontier model with retrieved context; (ii) **RAG + fine-tune** — small fine-tuned model for the final generation, frontier for complex queries; (iii) **router + specialized fine-tunes** — classify query, route to task-specific fine-tuned model for known intents, fallback to frontier for long-tail; (iv) **fine-tuned + RAG** — fine-tune the retriever / reranker rather than the generator; cheaper, fast wins.

Decision flowchart: start with prompt + eval → measure gap → if factual gap + corpus exists, add RAG → if style/format persists as gap after 20+ examples, consider fine-tune → if cost/latency at scale is the bottleneck and task is stable, fine-tune smaller model + distill.

Anti-pattern: 'We have a unique domain so we must fine-tune'. Usually false — RAG + good prompt captures 95% of domain knowledge at 10× less cost and stays fresh without retraining. Fine-tune only after you've pushed prompting + RAG to their limits AND have a clear scale justification.

Grounded on https://www.anthropic.com/research

Next up

AI ROI & Metrics — how to measure real value

Most AI projects fail on measurement, not tech. Here's what to track to know if it's actually working and to justify scaling the budget.