RAG — Retrieval-Augmented Generation
Give an LLM access to your own documents at query time. It is one of the most effective patterns for getting accurate, grounded answers from your own data.
Pipeline overview: (1) indexing (offline) — ingest source documents, split into chunks (~200-800 tokens, overlap 10-20%), embed each chunk with an embedding model, store (chunk_text, embedding_vector, metadata) in a vector DB. (2) query (online) — embed the user query with the same model, perform a similarity search (usually k-NN over cosine similarity) returning top-k chunks, optionally rerank, assemble into the LLM prompt as context, generate answer.
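The two phases above can be sketched in a few lines. This is a minimal illustration, not a production design: `embed` is a toy bag-of-words stand-in for a real embedding model, the "vector DB" is a plain Python list, the documents are assumed to be already split into chunks, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(docs: dict[str, list[str]]) -> list[dict]:
    """Offline phase: store (chunk_text, embedding_vector, metadata) per chunk."""
    index = []
    for doc_id, chunks in docs.items():
        for i, chunk in enumerate(chunks):
            index.append({
                "text": chunk,
                "vector": embed(chunk),
                "meta": {"doc_id": doc_id, "chunk": i},
            })
    return index


def retrieve(index: list[dict], query: str, k: int = 3) -> list[dict]:
    """Online phase: embed the query with the same model, return the top-k chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda row: cosine(q, row["vector"]), reverse=True)
    return ranked[:k]


def build_prompt(query: str, hits: list[dict]) -> str:
    """Assemble the retrieved chunks into the LLM prompt as context."""
    context = "\n\n".join(h["text"] for h in hits)
    return f"Context:\n{context}\n\nQuestion: {query}"


docs = {
    "policy": ["Refunds are issued within 14 days of purchase.",
               "Shipping takes 3 to 5 business days."],
    "faq": ["Passwords must be reset every 90 days."],
}
index = build_index(docs)
hits = retrieve(index, "How long do refunds take?", k=1)
print(build_prompt("How long do refunds take?", hits))
```

In a real system, `embed` would call an embedding model and `retrieve` would be an approximate nearest-neighbour query against the vector DB; the shape of the flow stays the same.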
Embedding models: OpenAI text-embedding-3-large (3072 dims, the larger of OpenAI's two text-embedding-3 models), Cohere embed-v4 (multilingual), Voyage voyage-3-large (general-purpose; Voyage also ships domain-specific models for legal and code), Google text-embedding-005. Dimensionality tradeoff: higher dimensionality tends to improve recall, at the cost of more storage and slower search. Most production systems use 768-1536 dims.
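To make the storage side of that tradeoff concrete, here is a back-of-the-envelope calculation. It assumes uncompressed float32 vectors (4 bytes per value) and counts the raw vectors only; index structures, metadata, and any quantization a given vector DB applies are ignored, and `index_size_gb` is a hypothetical helper.

```python
def index_size_gb(num_chunks: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw size of the stored vectors in GB (float32 by default)."""
    return num_chunks * dims * bytes_per_value / 1e9


# One million chunks at three common dimensionalities.
for dims in (768, 1536, 3072):
    print(f"{dims} dims: {index_size_gb(1_000_000, dims):.2f} GB")
```

For one million chunks this gives roughly 3 GB at 768 dims and roughly 12 GB at 3072 dims, so quadrupling the dimensionality quadruples the raw vector storage.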
Chunking strategy is critical. Naive chunking (fixed 500 tokens) is an OK baseline. Better options: semantic chunking (split on paragraph/section boundaries), parent-document retrieval (match on small chunks, then pass the larger parent chunk to the LLM as context), hierarchical chunking (headings + body). Overlap (10-20%) reduces the chance that an answer is split across two chunks.
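A minimal sketch of the fixed-size baseline with overlap follows. It assumes the text is already tokenized into a list (a real pipeline would use the embedding model's own tokenizer), and `chunk_tokens` is a hypothetical helper name.

```python
def chunk_tokens(tokens: list[str], size: int = 500, overlap: int = 75) -> list[list[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


tokens = [str(i) for i in range(10)]
for chunk in chunk_tokens(tokens, size=4, overlap=1):
    print(chunk)
```

With `size=4` and `overlap=1`, each chunk starts with the last token of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.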
**Retrieval quality levers**: (i) **hybrid search**: combine vector (semantic) + BM25 (keyword) results with reciprocal rank fusion; this catches exact-match queries (IDs, error codes, rare terms) that vector search misses; (ii) **reranking**: a cross-encoder (Cohere Rerank, Jina Reranker, Voyage rerank-2) scores each (query, chunk) pair and reorders the 10-50 retrieved candidates so that true matches rise to the top; (iii) **contextual retrieval** (Anthropic): prepend to each chunk a short LLM-generated context that situates it within its source doc before embedding; this reduces the 'what is this talking about' ambiguity. Anthropic reports ~35% fewer retrieval failures from contextual embeddings alone and ~49% fewer when combined with contextual BM25.
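Reciprocal rank fusion is simple enough to sketch directly. The version below assumes each retriever returns an ordered list of chunk IDs, and uses k = 60, the constant conventionally used for RRF; the function name and sample IDs are illustrative.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists: each item scores sum(1 / (k + rank)) over the lists it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)


vector_hits = ["a", "b", "c"]   # ranked by embedding similarity
bm25_hits = ["c", "a", "d"]     # ranked by keyword score
print(reciprocal_rank_fusion([vector_hits, bm25_hits]))
```

Because RRF uses only ranks, it needs no score normalization between the vector and BM25 retrievers, which is a large part of why it is a common default for hybrid search.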
**Evaluation**: do not optimize RAG by vibes. Build a test set of at least 100 (question, ground-truth-answer, ground-truth-source-chunks) triples. Measure **retrieval metrics** (recall@k, MRR: did we retrieve the right chunks?) and **answer metrics** (LLM-as-judge for faithfulness + relevance). Ragas, TruLens, LangSmith, and Arize Phoenix are standard tools.
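The two retrieval metrics can be computed with a few lines of plain Python. This sketch assumes each test case is reduced to a pair of (retrieved chunk IDs in rank order, set of ground-truth chunk IDs); the function names are hypothetical and not taken from any of the tools listed above.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the ground-truth chunks that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs: list[tuple[list[str], set[str]]], k: int) -> dict[str, float]:
    """Average both metrics over all test questions."""
    n = len(runs)
    return {
        "recall@k": sum(recall_at_k(r, g, k) for r, g in runs) / n,
        "mrr": sum(reciprocal_rank(r, g) for r, g in runs) / n,
    }


runs = [
    (["c1", "c2", "c3"], {"c2"}),         # relevant chunk found at rank 2
    (["c9", "c8", "c7"], {"c7", "c5"}),   # one of two relevant chunks, at rank 3
]
print(evaluate(runs, k=2))
```

Tracking these numbers before and after each change (new chunk size, added reranker, different k) shows whether retrieval actually improved, independently of how the generated answers read.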
Common failure modes & fixes: (i) retrieval miss — the right chunk isn't in top-k → add hybrid search, rerank, or increase k; (ii) retrieval wrong — irrelevant chunks promoted → rerank with cross-encoder; (iii) LLM ignores retrieved context — model defaults to training knowledge → stronger grounding instruction + citation requirement; (iv) stale index — docs changed but index not refreshed → scheduled reindex or event-driven updates; (v) chunking too granular — chunks lack context → parent-document retrieval or larger chunks with overlap.
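For the stale-index case, one possible way to decide what a scheduled reindex has to touch is to compare content hashes. This is a sketch under assumptions: the index is assumed to keep a `doc_id -> hash` record from the last run, and `content_hash` and `plan_reindex` are hypothetical helpers.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's current content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(current_docs: dict[str, str], indexed_hashes: dict[str, str]) -> dict[str, list[str]]:
    """Compare current docs with the hashes recorded at index time.

    Returns doc IDs to re-chunk and re-embed (new or changed) and doc IDs
    whose chunks should be deleted (no longer present in the source).
    """
    upsert = sorted(
        doc_id for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )
    delete = sorted(doc_id for doc_id in indexed_hashes if doc_id not in current_docs)
    return {"upsert": upsert, "delete": delete}


indexed = {"a": content_hash("v1"), "b": content_hash("same"), "gone": content_hash("x")}
current = {"a": "v2", "b": "same", "new": "hello"}
print(plan_reindex(current, indexed))
```

Only changed or new documents are re-embedded, which keeps a frequent reindex schedule cheap; an event-driven setup would run the same comparison for a single document when it changes.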
When NOT to use RAG: (i) questions requiring aggregation/computation across the entire corpus (RAG is pointwise: it retrieves a few chunks, so 'sum invoice totals' won't work; use SQL); (ii) a small corpus, roughly under 50 docs, that fits in the context window (just inject all of it; no retrieval needed); (iii) real-time, precision-critical lookups (RAG can miss the right chunk; better to use structured search and have the LLM explain the result).
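The aggregation case is easy to see with a toy example. The sketch below uses an in-memory SQLite table with made-up invoice rows (the schema and values are purely illustrative): the total depends on every row, which a top-k retrieval over chunks cannot guarantee to see, while a single SQL query computes it exactly.

```python
import sqlite3


def total_invoiced(conn: sqlite3.Connection) -> float:
    """Exact aggregate over all rows, something top-k retrieval cannot provide."""
    (total,) = conn.execute("SELECT SUM(total) FROM invoices").fetchone()
    return total


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
conn.executemany(
    "INSERT INTO invoices (customer, total) VALUES (?, ?)",
    [("acme", 120.0), ("acme", 80.5), ("globex", 42.0)],
)
print(total_invoiced(conn))
```

A reasonable split is to let the database answer the question and have the LLM turn the numeric result into an explanation.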
**Security + privacy**: retrieval returns snippets of source docs, which may contain sensitive data. Enforce user-level access control on the retrieval layer (metadata filter: `WHERE user_can_access(user_id, chunk.source_doc_id)`). Otherwise the LLM becomes a data-leaking bridge around your ACLs.
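A minimal sketch of that filter at the retrieval layer follows. The ACL table, the `Chunk` type, and `retrieve_for_user` are hypothetical; `user_can_access` mirrors the predicate named above, and unknown documents are denied by default.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    source_doc_id: str
    score: float  # similarity score from the vector search


# Hypothetical ACL: source document ID -> users allowed to read it.
DOC_ACL: dict[str, set[str]] = {
    "handbook": {"alice", "bob"},
    "salaries": {"alice"},
}


def user_can_access(user_id: str, source_doc_id: str) -> bool:
    """Deny by default: documents missing from the ACL are visible to nobody."""
    return user_id in DOC_ACL.get(source_doc_id, set())


def retrieve_for_user(user_id: str, candidates: list[Chunk], k: int = 5) -> list[Chunk]:
    """Drop chunks the user may not read BEFORE taking the top-k and building the prompt."""
    allowed = [c for c in candidates if user_can_access(user_id, c.source_doc_id)]
    return sorted(allowed, key=lambda c: c.score, reverse=True)[:k]


candidates = [
    Chunk("Salary bands for 2024 ...", "salaries", 0.91),
    Chunk("Vacation policy ...", "handbook", 0.80),
    Chunk("Draft merger memo ...", "unlisted", 0.95),
]
print([c.source_doc_id for c in retrieve_for_user("bob", candidates)])
```

In practice the filter is usually pushed into the vector DB query as a metadata filter, so that restricted chunks never leave the store and the top-k is not depleted by post-filtering.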
Grounded on https://www.anthropic.com/news/contextual-retrieval