Dualo
AI in Practice

Prompt Engineering — the practical framework

Structured prompting pays off. Role + task + context + constraints + format + examples = reliable outputs instead of dice-rolls.

2 min read

Prompt structure (CO-STAR or similar frameworks): Context (relevant background), Objective (task), Style (tone and register), Tone (empathetic/direct/formal), Audience (who reads), Response format (JSON schema, markdown, XML tags). Adopt a consistent template across your codebase to reduce drift.

System prompt vs user prompt: models weight system prompts more strongly; put invariants there (persona, constraints, format). User prompt carries the request-specific payload. Claude specifically handles long system prompts + XML-tagged structure extremely well.

: include 2-5 input/output examples in the prompt to anchor behavior. Effective for: taxonomy-based classification, structured extraction with specific format, stylistic consistency. Limit: examples consume context budget — for 100k-document extraction, few-shot is fine; for 10M-document, fine-tune.

: ask the model to reason before answering ('Think step by step, then provide the final answer in <answer> tags'). Improves reliability on multi-step reasoning, but increases cost + latency. Extended-thinking models (Claude Opus 4 with thinking) do this natively.

Output structuring via XML / JSON: models trained on instruction-following excel at emitting structured output when asked. Use `<invoice><total>...</total></invoice>` tags rather than free text for reliable downstream parsing. JSON works but is more fragile (schema drift, escape issues). Strong pattern: XML → JSON via a deterministic parser.

Tool use / function calling: for tasks the LLM shouldn't solve alone (arithmetic, live data, database queries), expose typed tools the model can invoke. The model returns a structured tool call; your code executes it; you return the result; the model continues. Essential for agent-like behavior.

Prompt caching (Claude, OpenAI): cache the static prefix (system prompt + few-shot examples + large reference doc) across calls. 90% cost reduction + 80% latency reduction on cache hits. Worth the setup for any high-volume production flow.

Evaluation discipline: prompt changes without eval are prayers. Build an eval set of labeled input/output pairs representing real traffic, automate the scoring (exact match for structured output, LLM-as-judge for qualitative), track metrics across versions. Promptfoo, LangSmith, Braintrust, Humanloop are common tools.

Security-sensitive inputs: user input is untrusted. Sanitize or wrap in XML tags (`<user_input>...</user_input>`) to reduce risk. For high-stakes systems (customer support bots that can refund), add a classifier before the main LLM to detect and reject adversarial input.

Grounded on https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/overview

Next up

RAG — Retrieval-Augmented Generation

Give an LLM access to your own documents at query time. The most effective pattern to get accurate, grounded answers on your data.