AI Agents — what they do and where they break
An agent is an LLM that plans, calls tools, and iterates until a goal is reached. Powerful for multi-step work but brittle — know when to trust one.
Agent = LLM + tools + loop. Formally: at each step, the LLM chooses between (a) emitting a final answer or (b) calling one of its exposed tools with structured arguments. The runtime executes, returns the observation, appends to conversation, invokes the LLM again. Loop until stop condition (final answer, max steps, timeout).
Tool use / function calling is the foundation. Expose typed tools (name, description, input JSON schema). Good descriptions matter more than the implementation — the LLM picks tools based on when they're useful. Ambiguous/overlapping tools lead to wrong picks.
Anthropic's guidance on agent design: start simple. Workflows (predefined sequences of LLM calls) solve many problems without full agency. Reach for agents only when control flow is truly data-dependent and can't be pre-scripted. Common workflow patterns: prompt chaining (output of call N feeds call N+1), routing (classifier picks a sub-flow), parallelization (map-reduce style).
True agent architectures: ReAct (reason + act + observe loop) is the baseline. Plan-and-execute decomposes a goal into steps upfront. Reflexion adds self-critique after failures. Multi-agent — several agents with specialized roles (planner + researcher + coder) coordinated by an orchestrator; valuable when complexity justifies it, overkill otherwise.
Safety + control mechanisms: (i) max-steps limit — hard cap on loop iterations; (ii) tool allowlists per context — a customer support agent can't call the 'delete user' tool; (iii) human-in-the-loop approvals for destructive actions — require confirmation before sending email, deleting, paying; (iv) sandboxing — agents that execute code/commands run in ephemeral containers; (v) budget caps — max dollars per task, max calls per hour.
Observability: agents are long-running and hard to debug. Essential infra: structured trace per task (LangSmith, Langfuse, OpenTelemetry-based W&B Traces, Arize Phoenix) capturing every LLM call, tool invocation, args, result, latency, cost. Without this, incident triage is impossible.
Cost realities: an agent typically consumes 5-50× tokens of a single prompt. Task that costs $0.05 as one-shot might cost $2-5 as an agent loop. Always budget per-task, monitor P95 cost, cache aggressively (prompt caching on the shared prefix across iterations).
Evaluation: define task completion criteria upfront, run agents against a benchmark set, measure success rate, cost, latency, steps. Most organizations lack rigorous agent evals and deploy on anecdotes — a recipe for incidents.
Production maturity signals: (i) retryable on failure (idempotent tools); (ii) partial progress persistence (resume after crash); (iii) per-action authorization integrated with IAM; (iv) rate limiting + backpressure on tool calls; (v) regression-tested prompts in CI.
Diagram
Grounded on https://www.anthropic.com/research/building-effective-agents
Next up
Document Automation — invoices, contracts, forms
Turn incoming PDFs / scans / emails into structured data automatically. One of the highest-ROI AI use cases in most companies.