LLMs for Business — what they can and can't do
A grounded view of where Large Language Models actually excel, where they fail, and the myths worth dispelling before committing to any project.
LLMs are autoregressive models trained on corpus-scale text with a next-token prediction objective, then aligned via RLHF / RLAIF / DPO to follow instructions. Frontier models (Claude 4.x, GPT-5, Gemini 2.5) are mixture-of-experts or dense networks of roughly 70B to 1T+ parameters, instruction-tuned, with context windows of roughly 200k to 2M tokens.
Capability profile (empirically measured). Strong: text transformation (summarize, translate, rewrite, extract to structured JSON), classification with clear taxonomies, code completion for common languages, answering questions when context is provided. Mediocre: freeform reasoning without tools, long-chain planning beyond ~10 steps, consistency across very long outputs, domain knowledge past the training cutoff. Weak without tools: arithmetic beyond ~5 digits, accurate current-event recall, date and duration calculations, verifiable factual claims on niche topics.
**Failure modes**: (i) **hallucinations**: confident generation of false facts, worst in low-training-data areas (niche jurisdictions, obscure people, fresh events); mitigated by RAG, grounding, and source citation; (ii) **prompt-order sensitivity**: instructions earlier in the prompt have more influence than later ones; (iii) **refusal over-caution**: aligned models sometimes refuse valid questions that pattern-match to dangerous topics; (iv) **format drift over long outputs**: structure degrades after ~3-5k output tokens; (v) **mid-context recall drop** ('lost in the middle'): facts in the middle of a long prompt are recalled worse than those at the beginning or end.
Evaluation matters: do not ship an AI feature based on vibes. Build a labeled eval set (≥50 examples, stratified across user intents), measure baseline model performance, iterate on prompt/retrieval, and track regressions. Tools: LangSmith, Braintrust, Humanloop, Weights & Biases, and Promptfoo for CI. Run blind A/B tests with real users for the final validation.
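The loop described above (labeled set, per-intent scores, regression tracking) can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for the tools listed: the example data, the label names, and the `keyword_baseline` stand-in for a model call are all hypothetical, and a real eval set would hold at least 50 examples.

```python
from collections import defaultdict
from typing import Callable

# Hypothetical labeled examples: (user intent, input text, expected label).
# A real set would have >= 50 rows, stratified across intents.
EVAL_SET = [
    ("billing", "I was charged twice this month", "refund"),
    ("billing", "Where can I download my invoice?", "invoice"),
    ("support", "The app crashes when I log in", "bug"),
    ("support", "How do I reset my password?", "howto"),
]


def evaluate(predict: Callable[[str], str], examples=EVAL_SET) -> dict[str, float]:
    """Return accuracy per intent, plus an 'overall' entry."""
    hits, totals = defaultdict(int), defaultdict(int)
    for intent, text, expected in examples:
        correct = predict(text) == expected
        for key in (intent, "overall"):
            totals[key] += 1
            hits[key] += int(correct)
    return {key: hits[key] / totals[key] for key in totals}


def regressions(baseline: dict[str, float], candidate: dict[str, float],
                tolerance: float = 0.0) -> list[str]:
    """List the slices where the candidate scores below the baseline."""
    return [k for k in baseline if candidate.get(k, 0.0) < baseline[k] - tolerance]


def keyword_baseline(text: str) -> str:
    """Stand-in for a model call (assumption: replace with your own client)."""
    lowered = text.lower()
    if "charged" in lowered:
        return "refund"
    if "invoice" in lowered:
        return "invoice"
    if "crash" in lowered:
        return "bug"
    return "howto"


baseline_scores = evaluate(keyword_baseline)
candidate_scores = evaluate(lambda text: "refund")  # deliberately bad variant
failing_slices = regressions(baseline_scores, candidate_scores)
```

Scoring per intent rather than only overall is the point of stratifying: a prompt change that helps one intent can quietly hurt another, and `regressions` surfaces that slice by name so a CI job can fail on it.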
**Cost + latency realities**: for Claude Sonnet 4.x, ~$3 / M input tokens, ~$15 / M output tokens. A 10k-input + 1k-output call ≈ $0.045 (10k × $3/M = $0.030 input, 1k × $15/M = $0.015 output). Latency is ~2-5s per call for non-streaming. At scale (100k calls/day), plan the FinOps upfront: caching (prompt caching, 90% cheaper on cache hits), model tiering (Haiku-tier is several times cheaper than Sonnet-tier for simple classification; check current pricing for the exact ratio), batching (Message Batches API at 50% off).
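The arithmetic above can be turned into a small budgeting helper. This is a rough sketch using the prices quoted in this section as defaults. It makes simplifying assumptions that should be checked against current pricing: cache writes are treated as ordinary input (in practice they may be billed at a premium), and the batch discount is assumed to apply to the whole call and to stack with caching.

```python
def call_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float = 3.0,    # USD per million input tokens (Sonnet-tier, from the text)
    output_price_per_m: float = 15.0,  # USD per million output tokens
    cached_input_tokens: int = 0,      # portion of the input served from the prompt cache
    cache_read_discount: float = 0.90, # "90% cheaper on cache hits"
    batch_discount: float = 0.0,       # 0.5 for the batch discount quoted in the text
) -> float:
    """Estimated USD cost of one call under the simplifying assumptions above."""
    if not 0 <= cached_input_tokens <= input_tokens:
        raise ValueError("cached_input_tokens must be between 0 and input_tokens")
    uncached = input_tokens - cached_input_tokens
    # Cached tokens are billed at (1 - discount) of the normal input price.
    billable_input = uncached + cached_input_tokens * (1.0 - cache_read_discount)
    input_cost = billable_input * input_price_per_m / 1_000_000
    output_cost = output_tokens * output_price_per_m / 1_000_000
    return (input_cost + output_cost) * (1.0 - batch_discount)


CALLS_PER_DAY = 100_000

plain = call_cost(10_000, 1_000)                               # about $0.045
cached = call_cost(10_000, 1_000, cached_input_tokens=8_000)   # 8k of 10k input tokens cached
batched = call_cost(10_000, 1_000, batch_discount=0.5)

daily_plain = plain * CALLS_PER_DAY
daily_cached = cached * CALLS_PER_DAY
```

At 100k calls/day the uncached figure works out to about $4,500 per day, and caching 8k of the 10k input tokens brings it to about $2,340 per day under the same assumptions. Output tokens dominate once the input is cached.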
Model selection heuristic: start with the strongest model for your task, measure quality, then try cheaper/smaller models — production traffic can often run on Haiku-tier while Sonnet-tier handles edge cases. Over-provisioning the model is a common waste.
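One way to let a cheaper tier take most traffic while a stronger tier handles edge cases is a confidence-gated cascade. The sketch below is illustrative only: the text does not say how edge cases are detected, so the `confidence` score, the 0.8 threshold, and both stub models are assumptions standing in for real model calls.

```python
from typing import Callable, NamedTuple


class Answer(NamedTuple):
    label: str
    confidence: float  # hypothetical score in [0, 1]; how it is obtained is up to you


Model = Callable[[str], Answer]


def route(text: str, cheap: Model, strong: Model,
          min_confidence: float = 0.8) -> tuple[str, str]:
    """Try the cheap tier first; escalate to the strong tier when it is unsure.

    Returns (label, tier_used) so the share of escalated traffic can be tracked.
    """
    first = cheap(text)
    if first.confidence >= min_confidence:
        return first.label, "cheap"
    return strong(text).label, "strong"


# Stub models for illustration; replace them with real API calls.
def cheap_stub(text: str) -> Answer:
    if "refund" in text.lower():
        return Answer("billing", 0.95)
    return Answer("other", 0.40)  # unsure about anything else


def strong_stub(text: str) -> Answer:
    return Answer("technical", 0.90)


easy = route("I want a refund", cheap_stub, strong_stub)
hard = route("Webhook retries fail intermittently", cheap_stub, strong_stub)
```

Returning the tier alongside the label makes it possible to measure what fraction of production traffic actually escalates, which is the number that determines whether the cascade saves money.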
Grounded on https://www.anthropic.com/research