AI Risks — hallucinations, prompt injection, privacy

The three big classes of risk on production AI systems, and the practical mitigations that actually work.

2 min read

Hallucinations: mitigations layered: (i) grounding — RAG with strict 'answer only from provided context, else say you don't know' instruction + require citations; (ii) verification — second-pass LLM checks the answer against sources; (iii) self-consistency — sample N responses with temperature > 0, keep the majority-vote answer; (iv) structured generation — constrained decoding (JSON schema, grammar) prevents free-form invention of fields; (v) confidence scores — emit explicit uncertainty markers; (vi) human-in-the-loop on high-stakes outputs (medical, legal, financial).

Prompt injection — two flavors: direct — user types malicious prompt ('ignore previous instructions'); indirect — external content (doc, email, web page, search result) consumed by the LLM contains injected instructions invisible to the user but executed by the model. The indirect variant is the harder and growing threat vector as agents read more third-party data.

Prompt injection defenses: (i) input/output separation — wrap untrusted content in XML tags (<untrusted_document>...</untrusted_document>) with instructions to treat as data only; (ii) least privilege on tools — the bot has no tool that can do damage, or every destructive tool requires human approval; (iii) dual-LLM pattern — a classifier LLM inspects each input + output for policy violations BEFORE the action LLM sees it; (iv) allow-list outputs — destination actions accept only structured, validated outputs (not free text); (v) content security policies on embedded links/code; (vi) prompt hardening — explicit resistance clauses, tested against an adversarial test suite.

Secret + data leakage vectors: (i) user-typed secrets retained in provider logs (use zero-retention endpoint or redact at ingestion); (ii) RAG retrieving docs the user shouldn't access (enforce ACL on retrieval query, not just display); (iii) training data memorization (mitigate with de-duplication, Rényi privacy, DP-SGD for fine-tuning sensitive data; don't train on user data unless contractually authorized); (iv) debug/ops channels — error messages containing prompts; (v) cache poisoning — prompt cache across tenants without scoping.

Zero-retention / no-training guarantees: Anthropic Claude Platform, OpenAI Enterprise, AWS Bedrock, Azure OpenAI, Vertex AI all offer contractual commitments not to train on your data and typically 0-30 day retention. Free consumer tiers (ChatGPT free, Claude consumer) do NOT offer the same guarantees — route production through enterprise/API.

PII handling pipeline: (i) detect at ingest (Presidio, AWS Comprehend, Google DLP); (ii) redact or tokenize; (iii) process with LLM on redacted version; (iv) re-hydrate tokens in output if needed (with audit log); (v) store only what's required. Critical for healthcare (HIPAA), finance (PCI), EU (GDPR).

Eval for safety: adversarial test suite treated as regression tests — known jailbreaks, prompt-injection patterns, boundary cases, high-risk categories — run in CI on every prompt change. Tools: Promptfoo, Giskard, NIST AI Risk Management Framework assessments, Microsoft PyRIT.

Observability for risk: every prompt, every tool call, every LLM response logged with correlation IDs. Anomaly detection on (i) unusual tool call patterns, (ii) refused rate spikes, (iii) long-form PII in outputs, (iv) cost spikes, (v) repeated retries (sign of jailbreak probing). Incident response: ability to kill-switch a prompt/tool, rotate keys, flush cache.

Governance framing: AI risk should be owned jointly by Security (technical controls), Legal/Privacy (compliance), and Product (user experience). Treat AI features like any external-facing system — threat modeling, PIA (Privacy Impact Assessment), DPIA for GDPR, design reviews before ship.

Grounded on https://owasp.org/www-project-top-10-for-large-language-model-applications/