Document Automation — invoices, contracts, forms

Turn incoming PDFs / scans / emails into structured data automatically. One of the highest-ROI AI use cases in most companies.

Easy Technical

2 min read

Document automation pipeline typically combines: **ingestion** (email inbox polling, API upload, file share watcher), **** (Tesseract, Google Cloud Vision, Azure OCR, AWS Textract) for scans, **layout analysis** (page structure, tables, key-value pairs), **field extraction** ( or domain model), **validation** (schema + rules), **routing** (approval workflow).

**Modern approach (2025)**: (Claude Sonnet 4.x, GPT-4o, Gemini 2.5) take PDFs/images directly and emit structured JSON. Replaces the OCR + layout + extract multi-stage for many use cases. Quality: ~90-95% field accuracy on clean structured docs (invoices, POs), ~70-85% on messy/handwritten/low-res.

**Prompting for extraction**: provide the schema as JSON Schema or TypeScript type, supply 2-3 of correctly-extracted documents, require output in the exact schema (with null for missing fields). XML tag wrapping helps parseability. Example: `extract into schema <InvoiceSchema>{ supplier_name, supplier_vat_id, invoice_number, issue_date, due_date, currency, line_items[], subtotal, vat_amount, total }</InvoiceSchema>`.

**Validation layer** (critical — LLMs alone are not enough for finance-grade accuracy): schema validation (zod, pydantic), cross-field rules (line-items sum == subtotal ± 0.01; subtotal + VAT == total; VAT rate matches the country's legal rates), referential integrity (supplier exists in master data), business rules (amount < PO limit, date within fiscal period). Documents failing validation route to human review with highlighted anomalies.

**Human-in-the-loop UX matters more than model accuracy**. The reviewer sees the extracted fields overlaid on the source document image, can click to correct in one keystroke, with auto-relearn from corrections. Good UX cuts review time 3-5× vs retyping.

**Continuous improvement**: every human correction is a training signal. Store (source_doc, initial_extraction, corrected_extraction) triples; periodically review error patterns, update prompts or few-shot examples, run evals. Many teams never do this and plateau at 70% accuracy they could raise to 95%.

**Privacy + sovereignty**: invoices/contracts often contain PII + commercial secrets. Prefer providers with BAA (HIPAA), DPA (GDPR), and no-training-on-data guarantee. For EU: Azure AI in EU region, AWS Bedrock in EU regions, Vertex AI EU, or self-hosted (Mistral, Llama fine-tuned).

**Cost realities at scale**: $0.02-0.15 per document via frontier APIs; specialized APIs ~$0.01-0.05. For 50k docs/month: $500-5000 API cost vs typically $15-30k human cost for the equivalent throughput — 3-30× ROI on API cost alone, before counting the reviewer productivity gain.

**Integrations**: output JSON feeds ERPs (SAP, Oracle, Sage, Netsuite), accounting (Xero, QuickBooks), CRMs, or custom workflows (n8n, Zapier, Make). The model call is 10% of the project — 50% is ingestion + validation + integration, 40% is UX for the reviewer.

Grounded on https://www.anthropic.com/research

Next up

Customer Support Automation

Deflect FAQs with RAG, triage tickets with classifiers, pre-draft responses for agents. Done right, it cuts resolution time 30-50% without killing customer satisfaction.