Evidence
Numbers, not narratives. Every claim on this site links back here. Every Evidence page links to its own raw data, harness code, and measurement provenance.
All Evidence
Five concrete failure cases from the Document Intelligence Agent baseline evaluation, each illustrating a different failure mode, with root-cause analysis and the fix applied during hardening. Covers escalation failure, citation fabrication, chunk boundary miss, tool argument hallucination, and budget exhaustion.
Annotated traces of three Document Intelligence Agent runs showing every step with timing, tokens, and decision points. Covers a clean pass, an escalation failure, and a multi-step tool-using run. Demonstrates how to read traces to diagnose retrieval, decision, and cost issues.
Baseline evaluation of the Document Intelligence Agent across 30 test cases spanning 11 categories. Documents pass rates, failure distribution, and per-case scores using an LLM-judge rubric. Establishes the starting point for the hardening passes described in Chapter 6.
Side-by-side evaluation of three architectures on the same 30 queries. Multi-agent improves pass rate by only 3.4 percentage points over single-agent but costs 2.4x more and takes 2.2x longer. Provides the empirical basis for the book's architecture selection guidance.