Evidence
Numbers, not narratives. Every claim on this site links back here. Every Evidence page links to its own raw data, harness code, and measurement provenance.
All Evidence
Five concrete failure cases from the Document Intelligence Agent baseline evaluation, each illustrating a different failure mode, with root-cause analysis and the fix applied during hardening. Covers escalation failure, citation fabrication, chunk boundary miss, tool argument hallucination, and budget exhaustion.
Annotated traces of three Document Intelligence Agent runs showing every step with timing, tokens, and decision points. Covers a clean pass, an escalation failure, and a multi-step tool-using run. Demonstrates how to read traces to diagnose retrieval, decision, and cost issues.
Baseline evaluation of the Document Intelligence Agent across 30 test cases spanning 11 categories. Documents pass rates, failure distribution, and per-case scores using an LLM-judge rubric. Establishes the starting point for the hardening passes described in Chapter 6.
Side-by-side evaluation of three architectures on the same 30 queries. Multi-agent improves pass rate by only 3.4 percentage points over single-agent but costs 2.4x more and takes 2.2x longer. Provides the empirical basis for the book's architecture selection guidance.