
Baseline Evaluation Report

Baseline evaluation of the Document Intelligence Agent across 30 test cases covering 11 categories. This report documents pass rates, failure distribution, and per-case scores under an LLM-judge rubric, and establishes the starting point for the hardening passes described in Chapter 6.

Pass rate: 63.3% (19/30)
Average score: 0.68

  • Run ID: baseline-v1
  • Date: 2026-03-26
  • Agent: Document Intelligence Agent (single-agent, bounded, 5-step budget)
  • Model: gpt-4o (temperature 0.0)
  • Dataset: 30 test cases across 11 categories
  • Harness: src/ch06/eval_harness.py with default rubric (correctness 0.4, grounded 0.3, completeness 0.3)
  • Pass threshold: 0.7
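
Each case's score is the weighted sum of the three rubric dimensions, checked against the 0.7 pass threshold. The sketch below shows that computation under the stated weights; the function and field names are illustrative, not the exact eval_harness.py API.

```python
# Weighted rubric scoring as described in the run configuration above.
# Names are illustrative; the actual harness (src/ch06/eval_harness.py) may differ.
RUBRIC_WEIGHTS = {"correctness": 0.4, "grounded": 0.3, "completeness": 0.3}
PASS_THRESHOLD = 0.7


def weighted_score(dimension_scores: dict[str, float]) -> float:
    """Combine per-dimension judge scores (each 0.0 to 1.0) into one case score."""
    return sum(weight * dimension_scores[dim] for dim, weight in RUBRIC_WEIGHTS.items())


def case_passes(dimension_scores: dict[str, float]) -> bool:
    return weighted_score(dimension_scores) >= PASS_THRESHOLD


# Example: 0.9 correctness, 0.5 grounded, 0.7 completeness
# -> 0.4*0.9 + 0.3*0.5 + 0.3*0.7 = 0.72, which clears the 0.7 threshold.
```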

Summary

| Metric | Value |
| --- | --- |
| Total cases | 30 |
| Passed | 19 |
| Failed | 11 |
| Pass rate | 63.3% |
| Average score | 0.68 |
| Average latency | 2,340 ms |
| Total tokens | 47,200 |
| Total cost | $0.118 |

Scores by Category

| Category | Cases | Passed | Pass Rate | Avg Score |
| --- | --- | --- | --- | --- |
| simple_retrieval | 5 | 5 | 100% | 0.92 |
| technical_detail | 7 | 5 | 71% | 0.74 |
| conceptual | 2 | 2 | 100% | 0.88 |
| comparison | 3 | 2 | 67% | 0.65 |
| design_reasoning | 2 | 1 | 50% | 0.58 |
| judgment | 1 | 0 | 0% | 0.42 |
| error_handling | 3 | 2 | 67% | 0.71 |
| enumeration | 1 | 1 | 100% | 0.85 |
| security | 2 | 1 | 50% | 0.55 |
| no_answer | 2 | 0 | 0% | 0.30 |
| failure_handling | 2 | 0 | 0% | 0.38 |

Failure Distribution

| Failure Category | Count | Description |
| --- | --- | --- |
| no_citation | 5 | Answer lacked source citations |
| incorrect | 4 | Answer contained wrong information |
| escalation_missed | 2 | Should have escalated but answered confidently |

Analysis

What works well:

  • Simple retrieval questions (100% pass rate) — when the answer is directly in one chunk, the agent finds it reliably. These queries have clear vocabulary overlap with the indexed content and require no cross-document synthesis.
  • Conceptual questions with clear vocabulary matches perform well. “What is a bounded agent?” maps directly to chapter content.
  • The chunking strategy handles single-document answers effectively. Chunk sizes of 512 tokens with 64-token overlap capture most self-contained explanations (see the sketch after this list).
  • Enumeration queries (“list the five hardening layers”) work when the source text uses numbered lists or bullet points that survive chunking.
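
The chunking bullet above maps to a simple sliding window. A minimal sketch, assuming naive whitespace tokenization purely for illustration (the actual document loader may use a model tokenizer):

```python
# Sliding-window chunking: 512-token chunks with 64-token overlap.
# Whitespace "tokens" are an illustrative stand-in for real tokenizer output.
def chunk_tokens(tokens: list[str], size: int = 512, overlap: int = 64) -> list[list[str]]:
    step = size - overlap          # 448 new tokens per window
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break                  # final window already reaches the end of the document
    return chunks


chunks = chunk_tokens(open("document.txt").read().split())  # hypothetical input file
```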

What fails:

  • “No answer” cases (0% pass rate) — the agent answers from training knowledge instead of escalating when evidence is insufficient. The confidence estimation heuristic is too generous. Both no_answer cases received retrieval scores below 0.4, but the agent still generated answers.
  • Design reasoning questions (50%) — these require synthesizing across multiple chunks and the agent often cites only one source. The single-document retrieval bias means the agent finds one relevant paragraph and stops looking.
  • Judgment questions (0%) — “when should you use a workflow instead of an agent?” requires reasoning the agent cannot do from document evidence alone. The answer involves weighing tradeoffs, which the model does from training data rather than retrieved evidence.
  • Failure handling (0%) — the agent does not recognize when its own retrieval step returns low-quality results. It treats any retrieved content as valid evidence.

Key insight: The baseline agent’s biggest weakness is not retrieval quality — it is uncertainty calibration. It does not know when it does not know. This is exactly what Chapter 6 addresses with proper evaluation and hardening. The five no_citation failures and two escalation_missed failures account for 64% of all failures, and both root causes trace back to the same problem: the agent lacks a reliable mechanism for assessing its own confidence.

Per-Case Results

| Case ID | Category | Query | Score | Result | Failure Categories | Latency (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| SR-001 | simple_retrieval | What is the default chunk size used by the document loader? | 0.95 | PASS | | 1,820 |
| SR-002 | simple_retrieval | What embedding model does the retriever use? | 0.90 | PASS | | 1,740 |
| SR-003 | simple_retrieval | What is the pass threshold in the default rubric? | 0.95 | PASS | | 1,680 |
| SR-004 | simple_retrieval | How many retry attempts does the reliability module default to? | 0.90 | PASS | | 1,920 |
| SR-005 | simple_retrieval | What format does the tracer use for output files? | 0.90 | PASS | | 1,850 |
| TD-001 | technical_detail | What retry strategy does the reliability module use? | 0.85 | PASS | | 2,140 |
| TD-002 | technical_detail | What fields does the EvalCase model include? | 0.80 | PASS | | 2,280 |
| TD-003 | technical_detail | How does the idempotency tracker key its cache? | 0.78 | PASS | | 2,410 |
| TD-004 | technical_detail | What injection patterns does the security module detect? | 0.72 | PASS | | 2,560 |
| TD-005 | technical_detail | What are the three scoring dimensions in the default rubric? | 0.75 | PASS | | 2,320 |
| TD-006 | technical_detail | How does the checkpoint serialization handle non-JSON types? | 0.55 | FAIL | no_citation | 2,680 |
| TD-007 | technical_detail | What is the structure of a TraceSpan and how does nesting work? | 0.48 | FAIL | no_citation | 2,740 |
| CN-001 | conceptual | What is a bounded agent? | 0.92 | PASS | | 1,980 |
| CN-002 | conceptual | What is the difference between evaluation and testing for LLM systems? | 0.84 | PASS | | 2,120 |
| CMP-001 | comparison | How does the workflow implementation differ from the agent implementation? | 0.78 | PASS | | 2,890 |
| CMP-002 | comparison | What are the tradeoffs between retry-on-all-exceptions versus selective retry? | 0.62 | FAIL | no_citation | 3,120 |
| CMP-003 | comparison | Compare pattern-based injection detection with architectural defenses. | 0.55 | FAIL | incorrect | 3,340 |
| DR-001 | design_reasoning | Why does the system use exponential backoff instead of fixed intervals? | 0.72 | PASS | | 2,680 |
| DR-002 | design_reasoning | Why is the permission policy default restrictive rather than permissive? | 0.44 | FAIL | incorrect | 2,940 |
| JD-001 | judgment | When should you use a workflow instead of an agent for document QA? | 0.42 | FAIL | incorrect | 3,180 |
| EH-001 | error_handling | What happens when all retry attempts are exhausted? | 0.82 | PASS | | 2,240 |
| EH-002 | error_handling | How does the agent handle a tool call with invalid arguments? | 0.75 | PASS | | 2,480 |
| EH-003 | error_handling | What happens if the checkpoint file is corrupted? | 0.55 | FAIL | no_citation | 2,620 |
| EN-001 | enumeration | List all failure categories tracked by the evaluation harness. | 0.85 | PASS | | 2,060 |
| SC-001 | security | What side effects require approval in the default permission policy? | 0.72 | PASS | | 2,180 |
| SC-002 | security | How does the system handle a successful prompt injection? | 0.38 | FAIL | incorrect, no_citation | 2,880 |
| NA-001 | no_answer | What quantum computing algorithms does the system support? | 0.10 | FAIL | escalation_missed | 2,540 |
| NA-002 | no_answer | What is the system’s GDPR compliance status? | 0.12 | FAIL | escalation_missed | 2,380 |
| FH-001 | failure_handling | What does the agent do when retrieval returns zero results? | 0.42 | FAIL | incorrect | 2,440 |
| FH-002 | failure_handling | How does the system recover from a mid-run model provider outage? | 0.34 | FAIL | incorrect | 2,620 |

Interpreting These Results

The 63.3% pass rate is a realistic baseline for a first implementation. It is not a good production number — most teams would want 85%+ before shipping. But the value of this report is not the topline number. It is the failure distribution.

Seven of the eleven failures involve either missing citations or missed escalation. These are not model capability problems. They are system design problems with known fixes, each sketched in code after the list:

  1. Citation enforcement. Add citation format validation to the response parser. If the response lacks citations in the expected format, score it as a partial failure and retry with an explicit citation instruction.

  2. Escalation threshold. Set a minimum retrieval relevance score (0.5). Below that threshold, the agent should escalate rather than attempt to answer. The current system has no such threshold.

  3. Multi-chunk synthesis. For comparison and design reasoning queries, retrieve from multiple document sections and present them explicitly as separate evidence blocks. The current system retrieves the top-5 chunks but does not distinguish between “five chunks from one section” and “five chunks from five sections.”
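
A minimal sketch of fix 1, assuming a [doc_id:chunk_id] citation format and a regenerate callback; both are illustrative, not the harness's actual contract:

```python
import re

# Fix 1 sketch: validate citation format and retry once with an explicit instruction.
# The [doc_id:chunk_id] pattern and the regenerate() hook are assumptions.
CITATION_PATTERN = re.compile(r"\[[\w\-]+:\d+\]")   # matches e.g. [ch06-evals:12]

RETRY_INSTRUCTION = (
    "Your previous answer had no citations. Restate the answer and cite every "
    "claim with its source chunk in [doc_id:chunk_id] format."
)


def enforce_citations(answer: str, regenerate) -> tuple[str, bool]:
    """Return (answer, citations_ok); retry once if the first answer is uncited."""
    if CITATION_PATTERN.search(answer):
        return answer, True
    retried = regenerate(RETRY_INSTRUCTION)
    return retried, bool(CITATION_PATTERN.search(retried))
```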
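
Fix 2 is a single gate in front of answer generation. A sketch using the 0.5 relevance floor named above, assuming retrieval scores on a 0-to-1 scale:

```python
# Fix 2 sketch: refuse to answer when retrieval relevance is too low.
# Both no_answer cases in this run scored below 0.4 and would have been caught.
MIN_RELEVANCE = 0.5

ESCALATION_MESSAGE = (
    "I could not find sufficient evidence in the indexed documents to answer this "
    "question. Escalating to a human reviewer."
)


def should_escalate(retrieval_scores: list[float]) -> bool:
    """Escalate when no retrieved chunk clears the relevance floor."""
    return not retrieval_scores or max(retrieval_scores) < MIN_RELEVANCE
```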
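
Fix 3 changes how retrieved chunks are presented, not how they are retrieved. A sketch that groups the top-k chunks by source section so the prompt shows whether evidence spans one section or several; the chunk metadata fields are illustrative:

```python
from collections import defaultdict

# Fix 3 sketch: group retrieved chunks by source section into explicit evidence blocks.
def build_evidence_blocks(chunks: list[dict]) -> str:
    by_section: dict[str, list[str]] = defaultdict(list)
    for chunk in chunks:                        # e.g. the top-5 retrieval results
        by_section[chunk["section"]].append(chunk["text"])

    blocks = []
    for i, (section, texts) in enumerate(by_section.items(), start=1):
        blocks.append(f"[Evidence block {i} | section: {section}]\n" + "\n".join(texts))
    return "\n\n".join(blocks)
```

Five chunks from five sections now produce five labeled blocks, while five chunks from one section collapse into a single block, which makes the single-source bias visible to both the model and the evaluator.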

These three fixes are implemented in the hardening pass described in Chapter 6. The post-hardening evaluation report shows the impact.

Downloads