# Baseline Evaluation Report
- Run ID: baseline-v1
- Date: 2026-03-26
- Agent: Document Intelligence Agent (single-agent, bounded, 5-step budget)
- Model: gpt-4o (temperature 0.0)
- Dataset: 30 test cases across 11 categories
- Harness: src/ch06/eval_harness.py with default rubric (correctness 0.4, grounded 0.3, completeness 0.3)
- Pass threshold: 0.7
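The rubric can be read as a weighted sum of three dimensions with a pass cutoff. A minimal sketch of that arithmetic; the function names and dict layout are illustrative, not the harness's actual API:

```python
# Default rubric weights and threshold from the run configuration above.
WEIGHTS = {"correctness": 0.4, "grounded": 0.3, "completeness": 0.3}
PASS_THRESHOLD = 0.7

def rubric_score(dims: dict) -> float:
    """Combine per-dimension scores (each in [0, 1]) into one weighted score."""
    return sum(WEIGHTS[name] * dims[name] for name in WEIGHTS)

def passes(dims: dict) -> bool:
    return rubric_score(dims) >= PASS_THRESHOLD

# 0.4 * 0.9 + 0.3 * 0.6 + 0.3 * 0.7 = 0.75, which clears the 0.7 threshold.
score = rubric_score({"correctness": 0.9, "grounded": 0.6, "completeness": 0.7})
```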
## Summary
| Metric | Value |
|---|---|
| Total cases | 30 |
| Passed | 19 |
| Failed | 11 |
| Pass rate | 63.3% |
| Average score | 0.66 |
| Average latency | 2,428ms |
| Total tokens | 47,200 |
| Total cost | $0.118 |
## Scores by Category
| Category | Cases | Passed | Pass Rate | Avg Score |
|---|---|---|---|---|
| simple_retrieval | 5 | 5 | 100% | 0.92 |
| technical_detail | 7 | 5 | 71% | 0.70 |
| conceptual | 2 | 2 | 100% | 0.88 |
| comparison | 3 | 2 | 67% | 0.65 |
| design_reasoning | 2 | 1 | 50% | 0.58 |
| judgment | 1 | 0 | 0% | 0.42 |
| error_handling | 3 | 2 | 67% | 0.71 |
| enumeration | 1 | 1 | 100% | 0.85 |
| security | 2 | 1 | 50% | 0.55 |
| no_answer | 2 | 0 | 0% | 0.11 |
| failure_handling | 2 | 0 | 0% | 0.38 |
## Failure Distribution
| Failure Category | Count | Description |
|---|---|---|
| no_citation | 5 | Answer lacked source citations |
| incorrect | 6 | Answer contained wrong information |
| escalation_missed | 2 | Should have escalated but answered confidently |

Counts sum to 13 across the 11 failed cases because SC-002 is tagged with both incorrect and no_citation.
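Because a single case can carry more than one failure category, the distribution is a tally over tags rather than over cases. A minimal sketch of that tally; the list-of-lists input shape is assumed:

```python
from collections import Counter

# Assumed: each failed case lists one or more failure categories.
failed_case_tags = [
    ["no_citation"],
    ["incorrect"],
    ["incorrect", "no_citation"],  # one case can carry two tags (e.g. SC-002)
    ["escalation_missed"],
]

# Flatten the per-case tags and count each category.
tally = Counter(tag for tags in failed_case_tags for tag in tags)
```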
## Analysis
What works well:
- Simple retrieval questions (100% pass rate) -- when the answer is directly in one chunk, the agent finds it reliably. These queries have clear vocabulary overlap with the indexed content and require no cross-document synthesis.
- Conceptual questions with clear vocabulary matches perform well. "What is a bounded agent?" maps directly to chapter content.
- The chunking strategy handles single-document answers effectively. Chunk sizes of 512 tokens with 64-token overlap capture most self-contained explanations.
- Enumeration queries ("list the five hardening layers") work when the source text uses numbered lists or bullet points that survive chunking.
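The 512-token chunks with 64-token overlap mentioned above amount to a sliding window over the token stream. A minimal sketch, using a plain list of tokens as a stand-in for a real tokenizer's output:

```python
def chunk_tokens(tokens: list, size: int = 512, overlap: int = 64) -> list:
    """Sliding-window chunking: each chunk shares `overlap` tokens with the previous one."""
    step = size - overlap  # advance 448 tokens per chunk at the defaults
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the final chunk already covers the end of the document
    return chunks
```

The overlap is what lets a self-contained explanation that straddles a chunk boundary still appear whole in at least one chunk.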
What fails:
- "No answer" cases (0% pass rate) -- the agent answers from training knowledge instead of escalating when evidence is insufficient. The confidence estimation heuristic is too generous. Both no_answer cases received retrieval scores below 0.4, but the agent still generated answers.
- Design reasoning questions (50%) -- these require synthesizing across multiple chunks and the agent often cites only one source. The single-document retrieval bias means the agent finds one relevant paragraph and stops looking.
- Judgment questions (0%) -- "when should you use a workflow instead of an agent?" requires reasoning the agent cannot do from document evidence alone. The answer involves weighing tradeoffs, which the model does from training data rather than retrieved evidence.
- Failure handling (0%) -- the agent does not recognize when its own retrieval step returns low-quality results. It treats any retrieved content as valid evidence.
Key insight: The baseline agent's biggest weakness is not retrieval quality -- it is uncertainty calibration. It does not know when it does not know. This is exactly what Chapter 6 addresses with proper evaluation and hardening. The five no_citation failures and two escalation_missed failures account for 64% of all failures, and both root causes trace back to the same problem: the agent lacks a reliable mechanism for assessing its own confidence.
## Per-Case Results
| Case ID | Category | Query | Score | Result | Failure Categories | Latency (ms) |
|---|---|---|---|---|---|---|
| SR-001 | simple_retrieval | What is the default chunk size used by the document loader? | 0.95 | PASS | -- | 1,820 |
| SR-002 | simple_retrieval | What embedding model does the retriever use? | 0.90 | PASS | -- | 1,740 |
| SR-003 | simple_retrieval | What is the pass threshold in the default rubric? | 0.95 | PASS | -- | 1,680 |
| SR-004 | simple_retrieval | How many retry attempts does the reliability module default to? | 0.90 | PASS | -- | 1,920 |
| SR-005 | simple_retrieval | What format does the tracer use for output files? | 0.90 | PASS | -- | 1,850 |
| TD-001 | technical_detail | What retry strategy does the reliability module use? | 0.85 | PASS | -- | 2,140 |
| TD-002 | technical_detail | What fields does the EvalCase model include? | 0.80 | PASS | -- | 2,280 |
| TD-003 | technical_detail | How does the idempotency tracker key its cache? | 0.78 | PASS | -- | 2,410 |
| TD-004 | technical_detail | What injection patterns does the security module detect? | 0.72 | PASS | -- | 2,560 |
| TD-005 | technical_detail | What are the three scoring dimensions in the default rubric? | 0.75 | PASS | -- | 2,320 |
| TD-006 | technical_detail | How does the checkpoint serialization handle non-JSON types? | 0.55 | FAIL | no_citation | 2,680 |
| TD-007 | technical_detail | What is the structure of a TraceSpan and how does nesting work? | 0.48 | FAIL | no_citation | 2,740 |
| CN-001 | conceptual | What is a bounded agent? | 0.92 | PASS | -- | 1,980 |
| CN-002 | conceptual | What is the difference between evaluation and testing for LLM systems? | 0.84 | PASS | -- | 2,120 |
| CMP-001 | comparison | How does the workflow implementation differ from the agent implementation? | 0.78 | PASS | -- | 2,890 |
| CMP-002 | comparison | What are the tradeoffs between retry-on-all-exceptions versus selective retry? | 0.62 | FAIL | no_citation | 3,120 |
| CMP-003 | comparison | Compare pattern-based injection detection with architectural defenses. | 0.55 | FAIL | incorrect | 3,340 |
| DR-001 | design_reasoning | Why does the system use exponential backoff instead of fixed intervals? | 0.72 | PASS | -- | 2,680 |
| DR-002 | design_reasoning | Why is the permission policy default restrictive rather than permissive? | 0.44 | FAIL | incorrect | 2,940 |
| JD-001 | judgment | When should you use a workflow instead of an agent for document QA? | 0.42 | FAIL | incorrect | 3,180 |
| EH-001 | error_handling | What happens when all retry attempts are exhausted? | 0.82 | PASS | -- | 2,240 |
| EH-002 | error_handling | How does the agent handle a tool call with invalid arguments? | 0.75 | PASS | -- | 2,480 |
| EH-003 | error_handling | What happens if the checkpoint file is corrupted? | 0.55 | FAIL | no_citation | 2,620 |
| EN-001 | enumeration | List all failure categories tracked by the evaluation harness. | 0.85 | PASS | -- | 2,060 |
| SC-001 | security | What side effects require approval in the default permission policy? | 0.72 | PASS | -- | 2,180 |
| SC-002 | security | How does the system handle a successful prompt injection? | 0.38 | FAIL | incorrect, no_citation | 2,880 |
| NA-001 | no_answer | What quantum computing algorithms does the system support? | 0.10 | FAIL | escalation_missed | 2,540 |
| NA-002 | no_answer | What is the system's GDPR compliance status? | 0.12 | FAIL | escalation_missed | 2,380 |
| FH-001 | failure_handling | What does the agent do when retrieval returns zero results? | 0.42 | FAIL | incorrect | 2,440 |
| FH-002 | failure_handling | How does the system recover from a mid-run model provider outage? | 0.34 | FAIL | incorrect | 2,620 |
## Interpreting These Results
The 63.3% pass rate is a realistic baseline for a first implementation. It is not a good production number -- most teams would want 85%+ before shipping. But the value of this report is not the topline number. It is the failure distribution.
Seven of eleven failures involve either missing citations or missing escalation. These are not model capability problems. They are system design problems with known fixes:
1. Citation enforcement. Add citation format validation to the response parser. If the response lacks citations in the expected format, score it as a partial failure and retry with an explicit citation instruction.
2. Escalation threshold. Set a minimum retrieval relevance score (0.5). Below that threshold, the agent should escalate rather than attempt to answer. The current system has no such threshold.
3. Multi-chunk synthesis. For comparison and design reasoning queries, retrieve from multiple document sections and present them explicitly as separate evidence blocks. The current system retrieves the top-5 chunks but does not distinguish between "five chunks from one section" and "five chunks from five sections."
These three fixes are implemented in the hardening pass described in Chapter 6. The post-hardening evaluation report shows the impact.
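The citation-enforcement fix is the most mechanical of the three. A minimal sketch of what the response-parser check could look like; the `[source: ...]` citation format and the retry instruction are assumptions, not the book's actual convention:

```python
import re

# Assumed citation format: "[source: <document or section name>]".
CITATION_RE = re.compile(r"\[source:\s*[^\]]+\]")

def has_citation(answer: str) -> bool:
    return bool(CITATION_RE.search(answer))

def citation_retry_instruction(answer: str):
    """Return a retry instruction when citations are missing, else None."""
    if has_citation(answer):
        return None
    return "Cite your sources using the [source: ...] format for every claim."
```

On a missing citation, the harness would score the attempt as a partial failure and re-prompt once with the returned instruction appended, rather than rejecting the answer outright.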