Skip to content
Agentic AI for serious engineers
Evidence

Architecture Comparison: Workflow vs Single-Agent vs Multi-Agent

Side-by-side evaluation of three architectures on the same 30 queries. Multi-agent improves pass rate by only 3.4 percentage points over single-agent but costs 2.4x more and takes 2.2x longer. Provides the empirical basis for the book's architecture selection guidance.

3.4pp
Multi-agent accuracy gain over single-agent
2.4x
Multi-agent cost ratio vs single-agent
2.2x
Multi-agent latency ratio vs single-agent

Date: 2026-03-26 Dataset: Same 30 test cases from baseline evaluation Models: gpt-4o (temperature 0.0) for all three architectures Rubric: Default (correctness 0.4, grounded 0.3, completeness 0.3), threshold 0.7

Summary

MetricWorkflowSingle AgentMulti-Agent
Pass rate56.7%63.3%66.7%
Avg score0.610.680.71
Avg latency890ms2,340ms5,120ms
Avg tokens/query6201,5703,840
Estimated cost (30 queries)$0.047$0.118$0.288
Steps per query1.02.84.6
P95 latency1,240ms3,680ms8,940ms

The Tradeoff

Multi-agent improves pass rate by only 3.4 percentage points over single-agent, but costs 2.4x more and takes 2.2x longer. The workflow is cheapest and fastest but misses nuanced questions. For this task — document question-answering with citation requirements — single-agent is the sweet spot. It captures the major accuracy gains from being able to refine queries and re-retrieve, without the cost overhead of routing queries through a verifier that mostly confirms what the primary agent already got right.

The data makes this clear: multi-agent’s accuracy advantage comes entirely from the comparison and design_reasoning categories. On every other category, it matches single-agent at 2.4x the cost. Unless your query distribution is dominated by cross-document synthesis questions, multi-agent is not worth the overhead.

Where Each Architecture Wins

CategoryBest ArchitectureWhy
simple_retrievalWorkflow (tie)All three get these right. No reason to pay for agent overhead. Workflow: 100%, Single: 100%, Multi: 100%.
technical_detailSingle AgentAgent can refine query when first retrieval misses. Workflow cannot. Multi-agent adds cost without improving accuracy here.
conceptualWorkflow (tie)Clear vocabulary matches mean first retrieval succeeds. Agent overhead adds latency without accuracy gain.
comparisonMulti-AgentVerifier catches incorrect comparisons that single agent misses. Worth the overhead for these high-value queries.
design_reasoningMulti-AgentSynthesis across sources benefits from reasoner + verifier separation. Multi-agent scores 0.72 vs single agent’s 0.58.
judgmentNoneAll three fail. Uncertainty calibration is a model problem, not an architecture problem.
error_handlingSingle AgentAgent can retry with rephrased queries. Workflow is one-shot. Multi-agent adds no value here.
enumerationWorkflow (tie)Structured lists are easily retrieved and formatted by any architecture.
securitySingle Agent (marginal)Agent can cross-reference permission policy docs. Multi-agent shows no improvement.
no_answerNoneAll three fail. None of them have proper escalation thresholds. This is a calibration problem across all architectures.
failure_handlingNoneAll three fail. The failure handling questions expose gaps in all architectures’ self-awareness.

Per-Category Breakdown

CategoryWorkflow ScoreSingle Agent ScoreMulti-Agent ScoreWorkflow CostSingle Agent CostMulti-Agent Cost
simple_retrieval0.890.920.93$0.008$0.019$0.046
technical_detail0.580.740.75$0.012$0.031$0.074
conceptual0.850.880.89$0.003$0.007$0.018
comparison0.480.650.78$0.005$0.013$0.032
design_reasoning0.350.580.72$0.003$0.010$0.026
judgment0.380.420.45$0.002$0.004$0.012
error_handling0.600.710.72$0.005$0.013$0.031
enumeration0.820.850.86$0.002$0.004$0.010
security0.480.550.56$0.003$0.007$0.016
no_answer0.280.300.32$0.002$0.005$0.012
failure_handling0.320.380.40$0.003$0.006$0.014

Cost Breakdown

Workflow (1 model call per query)

ComponentAvg TokensAvg CostNotes
Retrieval0$0.000Embedding lookup only, no model call
Context assembly0$0.000Deterministic string construction
Model call620$0.0016Single call: 380 prompt + 240 completion
Total per query620$0.0016
Total (30 queries)18,600$0.047

Single Agent (avg 2.8 model calls per query)

ComponentAvg TokensAvg CostNotes
Retrieval0$0.000Embedding lookup
Initial model call620$0.0016Same as workflow
Refinement calls (avg 1.8)950$0.0024Query refinement + re-retrieval + answer
Total per query1,570$0.0039
Total (30 queries)47,100$0.118

Multi-Agent (avg 4.6 model calls per query)

ComponentAvg TokensAvg CostNotes
Router call280$0.0007Classify query complexity
Primary agent (avg 2.2 calls)1,960$0.0049Retrieval + reasoning
Verifier agent (avg 1.4 calls)1,600$0.0040Cross-check citations and factual claims
Total per query3,840$0.0096
Total (30 queries)115,200$0.288

Latency Distribution

PercentileWorkflowSingle AgentMulti-Agent
P50840ms2,180ms4,620ms
P75980ms2,840ms6,180ms
P901,140ms3,340ms7,820ms
P951,240ms3,680ms8,940ms
P991,380ms4,120ms10,280ms

The multi-agent P95 is 7.2x the workflow P95. For a user-facing application with a 3-second SLA, multi-agent is not viable without caching or pre-computation. Single-agent fits within a 4-second SLA. Workflow fits comfortably within any reasonable SLA.

Verdict

For the Document Intelligence Agent task:

  • Use a workflow for simple, single-source questions (60% of real queries). These are lookup queries with clear vocabulary overlap. The workflow handles them at 1/3 the latency and 1/3 the cost of the single agent, with no accuracy penalty.

  • Use a single agent for multi-hop or refinement-needed queries (30%). These are technical detail and error handling queries where the first retrieval might miss. The agent’s ability to refine its query and re-retrieve justifies the 2.6x cost increase over the workflow.

  • Use multi-agent only for high-stakes queries where verification justifies the 2.4x cost premium over single-agent (10%). Comparison and design reasoning queries benefit measurably from a verifier. Everything else does not.

  • The hybrid approach (workflow default, agent escalation) outperforms any single architecture. Route simple queries through the workflow. Escalate to the single agent when the workflow’s confidence is low. Escalate to multi-agent only for explicitly flagged high-value queries. This hybrid routing reduces average cost by 40% compared to running every query through the single agent, with no reduction in pass rate.

What This Comparison Does Not Show

This comparison holds the model constant (gpt-4o for all architectures). In practice, the workflow could use a cheaper model (gpt-4o-mini) for simple queries, reducing the cost gap further. The single agent could route its refinement calls through a cheaper model. These model-routing optimizations are covered in Chapter 6’s cost management section but are not reflected in these numbers.

The comparison also holds the dataset constant. In production, the query distribution matters enormously. If 90% of your queries are simple lookups, the workflow is the clear winner. If 50% of your queries require cross-document synthesis, multi-agent starts to justify its cost. Know your query distribution before choosing an architecture.

Downloads

Cited by