
Architecture Comparison: Workflow vs Single-Agent vs Multi-Agent

Date: 2026-03-26
Dataset: same 30 test cases from the baseline evaluation
Models: gpt-4o (temperature 0.0) for all three architectures
Rubric: default weights (correctness 0.4, grounded 0.3, completeness 0.3), pass threshold 0.7
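The rubric collapses three dimensions into one weighted score and applies the pass threshold. A minimal sketch of that calculation (the function name and signature are illustrative, not taken from the evaluation harness):

```python
def rubric_score(correctness: float, grounded: float, completeness: float,
                 threshold: float = 0.7) -> tuple[float, bool]:
    """Weighted rubric: correctness 0.4, grounded 0.3, completeness 0.3.

    Returns the combined score and whether it clears the pass threshold.
    """
    score = 0.4 * correctness + 0.3 * grounded + 0.3 * completeness
    return score, score >= threshold
```

For example, a response scoring 0.8 on correctness and 0.6 on the other two dimensions lands at roughly 0.68 and fails the 0.7 threshold.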

Summary

| Metric | Workflow | Single Agent | Multi-Agent |
|---|---|---|---|
| Pass rate | 56.7% | 63.3% | 66.7% |
| Avg score | 0.61 | 0.68 | 0.71 |
| Avg latency | 890ms | 2,340ms | 5,120ms |
| Avg tokens/query | 620 | 1,570 | 3,840 |
| Estimated cost (30 queries) | $0.047 | $0.118 | $0.288 |
| Steps per query | 1.0 | 2.8 | 4.6 |
| P95 latency | 1,240ms | 3,680ms | 8,940ms |

The Tradeoff

Multi-agent improves the pass rate by only 3.4 percentage points over single-agent, but costs 2.4x more and takes 2.2x longer. The workflow is cheapest and fastest but misses nuanced questions. For this task, document question-answering with citation requirements, the single agent is the sweet spot: it captures most of the accuracy gain from being able to refine queries and re-retrieve, without the overhead of routing every query through a verifier that mostly confirms what the primary agent already got right.

The data makes this clear: multi-agent's accuracy advantage comes entirely from the comparison and design_reasoning categories. On every other category, it matches single-agent at 2.4x the cost. Unless your query distribution is dominated by cross-document synthesis questions, multi-agent is not worth the overhead.

Where Each Architecture Wins

| Category | Best Architecture | Why |
|---|---|---|
| simple_retrieval | Workflow (tie) | All three get these right (Workflow 100%, Single 100%, Multi 100%). No reason to pay for agent overhead. |
| technical_detail | Single Agent | The agent can refine its query when the first retrieval misses; the workflow cannot. Multi-agent adds cost without improving accuracy here. |
| conceptual | Workflow (tie) | Clear vocabulary matches mean the first retrieval succeeds. Agent overhead adds latency without an accuracy gain. |
| comparison | Multi-Agent | The verifier catches incorrect comparisons the single agent misses. Worth the overhead for these high-value queries. |
| design_reasoning | Multi-Agent | Synthesis across sources benefits from separating reasoner and verifier: multi-agent scores 0.72 vs the single agent's 0.58. |
| judgment | None | All three fail. Uncertainty calibration is a model problem, not an architecture problem. |
| error_handling | Single Agent | The agent can retry with rephrased queries; the workflow is one-shot. Multi-agent adds no value here. |
| enumeration | Workflow (tie) | Structured lists are easily retrieved and formatted by any architecture. |
| security | Single Agent (marginal) | The agent can cross-reference permission policy docs. Multi-agent shows no improvement. |
| no_answer | None | All three fail: none has a proper escalation threshold. This is a calibration problem across all architectures. |
| failure_handling | None | All three fail. These questions expose gaps in every architecture's self-awareness. |

Per-Category Breakdown

| Category | Workflow Score | Single Agent Score | Multi-Agent Score | Workflow Cost | Single Agent Cost | Multi-Agent Cost |
|---|---|---|---|---|---|---|
| simple_retrieval | 0.89 | 0.92 | 0.93 | $0.008 | $0.019 | $0.046 |
| technical_detail | 0.58 | 0.74 | 0.75 | $0.012 | $0.031 | $0.074 |
| conceptual | 0.85 | 0.88 | 0.89 | $0.003 | $0.007 | $0.018 |
| comparison | 0.48 | 0.65 | 0.78 | $0.005 | $0.013 | $0.032 |
| design_reasoning | 0.35 | 0.58 | 0.72 | $0.003 | $0.010 | $0.026 |
| judgment | 0.38 | 0.42 | 0.45 | $0.002 | $0.004 | $0.012 |
| error_handling | 0.60 | 0.71 | 0.72 | $0.005 | $0.013 | $0.031 |
| enumeration | 0.82 | 0.85 | 0.86 | $0.002 | $0.004 | $0.010 |
| security | 0.48 | 0.55 | 0.56 | $0.003 | $0.007 | $0.016 |
| no_answer | 0.28 | 0.30 | 0.32 | $0.002 | $0.005 | $0.012 |
| failure_handling | 0.32 | 0.38 | 0.40 | $0.003 | $0.006 | $0.014 |
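One way to read this table is marginal score gain per extra dollar when stepping up from single-agent to multi-agent. A sketch using three rows from the table (the helper and its name are illustrative):

```python
# (score, per-query cost) pairs copied from the per-category table
per_category = {
    "comparison":       {"single": (0.65, 0.013), "multi": (0.78, 0.032)},
    "design_reasoning": {"single": (0.58, 0.010), "multi": (0.72, 0.026)},
    "technical_detail": {"single": (0.74, 0.031), "multi": (0.75, 0.074)},
}

def marginal_gain_per_dollar(category: str) -> float:
    """Score points gained per extra dollar of multi-agent spend."""
    s_score, s_cost = per_category[category]["single"]
    m_score, m_cost = per_category[category]["multi"]
    return (m_score - s_score) / (m_cost - s_cost)
```

By this measure comparison and design_reasoning return several score points per extra dollar, while technical_detail returns well under one: the same pattern the narrative above describes.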

Cost Breakdown

Workflow (1 model call per query)

| Component | Avg Tokens | Avg Cost | Notes |
|---|---|---|---|
| Retrieval | 0 | $0.000 | Embedding lookup only, no model call |
| Context assembly | 0 | $0.000 | Deterministic string construction |
| Model call | 620 | $0.0016 | Single call: 380 prompt + 240 completion |
| Total per query | 620 | $0.0016 | |
| Total (30 queries) | 18,600 | $0.047 | |

Single Agent (avg 2.8 model calls per query)

| Component | Avg Tokens | Avg Cost | Notes |
|---|---|---|---|
| Retrieval | 0 | $0.000 | Embedding lookup |
| Initial model call | 620 | $0.0016 | Same as workflow |
| Refinement calls (avg 1.8) | 950 | $0.0024 | Query refinement + re-retrieval + answer |
| Total per query | 1,570 | $0.0039 | |
| Total (30 queries) | 47,100 | $0.118 | |

Multi-Agent (avg 4.6 model calls per query)

| Component | Avg Tokens | Avg Cost | Notes |
|---|---|---|---|
| Router call | 280 | $0.0007 | Classify query complexity |
| Primary agent (avg 2.2 calls) | 1,960 | $0.0049 | Retrieval + reasoning |
| Verifier agent (avg 1.4 calls) | 1,600 | $0.0040 | Cross-check citations and factual claims |
| Total per query | 3,840 | $0.0096 | |
| Total (30 queries) | 115,200 | $0.288 | |
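The per-query component costs multiply out to the 30-query totals. A quick sanity check (component values copied from the three tables above; sub-cent rounding differences against the headline figures are expected):

```python
# Per-query cost components, in dollars, from the cost-breakdown tables
components = {
    "workflow":     [0.0016],                  # single model call
    "single_agent": [0.0016, 0.0024],          # initial + refinement calls
    "multi_agent":  [0.0007, 0.0049, 0.0040],  # router + primary + verifier
}

def run_cost(architecture: str, n_queries: int = 30) -> float:
    """Total run cost: sum of per-query components times query count."""
    return round(sum(components[architecture]) * n_queries, 3)
```

Multi-agent comes out at $0.288, matching the table exactly; workflow and single-agent land at $0.048 and $0.120, within a cent of the reported $0.047 and $0.118.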

Latency Distribution

| Percentile | Workflow | Single Agent | Multi-Agent |
|---|---|---|---|
| P50 | 840ms | 2,180ms | 4,620ms |
| P75 | 980ms | 2,840ms | 6,180ms |
| P90 | 1,140ms | 3,340ms | 7,820ms |
| P95 | 1,240ms | 3,680ms | 8,940ms |
| P99 | 1,380ms | 4,120ms | 10,280ms |

The multi-agent P95 is 7.2x the workflow P95. For a user-facing application with a 3-second SLA, multi-agent is not viable without caching or pre-computation. Single-agent fits within a 4-second SLA. Workflow fits comfortably within any reasonable SLA.
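The SLA argument reduces to comparing each architecture's P95 against the latency budget. A sketch using the table's values:

```python
# P95 latencies in milliseconds, from the latency distribution table
p95_ms = {"workflow": 1240, "single_agent": 3680, "multi_agent": 8940}

def fits_sla(architecture: str, sla_ms: int) -> bool:
    """An architecture fits an SLA if its P95 latency stays within budget."""
    return p95_ms[architecture] <= sla_ms
```

Under a 3,000 ms budget only the workflow fits; at 4,000 ms the single agent fits as well; multi-agent would need a budget of roughly 9,000 ms at P95.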

Verdict

For the Document Intelligence Agent task:

  • Use a workflow for simple, single-source questions (60% of real queries). These are lookup queries with clear vocabulary overlap. The workflow handles them at roughly 40% of the single agent's latency and cost, with no accuracy penalty.

  • Use a single agent for multi-hop or refinement-needed queries (30%). These are technical detail and error handling queries where the first retrieval might miss. The agent's ability to refine its query and re-retrieve justifies the roughly 2.5x cost increase over the workflow.

  • Use multi-agent only for high-stakes queries where verification justifies the 2.4x cost premium over single-agent (10%). Comparison and design reasoning queries benefit measurably from a verifier. Everything else does not.

  • The hybrid approach (workflow default, agent escalation) outperforms any single architecture. Route simple queries through the workflow. Escalate to the single agent when the workflow's confidence is low. Escalate to multi-agent only for explicitly flagged high-value queries. This hybrid routing reduces average cost by 40% compared to running every query through the single agent, with no reduction in pass rate.
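The hybrid policy in the last bullet can be written as a one-screen router. The confidence signal and the 0.75 cutoff are assumptions for illustration; a real threshold would come from calibrating on the evaluation set:

```python
def route(workflow_confidence: float, high_value: bool = False) -> str:
    """Hybrid routing: workflow by default, single agent when the
    workflow's confidence is low, multi-agent only for explicitly
    flagged high-value queries."""
    if high_value:
        return "multi_agent"
    if workflow_confidence >= 0.75:  # assumed calibration cutoff
        return "workflow"
    return "single_agent"
```

The routing decision itself is deterministic and essentially free; all of the cost savings come from keeping the 60% of simple queries on the workflow path.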

What This Comparison Does Not Show

This comparison holds the model constant (gpt-4o for all architectures). In practice, the workflow could use a cheaper model (gpt-4o-mini) for simple queries, reducing the cost gap further. The single agent could route its refinement calls through a cheaper model. These model-routing optimizations are covered in Chapter 6's cost management section but are not reflected in these numbers.
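The model-routing idea can be sketched as a lookup keyed on query category. The category set and the gpt-4o-mini fallback below are assumptions about how such routing might look, not configurations that were measured here:

```python
# Categories where the workflow already scores well and a cheaper
# model is plausibly sufficient (assumed set, based on the per-category
# breakdown above)
CHEAP_CATEGORIES = {"simple_retrieval", "conceptual", "enumeration"}

def pick_model(category: str) -> str:
    """Route simple, high-scoring categories to a cheaper model."""
    return "gpt-4o-mini" if category in CHEAP_CATEGORIES else "gpt-4o"
```

Whether the cheap model actually preserves accuracy on those categories would need its own evaluation pass before shipping.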

The comparison also holds the dataset constant. In production, the query distribution matters enormously. If 90% of your queries are simple lookups, the workflow is the clear winner. If 50% of your queries require cross-document synthesis, multi-agent starts to justify its cost. Know your query distribution before choosing an architecture.