Interactive Paper · arXiv:2603.11781
From Debate to Deliberation: Structured Collective Reasoning with Typed Epistemic Acts
Multi-agent debate is the dominant approach for collective LLM reasoning, but it discards disagreements, lacks convergence guarantees, and scales poorly. Deliberative Collective Intelligence (DCI) introduces typed reasoning moves (assert, challenge, refine, synthesize), preserved disagreements as first-class objects, and a convergence algorithm that guarantees termination. This page walks through the four hypotheses and their evidence.
Experiment Setup (Section 6)
45 tasks across 7 domains (software architecture, policy analysis, hidden-profile, late-evidence, risk analysis, disagreement-heavy, and routine as negative control). Five systems compared: DCI (4 archetyped delegates + typed grammar + DCI-CF convergence), Single Agent (one LLM with careful instructions), Unstructured Debate (4 LLMs, free-form), Simple Voting (4 independent answers, judge picks best), and Self-Consistency (multiple reasoning paths, best selected). All use the same model. Evaluated blind by Gemini 3 Flash Preview as LLM-as-judge on quality (1–10), risk identification, reasoning depth, and actionability. Bootstrap 95% CIs (10,000 resamples), Wilcoxon signed-rank tests, Holm–Šidák correction.
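The statistical protocol above (percentile bootstrap CIs over 10,000 resamples, paired Wilcoxon signed-rank tests, Holm–Šidák correction) can be sketched as follows. The score arrays here are synthetic placeholders, not the paper's data, and the helper names (`bootstrap_ci`, `holm_sidak`) are illustrative.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)

# Illustrative paired per-task quality scores (NOT the paper's data).
dci_scores = rng.normal(8.2, 1.0, size=45)
debate_scores = dci_scores - rng.normal(0.5, 1.2, size=45)

def bootstrap_ci(diffs, n_resamples=10_000, alpha=0.05):
    """Percentile bootstrap CI for the mean paired difference."""
    idx = rng.integers(0, len(diffs), size=(n_resamples, len(diffs)))
    means = diffs[idx].mean(axis=1)
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])

def holm_sidak(pvals):
    """Holm-Šidák step-down adjustment across m hypotheses."""
    pvals = np.asarray(pvals, dtype=float)
    m = len(pvals)
    adj = np.empty(m)
    running_max = 0.0
    for rank, i in enumerate(np.argsort(pvals)):
        # k-th smallest p-value gets exponent m - k + 1
        p_adj = 1.0 - (1.0 - pvals[i]) ** (m - rank)
        running_max = max(running_max, p_adj)
        adj[i] = min(running_max, 1.0)
    return adj

diffs = dci_scores - debate_scores
lo, hi = bootstrap_ci(diffs)
stat, p = wilcoxon(dci_scores, debate_scores)  # paired, non-parametric
```

The percentile bootstrap makes no normality assumption about the judge's 1–10 scores, and Holm–Šidák controls the family-wise error rate across the multiple system comparisons.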
Structured Deliberation vs Unstructured Debate (Key Finding)
H1: Structured deliberation improves over unstructured multi-agent debate. If the structure of multi-agent interaction matters, DCI should outperform free-form debate with the same number of agents on the same tasks.
The core test: same model, same number of agents (4), same tasks. The only difference is the presence of typed acts, phased sessions, a shared workspace, and a convergence algorithm. Does adding structure to the conversation actually help?
The surprise: On the full task set, H1 is not supported. But filter out routine tasks — tasks that don't need multi-agent deliberation — and the picture changes dramatically.
On non-routine tasks (n=40), DCI scores +0.95 over debate, 95% CI [+0.41, +1.54] — statistically significant. Same agents, same model, same tasks — the only difference is deliberative structure. But single-agent generation (8.84) still outperforms DCI overall (Δ = −0.60, CI [−1.06, −0.15]), as do voting and self-consistency.
Domain-Specific Value (Surprising Result)
H2: DCI especially helps on tasks requiring perspective integration and multi-stakeholder analysis. Its advantage should be largest on hidden-profile, risk-heavy, and policy tasks — and smallest (or negative) on routine tasks.
This is where DCI's story gets interesting. The framework was designed for tasks where no single perspective has the full picture — where you need multiple viewpoints to integrate fragmented knowledge. The seven domains test whether this design intent matches reality.
Hidden-profile tasks (9.56): The only domain where DCI beats the single agent (+0.31, CI [+0.12, +0.49]). Also significantly outperforms self-consistency (+0.48) and debate (+0.53).
Process-sensitive tasks (architecture + policy, n=20): DCI–Debate = +1.44, CI [+0.57, +2.43] — statistically significant. On architecture specifically, debate degrades severely (5.78) while DCI maintains 8.13 — a gap of +2.36.
Routine tasks (5.39): Significantly lower than every baseline (DCI–Debate: −3.19, CI [−4.25, −2.11]). The deliberative machinery actively harms output quality on straightforward tasks.
Coordination Overhead
H3: DCI incurs substantial coordination overhead and is not efficient for routine tasks. If this overhead is real, DCI should consume substantially more tokens than simpler approaches.
Structured deliberation requires multi-stage, multi-agent interaction. Four delegates engaging through typed acts, building shared workspaces, running convergence rounds. How much does this cost?
H3 is strongly supported. But the paper argues the cost is justified whenever three properties that cheaper alternatives cannot provide are required:
- Decision packets. Every DCI session produces a structured artifact: selected option, residual objections, minority report, and reopen conditions. DCI achieves 100% decision packet completeness and 98% minority report presence; no baseline exceeds 16% on either metric.
- Hidden-profile integration. On tasks requiring combination of partial perspectives, DCI achieves 9.56 — the highest score of any system on any domain.
- Process-sensitive performance. On architecture + policy tasks, DCI significantly outperforms debate (+1.44) while providing dissent artifacts that single-agent output cannot.
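The decision packet described above can be sketched as a simple data structure. The field names here are inferred from the paper's prose description (selected option, residual objections, minority report, reopen conditions), not taken from its actual schema, and `is_complete` is a hypothetical completeness check.

```python
from dataclasses import dataclass, field

@dataclass
class MinorityReport:
    delegate: str    # which delegate dissents
    position: str    # the dissenting view
    rationale: str   # why the dissent should be preserved

@dataclass
class DecisionPacket:
    """Structured artifact produced by every deliberation session."""
    selected_option: str
    residual_objections: list[str] = field(default_factory=list)
    minority_reports: list[MinorityReport] = field(default_factory=list)
    reopen_conditions: list[str] = field(default_factory=list)

    def is_complete(self) -> bool:
        # Illustrative rule: a packet is complete when a decision is
        # recorded and conditions to reopen it are stated.
        return bool(self.selected_option) and bool(self.reopen_conditions)

packet = DecisionPacket(
    selected_option="Option B",
    residual_objections=["Operational cost unquantified"],
    minority_reports=[MinorityReport("risk-analyst", "Prefer Option A",
                                     "Lower failure blast radius")],
    reopen_conditions=["Latency target missed in staging"],
)
```

Preserving dissent as data rather than discarding it is what the single-agent and debate baselines structurally cannot do, which is why their completeness scores stay below 16%.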
Component Contributions
H4: Different DCI components matter differently across task classes. Removing archetypes, typed grammar, or the convergence algorithm should produce different effects.
DCI has three main components: archetypes (delegates with specialized perspectives like "security auditor" or "systems architect"), a typed grammar (14 epistemic acts: assert, challenge, refine, synthesize, etc.), and DCI-CF (the convergent flow algorithm that guarantees termination). What happens when you remove each one?
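A toy sketch of how the three components fit together: typed acts as an enum (only the four acts named in this summary are listed; the paper defines 14), delegates as callables, and a bounded loop standing in for convergence. This is an illustration of the shape of the pipeline, not the DCI-CF algorithm itself; its termination guarantee here comes trivially from the round cap.

```python
from enum import Enum, auto

class Act(Enum):
    # Four of the 14 typed epistemic acts; the rest are in the paper.
    ASSERT = auto()
    CHALLENGE = auto()
    REFINE = auto()
    SYNTHESIZE = auto()

def deliberate(delegates, workspace, max_rounds=8):
    """Toy convergence loop: each delegate reads the shared workspace
    and emits an (act, content) move. Stop when a round contains no
    CHALLENGE, or after max_rounds (guaranteeing termination)."""
    for round_num in range(1, max_rounds + 1):
        moves = [delegate(workspace) for delegate in delegates]
        workspace.extend(moves)
        if all(act is not Act.CHALLENGE for act, _ in moves):
            return workspace, round_num
    return workspace, max_rounds

# Hypothetical delegates: one always agrees, one challenges once.
def agreeable(ws):
    return (Act.ASSERT, "No objection")

def skeptic(ws):
    challenged = any(act is Act.CHALLENGE for act, _ in ws)
    return (Act.REFINE, "Accepted") if challenged else (Act.CHALLENGE, "Why?")

workspace, rounds = deliberate([agreeable, skeptic], [])
```

Removing any one piece (fixed delegate perspectives, the act typing, or the convergence rule) leaves a runnable system, which is what the ablation conditions in H4 test.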
H4 is not supported. All three ablation conditions scored at or above the full framework. Removing archetypes (+0.37), typed grammar (+0.08), or the convergence algorithm (+0.09) each produced equal or higher mean scores. Bootstrap confidence intervals confirm none of these differences are significant.
The paper identifies three non-exclusive explanations: (1) a moderate sample with high variance (n=25 per condition), (2) coordination overhead, since the multi-stage pipeline can amplify errors, and (3) over-constraint, since the typed grammar may restrict generation on tasks where free-form communication works better.
Citation & Links
This interactive paper covers all four hypotheses from the paper. The full paper includes additional detail on the 14-act interaction grammar, the DCI-CF convergence algorithm with formal proof, cross-judge validation, and connections to AI safety.
BibTeX
@article{prakash2026dci,
  title={From Debate to Deliberation: Structured Collective Reasoning with Typed Epistemic Acts},
  author={Prakash, Sunil},
  journal={arXiv preprint arXiv:2603.11781},
  year={2026}
}