
Interactive Paper · arXiv:2603.11781

From Debate to Deliberation: Structured Collective Reasoning with Typed Epistemic Acts

Sunil Prakash · 2026 · Read the full paper

Multi-agent debate is the dominant approach for collective LLM reasoning, but it discards disagreements, lacks convergence guarantees, and scales poorly. Deliberative Collective Intelligence (DCI) introduces typed reasoning moves (assert, challenge, refine, synthesize), preserved disagreements as first-class objects, and a convergence algorithm that guarantees termination. This page walks through the four hypotheses and their evidence.

Experiment Setup (Section 6)

45 tasks across 7 domains (software architecture, policy analysis, hidden-profile, late-evidence, risk analysis, disagreement-heavy, and routine as negative control). Five systems compared: DCI (4 archetyped delegates + typed grammar + DCI-CF convergence), Single Agent (one LLM with careful instructions), Unstructured Debate (4 LLMs, free-form), Simple Voting (4 independent answers, judge picks best), and Self-Consistency (multiple reasoning paths, best selected). All use the same model. Evaluated blind by Gemini 3 Flash Preview as LLM-as-judge on quality (1–10), risk identification, reasoning depth, and actionability. Bootstrap 95% CIs (10,000 resamples), Wilcoxon signed-rank tests, Holm–Šidák correction.
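The statistical protocol above (percentile bootstrap over paired score differences) can be sketched in a few lines. The scores below are illustrative placeholders, not values from the paper.

```python
import random

def bootstrap_ci(deltas, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of paired score differences."""
    rng = random.Random(seed)
    n = len(deltas)
    means = sorted(
        sum(rng.choice(deltas) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Illustrative paired quality deltas (system A minus system B), one per task.
deltas = [0.5, 1.2, -0.3, 0.9, 0.7, 1.1, 0.2, -0.1, 0.8, 0.6]
lo, hi = bootstrap_ci(deltas)
print(f"mean Δ = {sum(deltas)/len(deltas):.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

If the interval excludes zero, the paired difference is called significant at the 5% level; the paper additionally applies Wilcoxon signed-rank tests with Holm–Šidák correction across the family of comparisons.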

Structured Deliberation vs Unstructured Debate (Key Finding)

H1: Structured deliberation improves over unstructured multi-agent debate. If the structure of multi-agent interaction matters, DCI should outperform free-form debate with the same number of agents on the same tasks.

The core test: same model, same number of agents (4), same tasks. The only difference is the presence of typed acts, phased sessions, a shared workspace, and a convergence algorithm. Does adding structure to the conversation actually help?

"On the full task set (n=45), DCI scores 0.49 points higher than unstructured debate on overall quality (8.24 vs 7.75). However, bootstrap 95% confidence intervals show this overall difference is not statistically significant: Δ = +0.49, 95% CI [−0.10, +1.12]. The reason is informative: routine tasks drag down DCI's average substantially (5.39 on routine vs 8.58 for debate)." Section 7.1, H1

The surprise: On the full task set, H1 is not supported. But filter out routine tasks — tasks that don't need multi-agent deliberation — and the picture changes dramatically.

Figure 1. Overall quality across systems, with and without routine tasks. On non-routine tasks, DCI–Debate = +0.95, 95% CI [+0.41, +1.54], a statistically significant difference.

On non-routine tasks (n=40), DCI scores +0.95 over debate, 95% CI [+0.41, +1.54] — statistically significant. Same agents, same model, same tasks — the only difference is deliberative structure. But single-agent generation (8.84) still outperforms DCI overall (Δ = −0.60, CI [−1.06, −0.15]), as do voting and self-consistency.


Domain-Specific Value (Surprising Result)

H2: DCI especially helps on tasks requiring perspective integration and multi-stakeholder analysis. Its advantage should be largest on hidden-profile, risk-heavy, and policy tasks — and smallest (or negative) on routine tasks.

This is where DCI's story gets interesting. The framework was designed for tasks where no single perspective has the full picture — where you need multiple viewpoints to integrate fragmented knowledge. The seven domains test whether this design intent matches reality.

Figure 2. Quality by domain. DCI achieves 9.56 on hidden-profile (best in study) but 5.39 on routine (worst in study).
"DCI's strongest domain is hidden-profile tasks (9.56)—the highest score of any system on any domain. Hidden-profile tasks require integrating partial information that no single perspective possesses, and this is exactly where differentiated delegates and structured engagement should help." Section 7.2, H2

Hidden-profile tasks (9.56): The only domain where DCI beats the single agent (+0.31, CI [+0.12, +0.49]). Also significantly outperforms self-consistency (+0.48) and debate (+0.53).

Process-sensitive tasks (architecture + policy, n=20): DCI–Debate = +1.44, CI [+0.57, +2.43] — statistically significant. On architecture specifically, debate degrades severely (5.78) while DCI maintains 8.13 — a gap of +2.36.

Routine tasks (5.39): Significantly lower than every baseline (DCI–Debate: −3.19, CI [−4.25, −2.11]). The deliberative machinery actively harms output quality on straightforward tasks.

"DCI scores 5.39 on routine tasks—significantly lower than every baseline. This negative control confirms strong task-dependence: DCI's deliberative machinery actively harms output quality on straightforward tasks." Section 7.2, H2 (routine negative control)
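Since routine tasks actively suffer under deliberation, a deployment would want to gate DCI behind a task classifier. A minimal sketch of that routing decision follows; the class labels mirror the paper's seven domains, but the router itself is an assumption, not part of DCI.

```python
# Hypothetical router: invoke DCI only for task classes that benefit from
# deliberation; fall back to a cheaper single agent otherwise.
DELIBERATION_CLASSES = {
    "hidden-profile", "software-architecture", "policy-analysis",
    "risk-analysis", "disagreement-heavy", "late-evidence",
}

def choose_system(task_class: str) -> str:
    """Route routine tasks to a single agent, deliberative ones to DCI."""
    return "dci" if task_class in DELIBERATION_CLASSES else "single-agent"

print(choose_system("routine"))         # single-agent
print(choose_system("hidden-profile"))  # dci
```

In practice the classifier would itself be an LLM call, but even this crude allow-list captures the paper's negative-control result: straightforward tasks should never pay the deliberation tax.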

Coordination Overhead

H3: DCI incurs substantial coordination overhead and is not efficient for routine tasks. If this overhead is real, DCI should consume substantially more tokens than simpler approaches.

Structured deliberation requires multi-stage, multi-agent interaction. Four delegates engaging through typed acts, building shared workspaces, running convergence rounds. How much does this cost?

Figure 3. Token consumption vs quality. DCI uses ~62× the tokens of a single agent for 0.60 points lower quality overall. In quality-per-token terms, single-agent generation dominates (2.320 vs 0.035).
"DCI consumes approximately 62× the tokens of a single agent for an overall quality score that is 0.60 points lower—a gap that is now statistically significant. The right question is not 'is DCI efficient?' (it is not) but 'does the task require accountable, auditable deliberation?'" Section 7.3, H3
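The three headline numbers here are mutually consistent, which can be checked by backing out implied token budgets from the quality-per-token figures (the units are the paper's own and are treated as opaque):

```python
# Back out relative token budgets from the reported quality-per-token ratios.
q_single, q_dci = 8.84, 8.24           # overall quality scores
qpt_single, qpt_dci = 2.320, 0.035     # quality per token (paper's units)

tokens_single = q_single / qpt_single  # implied token budget, single agent
tokens_dci = q_dci / qpt_dci           # implied token budget, DCI

ratio = tokens_dci / tokens_single
print(f"implied token ratio ≈ {ratio:.0f}×")  # ≈ 62×
```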

H3 is strongly supported. But the paper argues the cost is justified when three properties matter that cheaper alternatives cannot provide:

  1. Decision packets. Every DCI session produces a structured artifact: selected option, residual objections, minority report, and reopen conditions. DCI achieves 100% decision packet completeness and 98% minority report presence; no baseline exceeds 16% on either metric.
  2. Hidden-profile integration. On tasks requiring combination of partial perspectives, DCI achieves 9.56 — the highest score of any system on any domain.
  3. Process-sensitive performance. On architecture + policy tasks, DCI significantly outperforms debate (+1.44) while providing dissent artifacts that single-agent output cannot.

Component Contributions

H4: Different DCI components matter differently across task classes. Removing archetypes, typed grammar, or the convergence algorithm should produce different effects.

DCI has three main components: archetypes (delegates with specialized perspectives like "security auditor" or "systems architect"), a typed grammar (14 epistemic acts: assert, challenge, refine, synthesize, etc.), and DCI-CF (the convergent flow algorithm that guarantees termination). What happens when you remove each one?
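The interplay of the typed grammar and DCI-CF can be sketched as a loop over open challenges. The four act names come from the paper, but the termination rule shown here (a hard round cap plus a no-open-challenges exit) is an assumed simplification of DCI-CF, not its formal definition.

```python
from enum import Enum

class Act(Enum):
    # Four of the paper's 14 typed epistemic acts.
    ASSERT = "assert"
    CHALLENGE = "challenge"
    REFINE = "refine"
    SYNTHESIZE = "synthesize"

def deliberate(open_challenges: int, max_rounds: int = 10) -> int:
    """Run convergence rounds until no challenges remain or the cap hits.

    Termination is guaranteed because each iteration both resolves one
    challenge and advances the round counter toward max_rounds.
    """
    rounds = 0
    while open_challenges > 0 and rounds < max_rounds:
        open_challenges -= 1   # a REFINE or SYNTHESIZE resolves a challenge
        rounds += 1
    return rounds

print(deliberate(3))  # 3
```

The formal DCI-CF proof in the full paper establishes termination without assuming one resolution per round; this sketch only illustrates the shape of the argument (a monotonically shrinking set of open challenges under a bounded round budget).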

Figure 4. Ablation study (n=25 per condition). Removing any component produces equal or higher mean scores. None of the differences are statistically significant.

H4 is not supported. All three ablation conditions scored at or above the full framework. Removing archetypes (+0.37), typed grammar (+0.08), or the convergence algorithm (+0.09) each produced equal or higher mean scores. Bootstrap confidence intervals confirm none of these differences are significant.

"The No Typed Grammar condition shows the highest variance (±1.47), suggesting that the typed grammar acts as a variance reducer even when it does not improve the mean—removing it yields occasional high and low outliers." Section 7.4, H4

The paper identifies three non-exclusive explanations: (1) a moderate sample with high variance (n=25 per condition), (2) coordination overhead — the multi-stage pipeline can amplify errors, and (3) over-constraint — the typed grammar may restrict generation on tasks where free-form communication works better.


Citation & Links

This interactive paper covers all four hypotheses from the paper. The full paper includes additional detail on the 14-act interaction grammar, the DCI-CF convergence algorithm with formal proof, cross-judge validation, and connections to AI safety.

BibTeX
@article{prakash2026dci,
  title={From Debate to Deliberation: Structured Collective Reasoning with Typed Epistemic Acts},
  author={Prakash, Sunil},
  journal={arXiv preprint arXiv:2603.11781},
  year={2026}
}