The Cost of Consensus: Isolated Self-Correction Prevails Over Unguided Homogeneous Multi-Agent Debate

📅 2026-04-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates the effectiveness and failure mechanisms of unguided homogeneous multi-agent debate as a means of mitigating hallucinations in large language models. It benchmarks debate against isolated self-correction and random noise injection on GSM-Hard and MMLU-Hard, and systematically varies communication density and sampling temperature. The work identifies three primary failure modes of this debate paradigm: sycophantic conformity, contextual fragility, and consensus collapse. Experimental results show that multi-agent debate consumes 2.1–3.4 times more tokens than self-correction yet yields equal or lower accuracy. Isolated self-correction consistently achieves a better cost-accuracy trade-off across all evaluated configurations.

📝 Abstract

Multi-agent debate, where teams of LLMs iteratively exchange rationales and vote on answers, is widely deployed under the assumption that peer review filters hallucinations. Yet the failure dynamics of homogeneous debate remain poorly understood. We report findings from a controlled empirical study of teams of $N{=}10$ homogeneous agents (Qwen2.5-7B, Llama-3.1-8B, Ministral-3-8B) across $R{=}3$ debate rounds on two high-difficulty benchmarks (GSM-Hard and MMLU-Hard). We compare peer debate against isolated self-correction and a stochastic noise control that injects rationales from unrelated problems. We decompose debate failure into three model-dependent pathways: sycophantic conformity, where agents uncritically adopt majority answers (modal adoption up to 85.5%); contextual fragility, where peer rationales destabilize previously correct reasoning (vulnerability rate up to 70.0%); and consensus collapse, where plurality voting discards correct answers already present in the generation pool (oracle gap up to 32.3 percentage points). Ablations over communication density ($K \in \{2,4,9\}$) and sampling temperature ($T \in \{0.4, 0.7\}$) show that conformity is already high at minimal peer exposure ($K{=}2$) and intensifies with greater initial diversity. Across all configurations, debate consumes 2.1-3.4$\times$ more tokens (up to 28,631 tokens per problem) than self-correction while achieving equal or lower accuracy. Our results indicate that, within the 7-8B parameter class, homogeneous teams without structured roles do not benefit from unguided peer exchange, and that isolated self-correction consistently offers a more favorable cost-accuracy tradeoff.
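
To make the "oracle gap" concrete, the following is a minimal sketch of how such a quantity could be computed. It assumes the gap is oracle accuracy (at least one agent in the pool produced the correct answer) minus plurality-vote accuracy, expressed in percentage points. The abstract does not give the exact formula, so this definition, the function names `plurality_vote` and `oracle_gap`, and the toy data are illustrative assumptions rather than the authors' implementation.

```python
from collections import Counter


def plurality_vote(answers):
    """Return the most frequent answer in the pool.

    Ties are broken in favor of the answer that appears first, which is
    an assumption; the paper's tie-breaking rule is not stated here.
    """
    return Counter(answers).most_common(1)[0][0]


def oracle_gap(pools, gold):
    """Oracle accuracy minus plurality-vote accuracy, in percentage points.

    pools: list of answer pools, one list of agent answers per problem.
    gold:  list of reference answers, aligned with `pools`.
    """
    if len(pools) != len(gold) or not gold:
        raise ValueError("pools and gold must be non-empty and aligned")
    # Problems where the voted answer matches the reference.
    voted_correct = sum(plurality_vote(p) == g for p, g in zip(pools, gold))
    # Problems where any agent in the pool produced the reference answer.
    oracle_correct = sum(g in p for p, g in zip(pools, gold))
    return 100.0 * (oracle_correct - voted_correct) / len(gold)


# Toy data (hypothetical): three problems, three agents each.
pools = [["4", "5", "5"], ["7", "7", "8"], ["1", "2", "3"]]
gold = ["4", "7", "9"]
gap = oracle_gap(pools, gold)
```

In the toy data, the first problem illustrates consensus collapse under this definition: the correct answer `"4"` is present in the pool but is outvoted by `"5"`, so it counts toward oracle accuracy but not toward voted accuracy.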

Problem

Research questions and friction points this paper is trying to address.

multi-agent debate

homogeneous agents

hallucination

self-correction

consensus failure

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-agent debate

self-correction

hallucination mitigation