🤖 AI Summary
Large language models (LLMs) frequently produce confident but incorrect answers in mathematical reasoning. To address this, the authors build on Du et al.'s multi-agent debate framework, using a heterogeneous set of medium-capacity models (Gemini-Pro, Mixtral 7B×8, and PaLM 2-M) that argue over several rounds and revise their answers in light of one another's responses. They find that debate helps at every model scale and that diversity among the debating models yields the largest gains: three instances of Gemini-Pro reach only 82% on GSM-8K, whereas the diverse set reaches 91.0% after four rounds of debate, outperforming GPT-4. The same diverse set also sets a new state of the art on ASDiv (94.1%). These findings suggest that a group of diverse, cooperating models can outperform a single, more powerful model on mathematical reasoning tasks.
📝 Abstract
Large language models (LLMs) excel in natural language generation but often confidently produce incorrect responses,
especially in tasks like mathematical reasoning. Chain-of-thought prompting, self-verification, and multi-agent debate are
among the strategies proposed to improve the reasoning and factual accuracy of LLMs. Building on Du et al.’s multi-agent
debate framework, we find that multi-agent debate helps at any model scale, and that diversity of thought elicits stronger
reasoning in debating LLMs. Across various model sizes, performance on mathematical reasoning tasks benefits most when
diversely trained models are used. Remarkably, after 4 rounds of debate, a diverse set of medium-capacity models (Gemini-Pro,
Mixtral 7B×8, and PaLM 2-M) outperforms GPT-4 on the GSM-8K benchmark, scoring 91% accuracy. By comparison, when
3 instances of Gemini-Pro are used, performance only reaches 82%. Finally, this diverse set of medium-capacity models sets
a new state-of-the-art performance on the ASDiv benchmark (94%). These results underscore the idea that the future of AI
is agentic, with diverse cooperating agents yielding emergent capabilities beyond even the most powerful individual models.
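
To make the debate procedure concrete, the following is a minimal sketch of a multi-round debate loop, not the authors' implementation. It assumes each agent is a callable that receives the question and its peers' previous answers, that the first independent answer counts as round one, and that the final answer is chosen by majority vote. The names `debate` and `make_toy_agent` are hypothetical, and the toy agents are simple stand-ins for real model calls.

```python
from collections import Counter
from typing import Callable, List, Sequence

# Assumption: an agent maps (question, peers' previous answers) to an answer.
Agent = Callable[[str, Sequence[str]], str]


def debate(question: str, agents: Sequence[Agent], rounds: int = 4) -> str:
    """Run a simple debate and return the majority answer."""
    # Round 1: every agent answers independently (no peer answers yet).
    answers: List[str] = [agent(question, []) for agent in agents]

    # Later rounds: each agent sees the other agents' previous answers
    # and may revise its own.
    for _ in range(rounds - 1):
        answers = [
            agent(question, answers[:i] + answers[i + 1:])
            for i, agent in enumerate(agents)
        ]

    # Assumption: the final answer is picked by majority vote.
    return Counter(answers).most_common(1)[0][0]


def make_toy_agent(initial: str) -> Agent:
    """Hypothetical stand-in for a model: keeps its initial answer unless
    a strict majority of its peers agree on a different one."""
    def agent(question: str, peer_answers: Sequence[str]) -> str:
        if not peer_answers:
            return initial
        top, count = Counter(peer_answers).most_common(1)[0]
        return top if count > len(peer_answers) / 2 else initial
    return agent


agents = [make_toy_agent("18"), make_toy_agent("18"), make_toy_agent("20")]
final_answer = debate("toy question", agents, rounds=4)
```

In a real setting each agent would wrap a call to a different model, which is where the diversity discussed in the abstract would enter. The toy agents here only illustrate the control flow of repeated revision followed by a vote.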