Revisiting Multi-Agent Debate as Test-Time Scaling: A Systematic Study of Conditional Effectiveness

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multi-agent debate (MAD) has emerged as a test-time scaling technique, yet its effectiveness remains poorly understood, particularly how task difficulty, model scale, and agent diversity condition its performance across domains. Method: This work formally frames MAD as a test-time scaling paradigm and conducts a systematic evaluation on mathematical reasoning (GSM8K, MATH) and safety benchmarks (red-teaming), introducing controllable diversity configurations and comparing against self-refinement and tree-of-thought baselines. Contribution/Results: MAD's gains are highly conditional. In mathematics, improvements are significant only on hard problems or with smaller models (up to +12% accuracy); in red-teaming, controlled diversity mitigates collaboration-induced vulnerabilities, reducing attack success by 37%. The study reveals MAD's dual nature, beneficial under specific conditions yet potentially detrimental otherwise, and proposes principled, controllable optimization strategies. It establishes the first empirical benchmark and formal framework for MAD as test-time scaling, advancing both theoretical understanding and practical deployment.

📝 Abstract
The remarkable growth in large language model (LLM) capabilities has spurred exploration into multi-agent systems, with debate frameworks emerging as a promising avenue for enhanced problem-solving. These multi-agent debate (MAD) approaches, where agents collaboratively present, critique, and refine arguments, potentially offer improved reasoning, robustness, and diverse perspectives over monolithic models. Despite prior studies leveraging MAD, a systematic understanding of its effectiveness compared to self-agent methods, particularly under varying conditions, remains elusive. This paper seeks to fill this gap by conceptualizing MAD as a test-time computational scaling technique, distinguished by collaborative refinement and diverse exploration capabilities. We conduct a comprehensive empirical investigation comparing MAD with strong self-agent test-time scaling baselines on mathematical reasoning and safety-related tasks. Our study systematically examines the influence of task difficulty, model scale, and agent diversity on MAD's performance. Key findings reveal that, for mathematical reasoning, MAD offers limited advantages over self-agent scaling but becomes more effective with increased problem difficulty and decreased model capability, while agent diversity shows little benefit. Conversely, for safety tasks, MAD's collaborative refinement can increase vulnerability, but incorporating diverse agent configurations facilitates a gradual reduction in attack success through the collaborative refinement process. We believe our findings provide critical guidance for the future development of more effective and strategically deployed MAD systems.
Problem

Research questions and friction points this paper is trying to address.

Evaluating multi-agent debate effectiveness versus self-agent methods
Assessing impact of task difficulty and model scale on MAD
Analyzing agent diversity effects in safety and reasoning tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent debate as test-time scaling technique
Collaborative refinement for diverse problem-solving
Empirical study on task difficulty and model scale
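The debate loop the paper studies (agents answer independently, then iteratively see peers' answers and refine their own, with a final consensus answer) can be sketched minimally as below. This is a hedged illustration, not the paper's implementation: the `debate` function, the toy agents, and the majority-vote aggregation are all assumptions standing in for real LLM calls and the paper's actual consensus mechanism.

```python
from collections import Counter
from typing import Callable, List

# An "agent" here is any callable taking (question, peer_answers) -> answer.
# In a real MAD system this would wrap an LLM call; these are stubs.
Agent = Callable[[str, List[str]], str]

def debate(agents: List[Agent], question: str, rounds: int = 2) -> str:
    """Minimal multi-agent debate loop (illustrative sketch).

    Round 0: each agent answers independently.
    Each later round: every agent sees all answers from the previous
    round and may revise its own. The final answer is a majority vote.
    """
    answers = [agent(question, []) for agent in agents]
    for _ in range(rounds):
        answers = [agent(question, answers) for agent in agents]
    return Counter(answers).most_common(1)[0][0]

# Toy agents: two are confident, one starts wrong but defers to a
# peer majority during refinement (hypothetical behavior).
def agent_a(q, peers): return "42"
def agent_b(q, peers): return "42"
def agent_c(q, peers):
    if peers:
        majority = Counter(peers).most_common(1)[0][0]
        if majority != "41":
            return majority  # collaborative refinement: adopt peer consensus
    return "41"

print(debate([agent_a, agent_b, agent_c], "What is 6 * 7?"))  # -> 42
```

Viewing MAD through the test-time scaling lens, the cost knobs are the number of agents and debate rounds; the paper's finding is that spending that budget on debate pays off mainly on harder problems and weaker models.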