Two Heads are Better Than One: Test-time Scaling of Multi-agent Collaborative Reasoning

📅 2025-04-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the limited collaborative reasoning capabilities of multi-agent systems on complex real-world tasks, this paper proposes an adaptive multi-agent framework that integrates model-level collaborative training with system-level dynamic coordination. Key contributions include: (1) a CEO agent mechanism that dynamically regulates discussion protocols and reasoning depth; (2) the construction of M500, a high-quality dataset of 500 multi-agent collaborative reasoning traces; and (3) a test-time scaling (TTS)-enabled cooperative optimization paradigm. The resulting M1-32B model, fine-tuned from Qwen2.5-32B-Instruct, achieves 12%, 41%, and 10% improvements on GPQA-Diamond, AIME2024, and MBPP-Sanitized, respectively, matching state-of-the-art monolithic models such as DeepSeek-R1 on some tasks. All code and datasets are publicly released.

📝 Abstract
Multi-agent systems (MAS) built on large language models (LLMs) offer a promising path toward solving complex, real-world tasks that single-agent systems often struggle to manage. While recent advancements in test-time scaling (TTS) have significantly improved single-agent performance on challenging reasoning tasks, how to effectively scale collaboration and reasoning in MAS remains an open question. In this work, we introduce an adaptive multi-agent framework designed to enhance collaborative reasoning through both model-level training and system-level coordination. We construct M500, a high-quality dataset containing 500 multi-agent collaborative reasoning traces, and fine-tune Qwen2.5-32B-Instruct on this dataset to produce M1-32B, a model optimized for multi-agent collaboration. To further enable adaptive reasoning, we propose a novel CEO agent that dynamically manages the discussion process, guiding agent collaboration and adjusting reasoning depth for more effective problem-solving. Evaluated in an open-source MAS across a range of tasks, including general understanding, mathematical reasoning, and coding, our system significantly outperforms strong baselines. For instance, M1-32B achieves a 12% improvement on GPQA-Diamond, 41% on AIME2024, and 10% on MBPP-Sanitized, matching the performance of state-of-the-art models like DeepSeek-R1 on some tasks. These results highlight the importance of both learned collaboration and adaptive coordination in scaling multi-agent reasoning. Code is available at https://github.com/jincan333/MAS-TTS
Problem

Research questions and friction points this paper is trying to address.

How to scale multi-agent collaboration for complex reasoning tasks
Enhancing collaborative reasoning via adaptive model and system coordination
Improving performance on diverse tasks through dynamic agent management
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive multi-agent framework for collaborative reasoning
M1-32B model fine-tuned for multi-agent collaboration
CEO agent dynamically manages discussion and reasoning
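The CEO-agent idea above can be sketched as a coordination loop: a designated controller picks which worker agent speaks next and decides when the discussion has gone deep enough to stop. This is a minimal illustrative sketch only; all names (`CEOAgent`, `solver`, `critic`, the turn-taking and stopping rules) are assumptions for exposition, not the paper's actual API or protocol.

```python
# Hypothetical sketch of a CEO-coordinated multi-agent discussion.
# Worker agents are stand-in functions; a real system would call LLMs.

def solver(task, history):
    """Stand-in worker agent: appends a solution proposal."""
    return f"solution draft for: {task} (round {len(history) + 1})"

def critic(task, history):
    """Stand-in worker agent: reviews the latest proposal."""
    return f"critique of: {history[-1]}" if history else "no draft yet"

class CEOAgent:
    """Coordinator that decides, each round, which agent speaks next
    and when the discussion is deep enough to stop (adaptive depth)."""

    def __init__(self, agents, max_rounds=6):
        self.agents = agents
        self.max_rounds = max_rounds

    def should_stop(self, history):
        # Toy stopping rule: halt once every agent has spoken twice.
        # The paper's CEO agent would judge discussion quality instead.
        return len(history) >= 2 * len(self.agents)

    def run(self, task):
        history = []
        for round_idx in range(self.max_rounds):
            # Toy scheduling: simple turn-taking; a learned CEO would
            # choose the next speaker based on the discussion state.
            agent = self.agents[round_idx % len(self.agents)]
            history.append(agent(task, history))
            if self.should_stop(history):
                break
        return history

ceo = CEOAgent([solver, critic])
transcript = ceo.run("prove the sum of two even numbers is even")
```

With two agents and the toy stopping rule, the loop ends after four turns; the point of the sketch is only that coordination decisions (who speaks, how long) live in one controller rather than in a fixed discussion script.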