TMAS: Scaling Test-Time Compute via Multi-Agent Synergy

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing test-time scaling methods coordinate parallel reasoning trajectories only weakly, which makes it hard to balance exploration and exploitation and limits further gains in reasoning performance. This work proposes TMAS, a multi-agent collaborative reasoning framework that structures the flow of information across agents, trajectories, and refinement iterations. A hierarchical memory system supports knowledge reuse: an experience bank holds reliable intermediate conclusions and local feedback, and a guideline bank records high-level strategies that have already been explored. A hybrid-reward reinforcement learning scheme tailored to this multi-agent setup is used to improve the exploration-exploitation trade-off. On challenging reasoning benchmarks, the approach shows stronger iterative scaling than existing test-time scaling baselines, and hybrid reward training further improves scaling effectiveness and stability across iterations.

📝 Abstract

Test-time scaling has become an effective paradigm for improving the reasoning ability of large language models by allocating additional computation during inference. Recent structured approaches have further advanced this paradigm by organizing inference across multiple trajectories, refinement rounds, and verification-based feedback. However, existing structured test-time scaling methods either weakly coordinate parallel reasoning trajectories or rely on noisy historical information without explicitly deciding what should be retained and reused, limiting their ability to balance exploration and exploitation. In this work, we propose TMAS, a framework for scaling test-time compute via multi-agent synergy. TMAS organizes inference as a collaborative process among specialized agents, enabling structured information flow across agents, trajectories, and refinement iterations. To support effective cross-trajectory collaboration, TMAS introduces hierarchical memories: the experience bank reuses low-level reliable intermediate conclusions and local feedback, while the guideline bank records previously explored high-level strategies to steer subsequent rollouts away from redundant reasoning patterns. Furthermore, we design a hybrid reward reinforcement learning scheme tailored to TMAS, which jointly preserves basic reasoning capability, enhances experience utilization, and encourages exploration beyond previously attempted solution strategies. Extensive experiments on challenging reasoning benchmarks demonstrate that TMAS achieves stronger iterative scaling than existing test-time scaling baselines, while hybrid reward training further improves scaling effectiveness and stability across iterations. Code and data are available at https://github.com/george-QF/TMAS-code.
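
The two memories described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class names, the `verified` flag, the deduplication rule, and the prompt layout in `build_context` are all assumptions made for illustration. It only shows the general idea of keeping low-level reusable conclusions separate from a record of high-level strategies that later rollouts should avoid repeating.

```python
from dataclasses import dataclass, field


@dataclass
class ExperienceBank:
    """Low-level memory: intermediate conclusions judged reliable (hypothetical design)."""
    conclusions: list[str] = field(default_factory=list)

    def add(self, conclusion: str, verified: bool) -> None:
        # Assumed retention rule: keep only conclusions a verifier accepted, without duplicates.
        if verified and conclusion not in self.conclusions:
            self.conclusions.append(conclusion)


@dataclass
class GuidelineBank:
    """High-level memory: strategies earlier rollouts already explored (hypothetical design)."""
    strategies: list[str] = field(default_factory=list)

    def add(self, strategy: str) -> None:
        # Record each explored strategy once.
        if strategy not in self.strategies:
            self.strategies.append(strategy)


def build_context(problem: str, exp: ExperienceBank, guide: GuidelineBank) -> str:
    """Assemble the text a new rollout would condition on (illustrative format only)."""
    lines = [f"Problem: {problem}"]
    if exp.conclusions:
        lines.append("Reusable conclusions:")
        lines += [f"- {c}" for c in exp.conclusions]
    if guide.strategies:
        lines.append("Strategies already tried (prefer a different one):")
        lines += [f"- {s}" for s in guide.strategies]
    return "\n".join(lines)
```

In this sketch, the experience bank drives exploitation (a new rollout can build on accepted conclusions), while the guideline bank drives exploration (a new rollout is told which approaches are already covered). How TMAS actually decides what to retain is specified in the paper and its code release, not here.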

Problem

Research questions and friction points this paper is trying to address.

test-time scaling

reasoning trajectories

exploration-exploitation trade-off

structured inference

multi-agent collaboration

Innovation

Methods, ideas, or system contributions that make the work stand out.

test-time scaling

multi-agent synergy

hierarchical memory