🤖 AI Summary
To address the shallow reasoning depth of open-source large language models (LLMs) and the poor generalization of multi-agent systems built on them, this paper proposes a reinforcement learning (RL) multi-agent collaborative reasoning framework with agentic pipeline parallelism. The framework decouples agent roles into Solver, Verifier, and Corrector, and builds on Reinforcement Learning with Verifiable Rewards (RLVR) with role-specific reward mechanisms that substantially reduce reward noise in long-horizon training. It further combines test-time expansion with pipeline-style joint training to improve both inference efficiency and policy consistency. Applied to the open-source Qwen3-30B-A3B-Thinking-2507, the method raises AIME2025 accuracy from 86.5% to 93.3% and BeyondAIME from 64.9% to 73.8%, surpassing the much larger Qwen3-235B-A22B-Thinking-2507. These results validate the method's effectiveness and strong generalization on complex, multi-step reasoning tasks.
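The Solver/Verifier/Corrector loop described above can be sketched as follows. This is a minimal illustration of the iterative-refinement control flow only, not the paper's implementation: the function names (`solve`, `verify`, `correct`), the round budget, and the string-typed interfaces are all assumptions for the sake of the example.

```python
from typing import Callable, Tuple

def iterative_refinement(
    problem: str,
    solve: Callable[[str], str],
    verify: Callable[[str, str], Tuple[bool, str]],
    correct: Callable[[str, str, str], str],
    max_rounds: int = 3,
) -> str:
    """Solver proposes a solution; Verifier checks it and emits a critique;
    Corrector revises using that critique, until the Verifier accepts or
    the round budget is exhausted."""
    solution = solve(problem)                            # Solver: initial attempt
    for _ in range(max_rounds):
        accepted, critique = verify(problem, solution)   # Verifier: check + critique
        if accepted:
            break
        solution = correct(problem, solution, critique)  # Corrector: revise
    return solution
```

In MarsRL, each role would be a separately rewarded policy trained jointly; here the agents are plain callables so the loop structure stands alone.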
📝 Abstract
Recent progress in large language models (LLMs) has been propelled by reinforcement learning with verifiable rewards (RLVR) and test-time scaling. However, the limited output length of LLMs constrains the depth of reasoning attainable in a single inference pass. Multi-agent reasoning systems offer a promising alternative by employing multiple agents, including a Solver, Verifier, and Corrector, to iteratively refine solutions. While effective with closed-source models such as Gemini 2.5 Pro, such systems struggle to generalize to open-source models due to their insufficient critique and correction capabilities. To address this, we propose MarsRL, a novel reinforcement learning framework with agentic pipeline parallelism, designed to jointly optimize all agents in the system. MarsRL introduces agent-specific reward mechanisms to mitigate reward noise and employs pipeline-inspired training to improve efficiency on long trajectories. Applied to Qwen3-30B-A3B-Thinking-2507, MarsRL improves AIME2025 accuracy from 86.5% to 93.3% and BeyondAIME from 64.9% to 73.8%, even surpassing Qwen3-235B-A22B-Thinking-2507. These findings highlight the potential of MarsRL to advance multi-agent reasoning systems and broaden their applicability across diverse reasoning tasks.