MarsRL: Advancing Multi-Agent Reasoning System via Reinforcement Learning with Agentic Pipeline Parallelism

📅 2025-11-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the limitations of shallow reasoning depth in single-pass LLM inference and the poor generalization of multi-agent reasoning systems to open-source models, this paper proposes MarsRL, a reinforcement learning (RL)-based multi-agent collaborative reasoning framework built on agentic pipeline parallelism. The framework decouples agent roles into Solver, Verifier, and Corrector, and introduces agent-specific verifiable reward mechanisms, extending reinforcement learning with verifiable rewards (RLVR), to substantially reduce reward noise in long-horizon training. It further employs pipeline-inspired joint training to improve efficiency on long trajectories and keep the agents' policies consistent. Applied to the open-source Qwen3-30B-A3B-Thinking-2507, the system improves accuracy on AIME2025 from 86.5% to 93.3% and on BeyondAIME from 64.9% to 73.8%, even surpassing the much larger Qwen3-235B-A22B-Thinking-2507. These results validate the method's effectiveness and strong generalization on complex, multi-step reasoning tasks.
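The "pipeline-inspired joint training" in the summary refers to overlapping the agent stages across batches rather than running each batch through Solver, Verifier, and Corrector strictly in sequence. A back-of-the-envelope illustration of the throughput benefit, assuming (purely for illustration, not from the paper) that each stage costs one uniform tick:

```python
# Toy illustration of pipeline-style scheduling across agent stages.
# With three stages (Solver, Verifier, Corrector) each taking one tick,
# sequential processing costs stages * batches ticks, while pipelining
# lets batch i+1 enter the Solver while batch i is in the Verifier,
# for a total of stages + batches - 1 ticks.

def sequential_ticks(n_batches: int, n_stages: int = 3) -> int:
    """Total ticks when each batch runs all stages before the next starts."""
    return n_batches * n_stages

def pipelined_ticks(n_batches: int, n_stages: int = 3) -> int:
    """Total ticks when consecutive batches overlap across stages."""
    return n_stages + n_batches - 1

print(sequential_ticks(4))  # 12
print(pipelined_ticks(4))   # 6
```

With long rollouts dominating multi-agent RL training time, this kind of overlap is what makes jointly optimizing all three agents tractable.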

📝 Abstract
Recent progress in large language models (LLMs) has been propelled by reinforcement learning with verifiable rewards (RLVR) and test-time scaling. However, the limited output length of LLMs constrains the depth of reasoning attainable in a single inference process. Multi-agent reasoning systems offer a promising alternative by employing multiple agents, including a Solver, Verifier, and Corrector, to iteratively refine solutions. While effective with closed-source models like Gemini 2.5 Pro, such systems struggle to generalize to open-source models due to their insufficient critique and correction capabilities. To address this, we propose MarsRL, a novel reinforcement learning framework with agentic pipeline parallelism, designed to jointly optimize all agents in the system. MarsRL introduces agent-specific reward mechanisms to mitigate reward noise and employs pipeline-inspired training to enhance efficiency in handling long trajectories. Applied to Qwen3-30B-A3B-Thinking-2507, MarsRL improves AIME2025 accuracy from 86.5% to 93.3% and BeyondAIME from 64.9% to 73.8%, even surpassing Qwen3-235B-A22B-Thinking-2507. These findings highlight the potential of MarsRL to advance multi-agent reasoning systems and broaden their applicability across diverse reasoning tasks.
Problem

Research questions and friction points this paper is trying to address.

Limited reasoning depth in single LLM inference due to output length constraints
Multi-agent reasoning systems struggle with generalization in open-source models
Insufficient critique and correction capabilities in current multi-agent approaches
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning with agentic pipeline parallelism
Agent-specific reward mechanisms to reduce noise
Pipeline-inspired training for long trajectory efficiency
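The innovations above center on the Solver → Verifier → Corrector refinement loop that the agents are jointly trained to execute. A minimal sketch of that control flow, with toy stand-in agents (the functions and the integer "task" below are hypothetical illustrations, not the paper's models):

```python
# Sketch of a Solver -> Verifier -> Corrector iterative refinement loop.
# The task here is a toy: reach a target integer. The verifier produces
# a verifiable critique (the signed error), and the corrector uses it
# to revise the candidate, mirroring the roles described in the paper.

def solver(task: dict) -> int:
    """Propose an initial candidate solution."""
    return task["initial_guess"]

def verifier(task: dict, candidate: int) -> tuple[bool, int]:
    """Check the candidate; return (is_correct, critique)."""
    diff = task["target"] - candidate
    return diff == 0, diff

def corrector(candidate: int, critique: int) -> int:
    """Revise the candidate one step in the direction of the critique."""
    return candidate + (1 if critique > 0 else -1)

def multi_agent_reasoning(task: dict, max_rounds: int = 10) -> int:
    """Iterate solve -> verify -> correct until accepted or out of rounds."""
    candidate = solver(task)
    for _ in range(max_rounds):
        ok, critique = verifier(task, candidate)
        if ok:
            return candidate
        candidate = corrector(candidate, critique)
    return candidate

print(multi_agent_reasoning({"initial_guess": 3, "target": 7}))  # 7
```

In the paper's setting, each role is an LLM policy and the verifier's signal doubles as a verifiable, agent-specific reward, which is what keeps reward noise low over long training horizons.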
Shulin Liu
Tencent Hunyuan Team
Dong Du
Associate Professor, Nanjing University of Science and Technology
Computer Graphics · 3D Computer Vision
Tao Yang
Tencent Hunyuan Team
Yang Li
Tencent Hunyuan Team
Boyu Qiu
Tencent Hunyuan Team