Multi-Agent Deep Research: Training Multi-Agent Systems with M-GRPO

📅 2025-11-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In multi-agent systems, a unified large language model (LLM) struggles to accommodate heterogeneous agents' divergent data distributions and execution frequencies; moreover, cross-server deployment disrupts end-to-end gradient flow and impairs credit assignment. Method: We propose M-GRPO, a hierarchical extension of Group Relative Policy Optimization that enables decoupled co-training of agent-specific LLMs via heterogeneous trajectory alignment and hierarchical credit assignment. We further design a decentralized architecture that shares only minimal sufficient statistics, supporting efficient cross-server optimization. Contribution/Results: Evaluated on GAIA, XBench-DeepSearch, and WebWalkerQA, M-GRPO significantly outperforms single-agent and frozen-sub-agent multi-agent baselines, improving tool-augmented reasoning accuracy, stability, and sample efficiency. To our knowledge, it is the first framework to systematically resolve gradient fragmentation and optimization inconsistency in heterogeneous multi-agent training.

📝 Abstract
Multi-agent systems perform well on general reasoning tasks. However, the lack of training in specialized areas hinders their accuracy. Current training methods train a unified large language model (LLM) for all agents in the system, which may limit performance because different agents face different underlying data distributions. Training multi-agent systems with distinct LLMs is therefore the natural next step, but it introduces optimization challenges: agents operate at different frequencies, rollouts involve varying numbers of sub-agent invocations, and agents are often deployed across separate servers, disrupting end-to-end gradient flow. To address these issues, we propose M-GRPO, a hierarchical extension of Group Relative Policy Optimization designed for vertical multi-agent systems with a main agent (planner) and multiple sub-agents (multi-turn tool executors). M-GRPO computes group-relative advantages for both main and sub-agents, maintaining hierarchical credit assignment. It also introduces a trajectory-alignment scheme that generates fixed-size batches despite variable sub-agent invocations. We deploy a decoupled training pipeline in which agents run on separate servers and exchange minimal statistics via a shared store, enabling scalable training without cross-server backpropagation. In experiments on real-world benchmarks (e.g., GAIA, XBench-DeepSearch, and WebWalkerQA), M-GRPO consistently outperforms both single-agent GRPO and multi-agent GRPO with frozen sub-agents, demonstrating improved stability and sample efficiency. These results show that aligning heterogeneous trajectories and decoupling optimization across specialized agents enhances tool-augmented reasoning.
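The group-relative advantage at the heart of GRPO-style training can be sketched as follows. This is an illustrative reconstruction, not the paper's released code; the function name `group_relative_advantages` and the `eps` stabilizer are my assumptions:

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Score each rollout relative to its sampling group: rewards are
    normalized by the group's mean and standard deviation, so no learned
    value function (critic) is needed."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Rollouts that beat the group average receive positive advantages.
advs = group_relative_advantages([0.0, 1.0, 1.0, 0.0])
```

In M-GRPO this normalization is reportedly applied at both the main-agent and sub-agent levels, which is how hierarchical credit assignment is maintained without backpropagating across servers.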
Problem

Research questions and friction points this paper is trying to address.

Training multi-agent systems with distinct LLMs faces optimization challenges
Variable sub-agent invocation frequencies disrupt standard gradient flow
Separate agent deployment across servers prevents end-to-end training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical credit assignment for multi-agent systems
Trajectory alignment with variable sub-agent invocations
Decoupled training pipeline across separate servers
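The trajectory-alignment idea above can be sketched as a simple pad-or-subsample step that forces a variable number of sub-agent trajectories into fixed-size batches. The details below (subsampling when there are too many, duplicating random entries when there are too few) are my assumptions; the listing does not specify the scheme:

```python
import random

def align_to_fixed_batch(sub_trajectories, batch_size, rng=None):
    """Force a variable-length, non-empty list of sub-agent trajectories
    into a fixed-size batch: subsample when over-full, duplicate random
    entries when under-full."""
    rng = rng or random.Random(0)
    if len(sub_trajectories) >= batch_size:
        return rng.sample(sub_trajectories, batch_size)
    pad = [rng.choice(sub_trajectories)
           for _ in range(batch_size - len(sub_trajectories))]
    return list(sub_trajectories) + pad

batch = align_to_fixed_batch(["t1", "t2"], 4)  # always length 4
```

Fixed-size batches let each agent's optimizer step on a regular schedule even though rollouts invoke sub-agents a varying number of times.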
👥 Authors
Haoyang Hong (Imperial College London)
Jiajun Yin (Ant Group)
Yuan Wang (Ant Group)
Jingnan Liu (Ant Group)
Zhe Chen (Ant Group)
Ailing Yu (Ant Group)
Ji Li (Principal Group Science Manager at Microsoft)
Zhiling Ye (Ant Group)
Hansong Xiao (Ant Group)
Yefei Chen (Ant Group)
Hualei Zhou (Ant Group)
Yun Yue (Ant Group)
Minghui Yang (Ant Group)
Chunxiao Guo (Ant Group)
Junwei Liu (Ant Group)
Peng Wei (Ant Group)
Jinjie Gu (Ant Group)