Multi-Agent Deep Research: Training Multi-Agent Systems with M-GRPO

📅 2025-11-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In multi-agent systems, a unified large language model (LLM) struggles to accommodate heterogeneous agents' divergent data distributions and execution frequencies; moreover, cross-server deployment disrupts end-to-end gradient flow and impairs credit assignment. Method: We propose M-GRPO, a hierarchical extension of Group Relative Policy Optimization that enables decoupled co-training of agent-specific LLMs via heterogeneous trajectory alignment and hierarchical credit assignment. We further design a decentralized architecture that shares only minimal sufficient statistics, supporting efficient cross-server optimization. Contribution/Results: Evaluated on GAIA, XBench-DeepSearch, and WebWalkerQA, M-GRPO significantly outperforms single-agent and frozen-sub-agent multi-agent baselines, improving tool-augmented reasoning accuracy, stability, and sample efficiency. To our knowledge, it is the first framework to systematically resolve gradient fragmentation and optimization inconsistency in heterogeneous multi-agent training.

📝 Abstract
Multi-agent systems perform well on general reasoning tasks. However, the lack of training in specialized areas hinders their accuracy. Current training methods train a unified large language model (LLM) for all agents in the system, which may limit performance because different agents face different underlying data distributions. Training multi-agent systems with distinct LLMs is therefore the natural next step, but it introduces optimization challenges: agents operate at different frequencies, rollouts involve varying numbers of sub-agent invocations, and agents are often deployed across separate servers, disrupting end-to-end gradient flow. To address these issues, we propose M-GRPO, a hierarchical extension of Group Relative Policy Optimization designed for vertical multi-agent systems with a main agent (planner) and multiple sub-agents (multi-turn tool executors). M-GRPO computes group-relative advantages for both main and sub-agents, maintaining hierarchical credit assignment. It also introduces a trajectory-alignment scheme that generates fixed-size batches despite variable sub-agent invocations. We deploy a decoupled training pipeline in which agents run on separate servers and exchange minimal statistics via a shared store, enabling scalable training without cross-server backpropagation. In experiments on real-world benchmarks (e.g., GAIA, XBench-DeepSearch, and WebWalkerQA), M-GRPO consistently outperforms both single-agent GRPO and multi-agent GRPO with frozen sub-agents, demonstrating improved stability and sample efficiency. These results show that aligning heterogeneous trajectories and decoupling optimization across specialized agents enhances tool-augmented reasoning.
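The group-relative advantage at the heart of GRPO-style training can be sketched as follows. This is an illustrative reconstruction, not the paper's released code; the function name `group_relative_advantages` and the `eps` stabilizer are my assumptions:

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Score each rollout relative to its sampling group: rewards are
    normalized by the group's mean and standard deviation, so no learned
    value function (critic) is needed."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Rollouts that beat the group average receive positive advantages.
advs = group_relative_advantages([0.0, 1.0, 1.0, 0.0])
```

In M-GRPO this normalization is reportedly applied at both the main-agent and sub-agent levels, which is how hierarchical credit assignment is maintained without backpropagating across servers.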
Problem

Research questions and friction points this paper is trying to address.

Training multi-agent systems with distinct LLMs faces optimization challenges
Variable sub-agent invocation frequencies disrupt standard gradient flow
Separate agent deployment across servers prevents end-to-end training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical credit assignment for multi-agent systems
Trajectory alignment with variable sub-agent invocations
Decoupled training pipeline across separate servers
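The trajectory-alignment idea above can be sketched as a simple pad-or-subsample step that forces a variable number of sub-agent trajectories into fixed-size batches. The details below (subsampling when there are too many, duplicating random entries when there are too few) are my assumptions; the listing does not specify the scheme:

```python
import random

def align_to_fixed_batch(sub_trajectories, batch_size, rng=None):
    """Force a variable-length, non-empty list of sub-agent trajectories
    into a fixed-size batch: subsample when over-full, duplicate random
    entries when under-full."""
    rng = rng or random.Random(0)
    if len(sub_trajectories) >= batch_size:
        return rng.sample(sub_trajectories, batch_size)
    pad = [rng.choice(sub_trajectories)
           for _ in range(batch_size - len(sub_trajectories))]
    return list(sub_trajectories) + pad

batch = align_to_fixed_batch(["t1", "t2"], 4)  # always length 4
```

Fixed-size batches let each agent's optimizer step on a regular schedule even though rollouts invoke sub-agents a varying number of times.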
👥 Authors
Haoyang Hong (Imperial College London)
Jiajun Yin (Ant Group)
Yuan Wang (Ant Group)
Jingnan Liu (Ant Group)
Zhe Chen (Ant Group)
Ailing Yu (Ant Group)
Ji Li (Principal Group Science Manager at Microsoft)
Zhiling Ye (Ant Group)
Hansong Xiao (Ant Group)
Yefei Chen (Ant Group)
Hualei Zhou (Ant Group)
Yun Yue (Ant Group)
Minghui Yang (Ant Group)
Chunxiao Guo (Ant Group)
Junwei Liu (Ant Group)
Peng Wei (Ant Group)
Jinjie Gu (Ant Group)