🤖 AI Summary
Large language model (LLM)-based multi-agent systems face challenges including static knowledge, uncontrolled outputs, and training instability. Existing multi-agent reinforcement learning (MARL) approaches such as MAPPO rely on centralized critic networks, leading to poor convergence, high computational overhead, and mandatory warm-up phases. To address these issues, we propose Multi-Agent Heterogeneous Group Policy Optimization (MHGPO), a critic-free algorithm that (i) introduces the first critic-free MARL framework for LLM agents, (ii) designs three heterogeneous group-wise sampling strategies that trade off efficiency against performance, and (iii) incorporates cross-group rollout-based relative advantage estimation to enable stable, warm-up-free training. Evaluated on an LLM-powered multi-agent search task, MHGPO improves task performance by 12.7% over MAPPO, trains 1.8× faster, and reduces GPU memory consumption by 35%.
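For readers unfamiliar with critic-free advantage estimation, the core idea in GRPO-family methods is to sample a group of rollouts for the same input and score each rollout against the group's own reward statistics, so no value network is needed. Below is a minimal sketch under that assumption; the function name and the use of a scalar system-level reward per rollout are illustrative, not the paper's exact formulation:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Critic-free advantage estimate for one group of rollouts.

    rewards: shape (G,), the system-level reward of each of G rollouts
    sampled for the same input. Each rollout's advantage is its reward
    standardized against the group's mean and std, replacing the learned
    value baseline that a critic network would otherwise provide.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 rollouts of a multi-agent pipeline on one query.
# Rollouts that beat the group average receive positive advantage.
adv = group_relative_advantages(torch.tensor([0.2, 0.8, 0.5, 0.9]))
```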
📝 Abstract
Large Language Models (LLMs) have achieved remarkable success across diverse natural language processing tasks, yet their deployment in real-world applications is hindered by fixed knowledge cutoffs and difficulty generating controllable, accurate outputs in a single inference pass. Multi-agent systems (MAS) built from specialized LLM agents offer a promising solution, enabling dynamic collaboration and iterative reasoning. However, optimizing these systems remains a challenge, as conventional methods such as prompt engineering and supervised fine-tuning entail high engineering overhead and limited adaptability. Reinforcement learning (RL), particularly multi-agent reinforcement learning (MARL), provides a scalable framework by refining agent policies based on system-level feedback. Nevertheless, existing MARL algorithms, such as Multi-Agent Proximal Policy Optimization (MAPPO), rely on critic networks, which can cause training instability and increase computational burden. To address these limitations, and targeting the prototypical Multi-Agent Search System (MASS), we propose Multi-Agent Heterogeneous Group Policy Optimization (MHGPO), a novel critic-free algorithm that guides policy updates by estimating relative reward advantages across heterogeneous groups of rollouts. By eliminating critic networks, MHGPO enhances training stability and reduces computational overhead. Additionally, we introduce three group rollout sampling strategies that trade off efficiency against effectiveness. Experiments on a multi-agent LLM-based search system demonstrate that MHGPO consistently outperforms MAPPO in both task performance and computational efficiency, without requiring a warm-up phase, underscoring its potential for stable and scalable optimization of complex LLM-based MAS.
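The abstract does not spell out the policy update itself, but a critic-free method of this kind typically plugs group-relative advantages into a PPO-style clipped surrogate objective for each agent's policy. The sketch below follows that assumption; the function and its arguments are hypothetical, not the paper's API:

```python
import torch

def clipped_policy_loss(logp_new: torch.Tensor,
                        logp_old: torch.Tensor,
                        advantages: torch.Tensor,
                        clip_eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped surrogate loss driven by group-relative advantages.

    logp_new / logp_old: per-rollout log-probabilities of the sampled
    outputs under one agent's current and rollout-time policies. Because
    the advantages come from group statistics rather than a learned value
    function, no critic network, critic optimizer state, or critic
    activations need to reside in GPU memory.
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```

One plausible arrangement in the heterogeneous multi-agent setting is that each agent (for example, a query rewriter and an answer generator in a search pipeline) computes this loss over its own outputs, while the shared system-level reward drives the group-relative advantages.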