Stronger Together: On-Policy Reinforcement Learning for Collaborative LLMs

πŸ“… 2025-10-13
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Multi-agent large language models (LLMs) struggle to collaborate effectively under dynamic role switching and heterogeneous dialogue turns; conventional on-policy reinforcement learning (e.g., GRPO) assumes static grouping, leading to suboptimal policy optimization. Method: We propose AT-GRPO, the first on-policy RL framework to group rollouts online along *both* agent identity and dialogue turn. It unifies single- and multi-policy training architectures, integrating role-aware multi-agent systems, GRPO-style online policy updates, distributed training, and workflow replay. Contribution/Results: Experiments demonstrate substantial gains across long-horizon planning, code generation, and mathematical reasoning: planning accuracy improves from a 14.0%–47.0% single-agent RL baseline to 96.0%–99.5%, while coding and math tasks see average gains of 3.87%–7.62% and 9.0%–17.93%, respectively.

πŸ“ Abstract
Multi-agent systems (MAS) and reinforcement learning (RL) are widely used to enhance the agentic capabilities of large language models (LLMs). MAS improves task performance through role-based orchestration, while RL uses environmental rewards to learn stronger policies, such as GRPO-style optimization. However, applying on-policy RL to MAS remains underexplored and presents unique challenges. Algorithmically, standard GRPO grouping assumptions break down because prompts vary by role and by turn. System-wise, the training stack must support MAS-workflow rollouts and on-policy updates for both single-policy and multi-policy models. We propose AT-GRPO, which includes (i) an agent- and turn-wise grouped RL algorithm tailored to MAS and (ii) a training system that supports both single- and multi-policy regimes. Across game, planning, coding, and math tasks, AT-GRPO delivers substantial gains. On long-horizon planning, it increases accuracy from single-agent RL baselines of 14.0 to 47.0 percent up to 96.0 to 99.5 percent. It also improves reasoning performance, with average gains of 3.87 to 7.62 percent on coding tasks and 9.0 to 17.93 percent on math. Code and environments are available at: https://github.com/pettingllms-ai/PettingLLMs.
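The abstract's core algorithmic change is replacing GRPO's single per-prompt grouping with groups keyed by both agent identity and dialogue turn, so that rewards are only normalized against comparable samples. A minimal sketch of that grouping step, using hypothetical field names (`agent`, `turn`, `reward`) since the paper's actual data structures are not reproduced here:

```python
from collections import defaultdict
from statistics import mean, pstdev

def grouped_advantages(rollouts, eps=1e-8):
    """Sketch of agent- and turn-wise grouped advantage estimation.

    Each rollout sample is a dict with hypothetical keys:
      'agent'  - role identity (e.g. 'planner', 'coder'),
      'turn'   - dialogue-turn index within the workflow,
      'reward' - scalar reward assigned to that sample.
    Samples are grouped by (agent, turn) rather than GRPO's single
    per-prompt group, then rewards are z-normalized within each group.
    """
    groups = defaultdict(list)
    for sample in rollouts:
        groups[(sample["agent"], sample["turn"])].append(sample)

    advantages = {}
    for key, samples in groups.items():
        rewards = [s["reward"] for s in samples]
        mu, sigma = mean(rewards), pstdev(rewards)
        for i, s in enumerate(samples):
            # Group-relative advantage: reward centered and scaled
            # only against samples from the same (agent, turn) group.
            advantages[(key, i)] = (s["reward"] - mu) / (sigma + eps)
    return advantages
```

The design point this illustrates: a planner's turn-0 rewards never normalize against a coder's turn-1 rewards, which is exactly the grouping assumption that breaks in vanilla GRPO when prompts vary by role and turn.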
Problem

Research questions and friction points this paper is trying to address.

Applying on-policy reinforcement learning to multi-agent LLM systems
Addressing algorithmic challenges in role-based prompt variations
Developing training systems supporting multi-agent workflow rollouts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Agent- and turn-wise grouped RL algorithm for MAS
Training system supporting single- and multi-policy regimes
On-policy reinforcement learning for collaborative language models
Yujie Zhao
University of California, San Diego

Lanxiang Hu
University of California, San Diego
Machine Learning · Distributed Systems · Embedded Systems

Yang Wang
Intel Corporation

Minmin Hou
Intel Corporation

Hao Zhang
University of California, San Diego

Ke Ding
Intel Corporation

Jishen Zhao
Professor at University of California, San Diego
Computer Architecture · Computer Systems · Machine Learning Systems · Electronic Design Automation