Heterogeneous Agent Collaborative Reinforcement Learning

📅 2026-03-03
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the low sample efficiency and limited knowledge transfer arising from isolated training of heterogeneous agents in reinforcement learning. To overcome these challenges, we propose a novel collaborative reinforcement learning paradigm built upon the HACPO algorithm, incorporating four key mechanisms to ensure unbiased advantage estimation and correct policy optimization. Our approach enables bidirectional knowledge exchange and shared experience trajectories among agents with diverse architectures during training, while preserving independent execution at inference timeβ€”thus balancing collaborative gains with deployment flexibility. Experimental results across various heterogeneous model combinations demonstrate consistent and significant performance improvements for all agents, achieving an average gain of 3.3% over GSPO while requiring only half the rollout cost.

πŸ“ Abstract
We introduce Heterogeneous Agent Collaborative Reinforcement Learning (HACRL), a new learning paradigm that addresses the inefficiencies of isolated on-policy optimization. HACRL enables collaborative optimization with independent execution: heterogeneous agents share verified rollouts during training to mutually improve, while operating independently at inference time. Unlike LLM-based multi-agent reinforcement learning (MARL), HACRL does not require coordinated deployment, and unlike on-/off-policy distillation, it enables bidirectional mutual learning among heterogeneous agents rather than one-directional teacher-to-student transfer. Building on this paradigm, we propose HACPO, a collaborative RL algorithm that enables principled rollout sharing to maximize sample utilization and cross-agent knowledge transfer. To mitigate capability discrepancies and policy distribution shifts, HACPO introduces four tailored mechanisms with theoretical guarantees on unbiased advantage estimation and optimization correctness. Extensive experiments across diverse heterogeneous model combinations and reasoning benchmarks show that HACPO consistently improves all participating agents, outperforming GSPO by an average of 3.3% while using only half the rollout cost.
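The core idea of the abstract, two heterogeneous policies pooling their rollouts while each keeps its policy-gradient estimate unbiased via importance weighting, can be illustrated with a toy sketch. This is not HACPO itself (the paper's four mechanisms are not reproduced here): it is a generic importance-weighted REINFORCE update on a two-action bandit, and the agent names, learning rate, and clipping constant are all our own illustrative choices.

```python
import math
import random

random.seed(0)

def policy_prob(theta, action):
    """Two-action policy: P(action=1) = sigmoid(theta)."""
    p1 = 1.0 / (1.0 + math.exp(-theta))
    return p1 if action == 1 else 1.0 - p1

def sample_action(theta):
    return 1 if random.random() < policy_prob(theta, 1) else 0

def reward(action):
    # Bandit stand-in for a "verified rollout": only action 1 is correct.
    return 1.0 if action == 1 else 0.0

# Two heterogeneous agents with different starting policies.
thetas = {"agent_a": 0.5, "agent_b": -0.5}
lr = 0.1

for step in range(500):
    # Each agent contributes one rollout to a shared pool, tagged with
    # the behavior policy that generated it.
    pool = [(sample_action(t), t) for t in thetas.values()]
    for name, theta in list(thetas.items()):
        grad = 0.0
        for action, behavior_theta in pool:
            # Importance weight corrects for the mismatch between the
            # learner's policy and the behavior policy that produced the
            # rollout, keeping the gradient estimate unbiased.
            w = policy_prob(theta, action) / policy_prob(behavior_theta, action)
            w = min(w, 5.0)  # clipping for variance control (our assumption)
            p1 = policy_prob(theta, 1)
            dlogp = (1.0 - p1) if action == 1 else -p1  # d/dtheta log pi(action)
            grad += w * reward(action) * dlogp
        thetas[name] = theta + lr * grad / len(pool)

# Both policies improve, including the weaker agent_b, which benefits
# from agent_a's shared rollouts.
```

The clipped importance ratio here is only a variance-control heuristic; the paper's contribution is precisely the set of mechanisms that make such cross-agent sharing sound under capability gaps and distribution shift, which this sketch does not capture.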
Problem

Research questions and friction points this paper is trying to address.

Heterogeneous Agents
Collaborative Reinforcement Learning
Sample Efficiency
Policy Optimization
Multi-Agent Reinforcement Learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Heterogeneous Agent
Collaborative Reinforcement Learning
Bidirectional Knowledge Transfer
Rollout Sharing
HACPO