🤖 AI Summary
To address uncoordinated policy updates and training instability in heterogeneous multi-agent reinforcement learning (MARL), this paper proposes a joint constrained optimization framework that dynamically allocates per-agent KL-divergence trust-region thresholds. Unlike HATRPO, which imposes a uniform KL constraint across all agents, we introduce two adaptive threshold allocation mechanisms: (1) HATRPO-W, an analytically derived solution grounded in the Karush–Kuhn–Tucker (KKT) optimality conditions; and (2) HATRPO-G, a greedy scheduling algorithm guided by an improvement-to-divergence ratio criterion. Both methods enforce coordinated policy updates under a global KL budget. Empirical results demonstrate that HATRPO-W and HATRPO-G outperform the HATRPO baseline by over 22.5% in average task performance. Notably, HATRPO-W achieves faster convergence and significantly improved training stability. These findings underscore the critical role of dynamic, agent-specific trust-region adaptation in improving both the efficiency and the robustness of heterogeneous MARL training.
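To make the KKT-based idea concrete, here is a minimal illustrative sketch of how a closed-form threshold allocation can fall out of the KKT conditions under a global KL budget. The concave per-agent gain model `g_i(eps) = a_i * sqrt(eps)` and the function name are assumptions chosen so the stationarity condition is solvable in closed form; this is not the paper's actual HATRPO-W derivation.

```python
# Hypothetical sketch: allocate a global KL budget B across agents by
# maximizing sum_i a_i * sqrt(eps_i) subject to sum_i eps_i = B.
# (The sqrt gain model is an illustrative assumption, not the paper's.)
# KKT stationarity, a_i / (2 * sqrt(eps_i)) = lambda for all i,
# implies eps_i is proportional to a_i**2.

def kkt_kl_allocation(gain_coeffs, kl_budget):
    """Split kl_budget across agents in proportion to a_i**2.

    gain_coeffs: per-agent coefficients a_i of the assumed gain model.
    Returns a list of per-agent KL thresholds eps_i summing to kl_budget.
    """
    total = sum(a * a for a in gain_coeffs)
    return [kl_budget * a * a / total for a in gain_coeffs]

# Agents with stronger expected gains receive larger trust regions.
eps = kkt_kl_allocation([1.0, 2.0, 1.0], kl_budget=0.06)
print(eps)  # thresholds sum to the budget; agent 2 gets the largest share
```

The design point this sketch illustrates: once the per-agent gains are modeled, the multiplier `lambda` couples all agents through the shared budget, so each threshold is set jointly rather than fixed uniformly.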
📝 Abstract
Multi-agent reinforcement learning (MARL) requires coordinated and stable policy updates among interacting agents. Heterogeneous-Agent Trust Region Policy Optimization (HATRPO) enforces per-agent trust-region constraints using Kullback–Leibler (KL) divergence to stabilize training. However, assigning each agent the same KL threshold can lead to slow and locally optimal updates, especially in heterogeneous settings. To address this limitation, we propose two approaches for allocating the KL-divergence threshold across agents: HATRPO-W, a Karush–Kuhn–Tucker-based (KKT-based) method that optimizes threshold assignment under a global KL constraint, and HATRPO-G, a greedy algorithm that prioritizes agents by their improvement-to-divergence ratios. By connecting sequential policy optimization with constrained threshold scheduling, our approach enables more flexible and effective learning in heterogeneous-agent settings. Experimental results demonstrate that our methods significantly boost the performance of HATRPO, achieving faster convergence and higher final rewards across diverse MARL benchmarks. Specifically, HATRPO-W and HATRPO-G achieve comparable improvements in final performance, each exceeding 22.5%. Notably, HATRPO-W also demonstrates more stable learning dynamics, as reflected by its lower variance.
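The greedy scheduling idea can be sketched as follows. This is a minimal illustration of serving agents in decreasing improvement-to-divergence order until a global KL budget is spent; the function name, the tuple layout, and the exact tie-handling are assumptions for illustration, not the paper's HATRPO-G implementation.

```python
# Hypothetical sketch of greedy KL-threshold scheduling under a global
# budget, in the spirit of HATRPO-G as described in the abstract.

def greedy_kl_allocation(agents, kl_budget):
    """Grant per-agent KL thresholds from a shared budget.

    agents: list of (name, expected_improvement, kl_cost) tuples, where
    kl_cost is the KL divergence an agent's candidate update would incur.
    Agents are ranked by improvement-to-divergence ratio; each is granted
    as much of its requested KL as the remaining budget allows.
    """
    ranked = sorted(agents, key=lambda a: a[1] / a[2], reverse=True)
    thresholds, remaining = {}, kl_budget
    for name, _gain, kl_cost in ranked:
        granted = min(kl_cost, remaining)  # cap by what is left
        thresholds[name] = granted
        remaining -= granted
    return thresholds

# Example: three heterogeneous agents sharing a KL budget of 0.02.
agents = [("a1", 0.10, 0.010), ("a2", 0.05, 0.002), ("a3", 0.08, 0.016)]
print(greedy_kl_allocation(agents, kl_budget=0.02))
```

In this toy example agent `a2` has the best ratio and is served first, while the lowest-ratio agent `a3` only receives whatever budget remains, which is the prioritization behavior the ratio criterion is meant to produce.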