🤖 AI Summary
This work addresses the challenges of low sample efficiency and poor coordination in multi-agent reinforcement learning, which stem primarily from credit assignment under non-stationary environments and biased individual advantage estimation. To overcome these issues, the paper proposes the Generalized Per-Agent Advantage Estimator (GPAE), which estimates individual advantages indirectly from action probabilities, circumventing explicit modeling of the joint Q-function. GPAE introduces a double-truncated importance sampling ratio mechanism that balances sensitivity to an agent's own policy changes with robustness to other agents' updates, and integrates a per-agent value iteration operator to enable stable and efficient off-policy learning. Experimental results on standard multi-agent benchmarks demonstrate that GPAE significantly outperforms existing methods, with superior collaborative performance and markedly improved sample efficiency.
📝 Abstract
In this paper, we propose a novel framework for multi-agent reinforcement learning that enhances sample efficiency and coordination through accurate per-agent advantage estimation. The core of our approach is the Generalized Per-Agent Advantage Estimator (GPAE), which employs a per-agent value iteration operator to compute precise per-agent advantages. This operator enables stable off-policy learning by indirectly estimating values via action probabilities, eliminating the need for direct Q-function estimation. To further refine estimation, we introduce a double-truncated importance sampling ratio scheme. This scheme improves credit assignment for off-policy trajectories by balancing sensitivity to the agent's own policy changes with robustness to non-stationarity from other agents. Experiments on standard benchmarks demonstrate that our approach outperforms existing methods, excelling in coordination and sample efficiency in complex scenarios.
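The double-truncated ratio idea described above can be sketched in a few lines. The abstract does not give the exact formulation, so the following is only an illustrative guess: the agent's own importance ratio is clipped with looser bounds (preserving sensitivity to its own policy change), while the product of the other agents' ratios is clipped with tighter bounds (damping non-stationarity). The function name and all clip thresholds are hypothetical, not from the paper.

```python
import numpy as np

def double_truncated_ratio(pi_new_own, pi_old_own,
                           pi_new_others, pi_old_others,
                           own_clip=(0.5, 2.0), others_clip=(0.8, 1.25)):
    """Illustrative sketch of a double-truncated importance sampling ratio.

    The agent's own ratio gets looser bounds, keeping the estimator
    sensitive to the agent's own policy updates; the combined ratio of
    the remaining agents gets tighter bounds, limiting the variance
    introduced by their concurrent learning. All bounds are assumptions.
    """
    # Own-policy ratio: pi_new(a_i | s) / pi_old(a_i | s), loosely clipped.
    own_ratio = np.clip(pi_new_own / pi_old_own, *own_clip)
    # Other agents' joint ratio: product of their per-agent ratios,
    # tightly clipped to suppress non-stationarity.
    others_ratio = np.clip(np.prod(np.asarray(pi_new_others) /
                                   np.asarray(pi_old_others)),
                           *others_clip)
    return own_ratio * others_ratio
```

Under this sketch, a large shift in the agent's own policy still moves the ratio substantially (up to the own-policy bound), whereas an equally large shift in teammates' policies changes it by at most the tighter bound, which is one plausible reading of "balancing sensitivity and robustness."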