🤖 AI Summary
In partially observable multi-agent reinforcement learning (MARL), reliance solely on sparse rewards leads to inefficient and unstable training. To address this, we propose BEPAL (Belief-based Predictive Auxiliary Learning), a novel framework operating under the centralized training with decentralized execution (CTDE) paradigm. BEPAL incorporates multi-task learning, jointly optimizing policies while predicting unobservable latent states—such as teammates’ rewards and behavioral intentions—via explicit belief modeling to enhance hidden-state representation. This auxiliary prediction improves information aggregation efficiency and policy robustness. Empirical evaluation on the Predator-Prey and Google Research Football benchmarks demonstrates that BEPAL achieves an average performance gain of 16% over state-of-the-art baselines, exhibits faster and more stable convergence, and significantly mitigates training instability induced by reward sparsity.
📝 Abstract
The performance of multi-agent reinforcement learning (MARL) in partially observable environments depends on effectively aggregating information from observations, communications, and reward signals. While most existing multi-agent systems rely on rewards as the only feedback for policy training, our research shows that introducing auxiliary predictive tasks can significantly enhance learning efficiency and stability. We propose Belief-based Predictive Auxiliary Learning (BEPAL), a framework that incorporates auxiliary training objectives to support policy optimization. BEPAL follows the centralized training with decentralized execution paradigm. Each agent learns a belief model that predicts unobservable state information, such as other agents' rewards or motion directions, alongside its policy model. By enriching hidden-state representations with information that does not directly contribute to immediate reward maximization, this auxiliary learning process stabilizes MARL training and improves overall performance. We evaluate BEPAL on the Predator-Prey environment and Google Research Football, where it achieves an average improvement of about 16% in performance metrics and demonstrates more stable convergence than baseline methods.
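The core idea of combining a policy objective with an auxiliary belief-prediction loss can be illustrated with a minimal sketch. The function below is a hypothetical, simplified stand-in, not the paper's actual implementation: it combines a policy-gradient surrogate with a mean-squared-error loss on predicted unobservable state (e.g., a teammate's reward or motion direction), weighted by an assumed coefficient `aux_weight`.

```python
import numpy as np

def bepal_style_loss(policy_logits, action, advantage,
                     belief_pred, belief_target, aux_weight=0.5):
    """Hypothetical combined objective: policy loss + weighted belief loss.

    Illustrative only; the real BEPAL objective and network heads are
    defined in the paper, not reproduced here.
    """
    # Policy-gradient surrogate: -log pi(a|s) * advantage
    logits = np.asarray(policy_logits, dtype=float)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    pg_loss = -np.log(probs[action]) * advantage

    # Auxiliary belief loss: MSE between the belief head's prediction
    # of unobservable state and the ground truth available during
    # centralized training (CTDE)
    belief_loss = np.mean((np.asarray(belief_pred, dtype=float)
                           - np.asarray(belief_target, dtype=float)) ** 2)

    return pg_loss + aux_weight * belief_loss

# Example: the belief term adds aux_weight * MSE on top of the policy loss
base = bepal_style_loss([1.0, 2.0], 1, 1.0, [0.0, 0.0], [1.0, 1.0], aux_weight=0.0)
full = bepal_style_loss([1.0, 2.0], 1, 1.0, [0.0, 0.0], [1.0, 1.0], aux_weight=0.5)
```

During centralized training the belief target (e.g., a teammate's true reward) is available to the learner even though it is hidden at execution time, which is what makes this auxiliary supervision compatible with decentralized execution.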