🤖 AI Summary
In safety-critical multi-agent reinforcement learning (MARL), insufficient exploration under constraints often leads to coordination failure. To address this, we propose E2C (entropic exploration for constrained multiagent reinforcement learning), a framework that models team-level safety constraints jointly and explicitly encourages exploration by maximizing observation entropy as a regularizer, optimized concurrently with the safety objectives. E2C builds on the centralized training with decentralized execution (CTDE) paradigm and extends PPO and SAC to constrained multi-agent policy optimization. Theoretically, we establish convergence guarantees under entropy-regularized constrained optimization. Empirically, E2C matches or surpasses both unconstrained and conventional constrained baselines across multiple complex cooperative benchmarks while reducing unsafe actions by up to 50%. Moreover, E2C significantly improves exploration efficiency under constraints and enhances policy robustness, demonstrating strong generalization in safety-critical MARL settings.
📝 Abstract
Many real-world multiagent learning problems involve safety concerns. In these setups, typical safe reinforcement learning algorithms constrain agents' behavior, limiting exploration -- a crucial component for discovering effective cooperative multiagent behaviors. Moreover, the multiagent literature typically models individual constraints for each agent and has yet to investigate the benefits of using joint team constraints. In this work, we analyze these team constraints from a theoretical and practical perspective and propose entropic exploration for constrained multiagent reinforcement learning (E2C) to address the exploration issue. E2C leverages observation entropy maximization to incentivize exploration and facilitate learning safe and effective cooperative behaviors. Experiments across increasingly complex domains show that E2C agents match or surpass common unconstrained and constrained baselines in task performance while reducing unsafe behaviors by up to $50\%$.
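To make the idea concrete, the combination described above (a policy-gradient objective plus an observation-entropy bonus, with a team cost constraint handled via a Lagrangian penalty) can be sketched numerically. This is a minimal illustration under common constrained-RL assumptions, not the paper's implementation; the function name, the Lagrangian form, and all coefficients are hypothetical.

```python
import numpy as np

def entropy_constrained_loss(log_probs, advantages, team_costs,
                             obs_entropy_estimate, cost_limit,
                             lagrange_multiplier, entropy_coef=0.01):
    """Hypothetical sketch of an entropy-regularized, constrained
    policy-gradient objective in the spirit of E2C.

    Maximizes:  E[log pi * A]  +  beta * H(obs)
                - lambda * (E[team cost] - cost limit)
    and returns the negation, so it can be minimized by gradient descent.
    """
    # Standard policy-gradient surrogate term.
    policy_term = np.mean(log_probs * advantages)
    # Joint team constraint: penalize expected cost above the budget.
    constraint_violation = np.mean(team_costs) - cost_limit
    # Observation-entropy bonus encourages exploration under the constraint.
    objective = (policy_term
                 + entropy_coef * obs_entropy_estimate
                 - lagrange_multiplier * constraint_violation)
    return -objective
```

Note that a higher observation-entropy estimate strictly lowers this loss, which is what drives the exploration incentive; in practice the Lagrange multiplier would itself be updated (e.g. by gradient ascent on the violation) rather than held fixed.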