🤖 AI Summary
Multi-agent reinforcement learning (MARL) lacks formal safety guarantees, hindering its deployment in safety-critical applications. Method: This paper introduces Shielded MARL (SMARL), the first framework extending single-agent probabilistic logic shields (PLS) to decentralized multi-agent settings. It integrates probabilistic logic modeling, temporal-difference (TD) learning, policy gradient optimization, and game-theoretic evaluation. Key technical components include (1) a Probabilistic Logic Temporal Difference (PLTD) update rule for constraint-aware value learning, and (2) a probabilistic logic policy gradient algorithm with formal safety guarantees, ensuring that the learned policies respect the given probabilistic constraints. Results: Evaluated on symmetric and asymmetrically shielded $n$-player game-theoretic benchmarks, SMARL significantly reduces constraint violation rates, enhances cooperative stability, and improves selection of safe equilibria, positioning it as a general-purpose MARL enhancement that combines formal safety with practical effectiveness.
📝 Abstract
Safe reinforcement learning (RL) is crucial for real-world applications, and multi-agent interactions introduce additional safety challenges. While Probabilistic Logic Shields (PLS) are a powerful approach to enforcing safety in single-agent RL, their generalizability to multi-agent settings remains unexplored. In this paper, we address this gap by conducting extensive analyses of PLS within decentralized, multi-agent environments, and in doing so, propose Shielded Multi-Agent Reinforcement Learning (SMARL) as a general framework for steering MARL towards norm-compliant outcomes. Our key contributions are: (1) a novel Probabilistic Logic Temporal Difference (PLTD) update for shielded, independent Q-learning, which incorporates probabilistic constraints directly into the value update process; (2) a probabilistic logic policy gradient method for shielded PPO with formal safety guarantees for MARL; and (3) a comprehensive evaluation across symmetric and asymmetrically shielded $n$-player game-theoretic benchmarks, demonstrating fewer constraint violations and significantly better cooperation under normative constraints. These results position SMARL as an effective mechanism for equilibrium selection, paving the way toward safer, socially aligned multi-agent systems.
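To make the PLTD idea concrete, here is a minimal, hypothetical sketch of a shielded, constraint-aware TD update. It assumes the PLS-style setup in which a probabilistic logic program supplies a safety probability `P(safe | s, a)` per action, and the base policy is reweighted by those probabilities before bootstrapping. The function names (`shielded_policy`, `pltd_update`) and the expected-SARSA-style backup are illustrative assumptions, not the paper's exact rule.

```python
import numpy as np

def shielded_policy(q_values, p_safe, temperature=1.0):
    """Reweight a Boltzmann policy by the shield's safety probabilities.

    PLS-style shielding: pi+(a|s) is proportional to pi(a|s) * P(safe|s,a),
    renormalized over actions. This is a sketch, not the paper's exact form.
    """
    base = np.exp(q_values / temperature)
    base /= base.sum()                      # base Boltzmann policy pi(a|s)
    weighted = base * p_safe                # multiply in P(safe|s,a)
    return weighted / weighted.sum()        # renormalize to a distribution

def pltd_update(Q, s, a, r, s_next, p_safe_next, alpha=0.1, gamma=0.95):
    """One illustrative constraint-aware TD update.

    The bootstrap target is the expected value under the *shielded* policy
    at s_next, so actions the shield deems unsafe contribute less.
    """
    pi_next = shielded_policy(Q[s_next], p_safe_next)
    target = r + gamma * np.dot(pi_next, Q[s_next])   # expected-SARSA-style backup
    Q[s][a] += alpha * (target - Q[s][a])
    return Q

# Toy usage: two states, two actions; the shield marks action 1 in state 1 risky.
Q = {0: np.zeros(2), 1: np.array([1.0, 0.0])}
Q = pltd_update(Q, s=0, a=0, r=1.0, s_next=1, p_safe_next=np.array([0.9, 0.1]))
```

In a decentralized setting, each independent learner would run this update on its own Q-table, with the shield applied locally; the game-theoretic evaluation then asks which equilibria the shielded learners select.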