Robust Probabilistic Shielding for Safe Offline Reinforcement Learning

📅 2026-05-11
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the challenge of improving policy performance while ensuring safety in offline reinforcement learning with limited data. It integrates a probabilistic safety shield into Safe Policy Improvement (SPI): during policy improvement, the action space is restricted using only the static dataset and knowledge of which states are safe and unsafe. With high probability, the learned policy both outperforms the baseline policy and is safe. Empirical results show that shielded SPI outperforms its unshielded counterpart in both average and worst-case performance, with the largest gains in low-data regimes.
📝 Abstract
In offline reinforcement learning (RL), we learn policies from fixed datasets without environment interaction. The major challenges are to provide guarantees on the (1) performance and (2) safety of the resulting policy. A technique called safe policy improvement (SPI) provides a performance guarantee: with high probability, the new policy outperforms a given baseline policy, which is assumed to be safe. Orthogonally, in the context of safe RL, a shield provides a safety guarantee by restricting the action space to those actions that are provably safe with respect to a given safety-relevant model. We integrate these paradigms by extending shielding to offline RL, relying solely on the available dataset and knowledge of safe and unsafe states. Then, we shield the policy improvement steps, guaranteeing, with high probability, a safe policy. Experimental results demonstrate that shielded SPI outperforms its unshielded counterpart, improving both average and worst-case performance, particularly in low-data regimes.
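To make the idea of a dataset-derived shield concrete, the following is a minimal sketch and not the paper's algorithm. It assumes a tabular setting, a dataset of (state, action, next_state) triples, and a simplified one-step safety criterion: an action is allowed in a state only if a Hoeffding upper confidence bound on the probability of moving directly into an unsafe state stays below a threshold. The function name `build_shield` and the parameters `delta` and `threshold` are hypothetical, and the confidence level is applied per state-action pair without a union bound.

```python
import math
from collections import defaultdict


def build_shield(dataset, unsafe_states, n_actions, delta=0.1, threshold=0.2):
    """Return {state: set of allowed actions} from a fixed dataset.

    Illustrative one-step shield (an assumption, not the paper's method):
    action a is allowed in state s if, with confidence 1 - delta for that
    pair, the probability of stepping directly into an unsafe state is
    at most `threshold`.
    """
    counts = defaultdict(int)         # visits of each (s, a)
    unsafe_counts = defaultdict(int)  # visits of (s, a) that ended in an unsafe state
    states = set()
    for s, a, s_next in dataset:
        counts[(s, a)] += 1
        if s_next in unsafe_states:
            unsafe_counts[(s, a)] += 1
        states.add(s)

    shield = {}
    for s in states:
        allowed = set()
        for a in range(n_actions):
            n = counts.get((s, a), 0)
            if n == 0:
                continue  # pessimistic choice: never-observed pairs are blocked
            p_hat = unsafe_counts.get((s, a), 0) / n
            # One-sided Hoeffding bound: p <= p_hat + bonus with prob. >= 1 - delta
            bonus = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
            if p_hat + bonus <= threshold:
                allowed.add(a)
        shield[s] = allowed
    return shield


# Toy usage: state 2 is unsafe; action 0 is always safe, action 1 fails half the time.
toy_data = [(0, 0, 1)] * 100 + [(0, 1, 2)] * 50 + [(0, 1, 1)] * 50
print(build_shield(toy_data, unsafe_states={2}, n_actions=3)[0])  # prints {0}
```

With few samples the confidence bonus is large, so this sketch blocks most actions. That is the conservative behavior one would expect from a high-probability guarantee in low-data regimes. A policy improvement step would then optimize only over the allowed actions in each state.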
Problem

Research questions and friction points this paper is trying to address.

offline reinforcement learning
safety guarantee
performance guarantee
safe policy improvement
shielding
Innovation

Methods, ideas, or system contributions that make the work stand out.

offline reinforcement learning
probabilistic shielding
safe policy improvement
safety guarantees
performance guarantee