Robust Probabilistic Shielding for Safe Offline Reinforcement Learning

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of simultaneously improving policy performance and ensuring safety in offline reinforcement learning with limited data. It introduces, for the first time, a probabilistic safety shielding mechanism into Safe Policy Improvement (SPI), pruning the action space during policy optimization using only the static dataset and knowledge of which states are safe and unsafe. The approach guarantees, with high probability, that the learned policy both outperforms a baseline and adheres to the prescribed safety constraints. By integrating model-based safety verification with offline policy improvement, the method improves robustness and reliability in low-data regimes. Empirical results show that shielded SPI outperforms its unshielded counterpart in both average and worst-case performance, with the largest advantages under severe data scarcity.

📝 Abstract

In offline reinforcement learning (RL), we learn policies from fixed datasets without environment interaction. The major challenges are to provide guarantees on the (1) performance and (2) safety of the resulting policy. A technique called safe policy improvement (SPI) provides a performance guarantee: with high probability, the new policy outperforms a given baseline policy, which is assumed to be safe. Orthogonally, in the context of safe RL, a shield provides a safety guarantee by restricting the action space to those actions that are provably safe with respect to a given safety-relevant model. We integrate these paradigms by extending shielding to offline RL, relying solely on the available dataset and knowledge of safe and unsafe states. Then, we shield the policy improvement steps, guaranteeing, with high probability, a safe policy. Experimental results demonstrate that shielded SPI outperforms its unshielded counterpart, improving both average and worst-case performance, particularly in low-data regimes.
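
To make the idea of shielding from a fixed dataset more concrete, here is a minimal Python sketch. It is not the paper's algorithm. It assumes a small tabular setting, a dataset of (state, action, next_state) transitions, and a one-step notion of risk (the chance of stepping directly into a known unsafe state), bounded pessimistically with a Hoeffding margin. The names `build_shield` and `shielded_greedy`, the parameters `threshold` and `delta`, and the fallback for states where no action passes the test are all illustrative assumptions.

```python
import math
from collections import Counter


def build_shield(dataset, unsafe_states, threshold=0.2, delta=0.05):
    """Return {state: set of allowed actions} from (s, a, s_next) transitions.

    Assumption: risk is the one-step probability of entering an unsafe state,
    estimated from counts and inflated by a Hoeffding confidence margin.
    """
    visits = Counter()  # n(s, a): how often the pair appears in the dataset
    bad = Counter()     # how often (s, a) led directly into an unsafe state
    for s, a, s_next in dataset:
        visits[(s, a)] += 1
        if s_next in unsafe_states:
            bad[(s, a)] += 1

    # Pessimistic risk: empirical frequency plus margin, capped at 1.
    risk = {}
    for (s, a), n in visits.items():
        margin = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
        risk.setdefault(s, {})[a] = min(1.0, bad[(s, a)] / n + margin)

    shield = {}
    for s, per_action in risk.items():
        allowed = {a for a, r in per_action.items() if r <= threshold}
        if not allowed:
            # Illustrative fallback: keep the least risky observed action
            # so the policy is never left without a choice.
            allowed = {min(per_action, key=per_action.get)}
        shield[s] = allowed
    return shield


def shielded_greedy(q_values, shield, state):
    """Pick the highest-valued action among those the shield allows."""
    return max(shield[state], key=lambda a: q_values[(state, a)])


# Toy data: "left" always lands in safe state 1,
# "right" lands in unsafe state 2 half of the time.
dataset = [(0, "left", 1)] * 50 + [(0, "right", 2)] * 25 + [(0, "right", 1)] * 25
shield = build_shield(dataset, unsafe_states={2})
q_values = {(0, "left"): 1.0, (0, "right"): 5.0}
choice = shielded_greedy(q_values, shield, 0)
```

In this toy example the unshielded greedy choice would be "right" because of its higher value estimate, while the shield removes it because its pessimistic risk exceeds the threshold. With few samples the Hoeffding margin grows, so the shield becomes more conservative exactly when data is scarce.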

Problem

Research questions and friction points this paper is trying to address.

offline reinforcement learning

safety guarantee

performance guarantee

safe policy improvement

shielding

Innovation

Methods, ideas, or system contributions that make the work stand out.

offline reinforcement learning

probabilistic shielding

safe policy improvement