SPoRt -- Safe Policy Ratio: Certified Training and Deployment of Task Policies in Model-Free RL

📅 2025-04-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

For safety-critical, model-free reinforcement learning in an episodic setting, this paper derives an upper bound on the probability that a task-specific policy violates a safety property, with guarantees intended to hold during both training and deployment. The bound is based on the maximum policy ratio of the task policy relative to a safe base policy. The method, SPoRt, has two further components: a data-driven procedure, based on scenario theory, for bounding the violation probability of the base policy; and Projected PPO, a projection-based policy optimization algorithm that maintains a user-specified bound on property violation during training. The bound also extends to temporally-extended properties beyond safety and to robust control problems. Experiments demonstrate a tunable trade-off between safety guarantees and task performance, and compare the theoretical bound with posterior bounds based on empirical violation rates.

📝 Abstract

To apply reinforcement learning to safety-critical applications, we ought to provide safety guarantees during both policy training and deployment. In this work we present novel theoretical results that provide a bound on the probability of violating a safety property for a new task-specific policy in a model-free, episodic setup: the bound, based on a 'maximum policy ratio' that is computed with respect to a 'safe' base policy, can also be more generally applied to temporally-extended properties (beyond safety) and to robust control problems. We thus present SPoRt, which also provides a data-driven approach for obtaining such a bound for the base policy, based on scenario theory, and which includes Projected PPO, a new projection-based approach for training the task-specific policy while maintaining a user-specified bound on property violation. Hence, SPoRt enables the user to trade off safety guarantees in exchange for task-specific performance. Accordingly, we present experimental results demonstrating this trade-off, as well as a comparison of the theoretical bound to posterior bounds based on empirical violation rates.
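A rough illustration of why a policy ratio can give such a bound: suppose both policies act in the same environment over an episode of T steps, and the task policy satisfies task(a|s) <= rho * base(a|s) for every state and action. Then any trajectory is at most rho^T times more likely under the task policy, so the violation probability is at most rho^T times that of the base policy. The sketch below encodes this elementary argument for tabular policies. It is an assumption-laden illustration, not necessarily the exact bound stated in the paper, and all function names and the dictionary representation are hypothetical.

```python
import math


def max_policy_ratio(task_probs, base_probs):
    """Largest ratio task(a|s) / base(a|s) over all listed states and actions.

    Both arguments map a state to a list of action probabilities
    (a hypothetical tabular representation, for illustration only).
    """
    worst = 0.0
    for state, p_task in task_probs.items():
        p_base = base_probs[state]
        for pt, pb in zip(p_task, p_base):
            if pt == 0.0:
                continue  # an action the task policy never takes cannot raise the ratio
            if pb == 0.0:
                return math.inf  # task policy takes an action the base policy never takes
            worst = max(worst, pt / pb)
    return worst


def task_violation_bound(base_bound, ratio, horizon):
    """Bound ratio**horizon * base_bound on the task violation probability, capped at 1."""
    return min(1.0, (ratio ** horizon) * base_bound)


def allowed_ratio(base_bound, target_bound, horizon):
    """Largest per-step ratio for which the bound stays at or below target_bound."""
    return (target_bound / base_bound) ** (1.0 / horizon)


# Illustrative numbers only; none of them come from the paper.
base = {"s0": [0.5, 0.5], "s1": [0.8, 0.2]}
task = {"s0": [0.55, 0.45], "s1": [0.78, 0.22]}
rho = max_policy_ratio(task, base)
bound = task_violation_bound(base_bound=0.01, ratio=rho, horizon=10)
```

Read in the other direction, `allowed_ratio` shows the trade-off the abstract mentions: a looser target bound permits a larger per-step ratio, which gives the task policy more freedom to depart from the base policy.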

Problem

Research questions and friction points this paper is trying to address.

Ensures safety guarantees during RL policy training and deployment

Bounds probability of violating safety for task-specific policies

Balances safety guarantees with task performance trade-offs
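Any bound for the task policy needs a starting bound on how often the base policy violates the property, estimated from sampled episodes. The paper obtains this from scenario theory, which is not reproduced here. As a stand-in, the sketch below uses the classical one-sided binomial (Clopper-Pearson) upper bound: given `n_episodes` independent episodes with `n_violations` violations, it returns a value that upper-bounds the true violation probability with confidence at least 1 - beta. The function names and parameters are hypothetical.

```python
from math import comb


def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def violation_upper_bound(n_episodes, n_violations, beta, tol=1e-10):
    """One-sided Clopper-Pearson upper bound on the violation probability.

    Finds eps with P(X <= n_violations | eps) = beta by bisection; the
    binomial CDF is decreasing in eps, so the root is unique.
    This is a stand-in for the paper's scenario-theory bound, not a copy of it.
    """
    if n_violations >= n_episodes:
        return 1.0  # every episode violated: no non-trivial bound
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if binomial_cdf(n_violations, n_episodes, mid) > beta:
            lo = mid  # observed outcome still too plausible: eps must be larger
        else:
            hi = mid
    return hi


# Illustrative numbers only: 1000 base-policy episodes, none violating.
eps_base = violation_upper_bound(n_episodes=1000, n_violations=0, beta=1e-3)
```

With zero observed violations the bound has the closed form 1 - beta^(1/N), which makes the data requirement visible: the bound shrinks roughly in proportion to 1/N.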

Innovation

Methods, ideas, or system contributions that make the work stand out.

Maximum policy ratio for safety bounds

Scenario theory for data-driven bounds

Projected PPO for safe policy training
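The details of the Projected PPO projection are not given in this summary. To illustrate the general idea of pulling a candidate policy back into a set where the policy ratio stays within a budget, the sketch below uses a much simpler stand-in for a single categorical action distribution: it mixes the task distribution with the base distribution just enough that every probability ratio is at most `max_ratio`. This mixture step is an assumption made for illustration and is not the paper's projection; the function name is hypothetical.

```python
import math


def pull_back_to_ratio(task_probs, base_probs, max_ratio):
    """Mix task_probs toward base_probs until all ratios task/base are <= max_ratio.

    With mixture weight lam on the task distribution, each ratio becomes
    1 + lam * (r - 1), so the largest feasible weight is
    lam = (max_ratio - 1) / (r_max - 1).
    """
    if max_ratio < 1.0:
        raise ValueError("max_ratio must be >= 1; two distributions cannot all have ratios below 1")
    ratios = []
    for pt, pb in zip(task_probs, base_probs):
        if pt == 0.0:
            ratios.append(0.0)
        elif pb == 0.0:
            ratios.append(math.inf)  # action outside the base policy's support
        else:
            ratios.append(pt / pb)
    worst = max(ratios)
    if worst <= max_ratio:
        return list(task_probs)  # already within the ratio budget
    lam = (max_ratio - 1.0) / (worst - 1.0)
    return [(1.0 - lam) * pb + lam * pt for pt, pb in zip(task_probs, base_probs)]


# Illustrative numbers only.
base = [0.5, 0.5]
task = [0.9, 0.1]
safe_task = pull_back_to_ratio(task, base, max_ratio=1.2)
```

The mixture keeps the result a valid probability distribution and moves along a straight line between the two policies, which is easy to reason about but generally more conservative than a true projection onto the feasible set.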

🔎 Similar Papers

Balance Reward and Safety Optimization for Safe Reinforcement Learning: A Perspective of Gradient Manipulation