🤖 AI Summary
For safety-critical, model-free reinforcement learning in an episodic setting, this paper establishes what it describes as the first theoretically certified upper bound on the probability of violating a safety property across both training and deployment. Methodologically, it derives the guarantee from the maximum policy ratio of the task-specific policy relative to a safe base policy; uses scenario theory to estimate the base policy's safety bound from data; and proposes Projected PPO, a policy optimization algorithm that applies a safety-constrained projection. Contributions include: (1) a certified bound on safety violation probability that jointly covers the training and deployment phases; (2) support for verifying temporally-extended properties, not only safety; and (3) empirical validation across multiple continuous-control benchmarks, demonstrating a tunable safety–performance trade-off, tight alignment between theoretical bounds and empirical violation rates, and significantly improved safety assurance over state-of-the-art baselines.
📝 Abstract
To apply reinforcement learning to safety-critical applications, we ought to provide safety guarantees during both policy training and deployment. In this work we present novel theoretical results that bound the probability of violating a safety property for a new task-specific policy in a model-free, episodic setup. The bound is based on a 'maximum policy ratio' computed with respect to a 'safe' base policy, and it can be applied more generally to temporally-extended properties (beyond safety) and to robust control problems. We thus present SPoRt, which provides a data-driven approach, based on scenario theory, for obtaining such a bound for the base policy, and which includes Projected PPO, a new projection-based approach for training the task-specific policy while maintaining a user-specified bound on property violation. Hence, SPoRt enables the user to trade off safety guarantees against task-specific performance. Accordingly, we present experimental results demonstrating this trade-off, as well as a comparison of the theoretical bound to posterior bounds based on empirical violation rates.
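To make the trade-off concrete, here is a minimal Python sketch of how a policy-ratio bound of this kind can behave. It is an illustration under stated assumptions, not necessarily the exact bound derived in the paper. Assume the task-specific policy's action probabilities are at most `alpha` times those of the base policy in every state, the two policies share the same dynamics, and episodes last `horizon` steps. Then every trajectory is at most `alpha ** horizon` times as likely under the task-specific policy, so its violation probability is at most `alpha ** horizon` times the base policy's violation bound `eps_base`. The function names and parameters below are hypothetical.

```python
import math


def violation_bound(eps_base: float, alpha: float, horizon: int) -> float:
    """Illustrative bound: min(1, alpha**horizon * eps_base).

    eps_base: assumed bound on the base policy's violation probability.
    alpha:    assumed per-step maximum policy ratio (task policy / base policy).
    horizon:  number of steps per episode.
    """
    if not 0.0 <= eps_base <= 1.0:
        raise ValueError("eps_base must be a probability")
    if alpha < 1.0:
        # The supremum of a ratio of two probability distributions is >= 1.
        raise ValueError("alpha must be at least 1")
    if horizon < 1:
        raise ValueError("horizon must be a positive integer")
    if eps_base == 0.0:
        return 0.0
    # Work in log space so large horizons do not overflow.
    log_bound = horizon * math.log(alpha) + math.log(eps_base)
    return 1.0 if log_bound >= 0.0 else math.exp(log_bound)


def max_ratio_for_target(eps_base: float, eps_target: float, horizon: int) -> float:
    """Largest per-step ratio alpha that keeps the illustrative bound <= eps_target."""
    if not 0.0 < eps_base <= eps_target <= 1.0:
        raise ValueError("require 0 < eps_base <= eps_target <= 1")
    if horizon < 1:
        raise ValueError("horizon must be a positive integer")
    return (eps_target / eps_base) ** (1.0 / horizon)


if __name__ == "__main__":
    # Hypothetical numbers: base policy violates with probability <= 1%,
    # the user tolerates 5%, and episodes last 100 steps.
    alpha = max_ratio_for_target(eps_base=0.01, eps_target=0.05, horizon=100)
    print(f"allowed per-step policy ratio: {alpha:.5f}")
    print(f"resulting bound: {violation_bound(0.01, alpha, 100):.4f}")
```

Under these assumptions, loosening the user-specified target `eps_target` raises the allowed ratio, which gives the task-specific policy more room to deviate from the base policy in pursuit of reward. A projection step such as the one in Projected PPO could, in this picture, be read as keeping the trained policy within that allowed ratio.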