AI Summary
This work addresses the challenge of enabling an agent in stochastic environments to satisfy a reach-avoid specification with probability at least \( p \) while minimizing expected cumulative cost, a dual objective that existing safe and constrained reinforcement learning methods struggle to balance. To this end, the paper introduces reach-avoid probability certificates (RAPCs), which identify the states from which the specification can be satisfied with the required probability. Building on RAPCs, the authors develop a contraction-based Bellman formulation that serves as a surrogate for incorporating the probabilistic constraint into reinforcement learning. The approach jointly targets probabilistic satisfaction of the reach-avoid specification and cost optimality, and the authors prove that the proposed algorithms converge almost surely to locally optimal policies with respect to the resulting objective. Experiments in the MuJoCo simulator show improved cost performance and consistently higher reach-avoid satisfaction rates than prior methods.
Abstract
We study stochastic minimum-cost reach-avoid reinforcement learning, where an agent must satisfy a reach-avoid specification with probability at least $p$ while minimizing expected cumulative costs in stochastic environments. Existing safe and constrained reinforcement learning methods typically fail to jointly enforce probabilistic reach-avoid constraints and optimize cost when learning in stochastic environments. To address this challenge, we introduce reach-avoid probability certificates (RAPCs), which identify states from which stochastic reach-avoid constraints are satisfiable. Building on RAPCs, we develop a contraction-based Bellman formulation that serves as a principled surrogate for integrating reach-avoid considerations into reinforcement learning, enabling cost optimization under probabilistic constraints. We establish almost sure convergence of the proposed algorithms to locally optimal policies with respect to the resulting objective. Experiments in the MuJoCo simulator demonstrate improved cost performance and consistently higher reach-avoid satisfaction rates.
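To make concrete what "states from which stochastic reach-avoid constraints are satisfiable" means, the sketch below computes the maximum reach-avoid probability on a toy five-state chain and lists the states where it is at least $p$. This is not the paper's RAPC construction or its Bellman formulation: it is plain tabular value iteration with known dynamics, and the chain, the `slip` parameter, and the function names are hypothetical choices made only for illustration.

```python
def max_reach_avoid_prob(n_states, target, unsafe, slip=0.2,
                         tol=1e-12, max_iters=10_000):
    """Maximum probability of reaching `target` before entering `unsafe`.

    Toy model (an assumption, not from the paper): states lie on a line,
    the agent picks "left" or "right", and with probability `slip` it
    moves in the opposite direction. Target and unsafe states are absorbing.
    """
    # Reach-avoid value is 1 on the target, 0 on unsafe states, and starts
    # at 0 elsewhere so the iteration converges to the least fixed point.
    v = [1.0 if s in target else 0.0 for s in range(n_states)]
    for _ in range(max_iters):
        new_v = list(v)
        for s in range(n_states):
            if s in target or s in unsafe:
                continue  # absorbing states keep their fixed value
            left = v[max(s - 1, 0)]
            right = v[min(s + 1, n_states - 1)]
            go_right = (1 - slip) * right + slip * left
            go_left = (1 - slip) * left + slip * right
            new_v[s] = max(go_right, go_left)  # best action at state s
        delta = max(abs(a - b) for a, b in zip(new_v, v))
        v = new_v
        if delta < tol:
            break
    return v


def satisfiable_states(values, p):
    """States whose maximum reach-avoid probability is at least p."""
    return [s for s, prob in enumerate(values) if prob >= p]


# State 0 is unsafe, state 4 is the target.
values = max_reach_avoid_prob(n_states=5, target={4}, unsafe={0})
print(satisfiable_states(values, p=0.9))  # prints [2, 3, 4]
```

In this toy chain, state 1 is excluded because even the best policy reaches the target before the unsafe state with probability of only about 0.75. In the learning setting the abstract describes, the dynamics are unknown and such a set cannot be computed exactly this way, which is the role the abstract assigns to RAPCs.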