🤖 AI Summary
This paper studies reinforcement learning with instantaneous hard safety constraints, aiming for zero constraint violation and efficient policy learning when the decision space is star-convex or even non-star-convex—critical for high-stakes applications like autonomous driving. For star-convex constraints, we propose Objective Constraint-Decomposition (OCD), which overcomes a covering-number bottleneck in the sample-complexity analysis. For general non-star-convex constraints, we introduce a two-phase algorithm, Non-Convex Safe Least Squares Value Iteration (NCS-LSVI), achieving zero safety violation with high probability and a regret bound of $\tilde{\mathcal{O}}\big((1+1/\tau)\sqrt{\log(1/\tau)\,d^3 H^4 K}\big)$. Our approach builds on linear MDPs and integrates safe policy warm-starting, least-squares value iteration, and covering-number analysis of the value-function class—ensuring both model-free operation and theoretical rigor. Empirical evaluation on autonomous-driving simulations validates both the safety guarantees and learning efficiency.
📝 Abstract
In Reinforcement Learning (RL), tasks with instantaneous hard constraints present significant challenges, particularly when the decision space is non-convex or non-star-convex. This issue is especially relevant in domains like autonomous vehicles and robotics, where constraints such as collision avoidance often take a non-convex form. In this paper, we establish a regret bound of $\tilde{\mathcal{O}}\bigl(\bigl(1 + \frac{1}{\tau}\bigr) \sqrt{\log(\frac{1}{\tau})\, d^3 H^4 K}\bigr)$, applicable to both star-convex and non-star-convex cases, where $d$ is the feature dimension, $H$ the episode length, $K$ the number of episodes, and $\tau$ the safety threshold. Moreover, the violation of safety constraints is zero with high probability throughout the learning process. A key technical challenge in these settings is bounding the covering number of the value-function class, which is essential for achieving value-aware uniform concentration in model-free function approximation. For the star-convex setting, we develop a novel technique called Objective Constraint-Decomposition (OCD) to properly bound the covering number. This result also resolves an error in a previous work on constrained RL. In non-star-convex scenarios, where the covering number can become infinitely large, we propose a two-phase algorithm, Non-Convex Safe Least Squares Value Iteration (NCS-LSVI), which first reduces uncertainty about the safe set by playing a known safe policy. After that, it carefully balances exploration and exploitation to achieve the regret bound. Finally, numerical simulations on an autonomous driving scenario demonstrate the effectiveness of NCS-LSVI.
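To make the two-phase idea concrete, here is a minimal toy sketch in NumPy. It is not the paper's NCS-LSVI (which runs least-squares value iteration over episodes of a linear MDP); instead it illustrates the same two-phase structure in a one-step linear setting: phase 1 plays a known safe action to shrink uncertainty about an unknown linear safety parameter, and phase 2 acts only on actions whose *pessimistic* safety estimate stays below the threshold $\tau$, falling back to the known safe action when nothing is certified. All names (`ncs_lsvi_sketch`, the bonus scale `beta`, etc.) are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def ncs_lsvi_sketch(features, rewards, theta_safety, tau,
                    K1=50, K2=100, lam=1.0, beta=1.0, noise=0.01, seed=0):
    """Toy two-phase sketch: warm-start with a known safe action, then
    act greedily among pessimistically-certified safe actions."""
    rng = np.random.default_rng(seed)
    d = features.shape[1]
    A = lam * np.eye(d)   # regularized Gram matrix for ridge regression
    b = np.zeros(d)
    safe_idx = 0          # action 0 is assumed to be the known safe policy

    def observe(a):
        # Play action a, observe a noisy safety cost, update the estimator.
        nonlocal A, b
        phi = features[a]
        cost = theta_safety @ phi + noise * rng.normal()
        A += np.outer(phi, phi)
        b += phi * cost

    # Phase 1: repeatedly play the known safe action to reduce
    # uncertainty about the safety parameter along its direction.
    for _ in range(K1):
        observe(safe_idx)

    chosen = []
    # Phase 2: among actions whose pessimistic (upper-bound) safety
    # estimate is below tau, pick the one with the highest reward.
    for _ in range(K2):
        theta_hat = np.linalg.solve(A, b)
        A_inv = np.linalg.inv(A)
        bonus = np.sqrt(np.einsum('ij,jk,ik->i', features, A_inv, features))
        pessimistic_cost = features @ theta_hat + beta * bonus
        certified = np.flatnonzero(pessimistic_cost <= tau)
        a = safe_idx if certified.size == 0 else certified[np.argmax(rewards[certified])]
        observe(a)
        chosen.append(a)
    return chosen
```

The pessimism bonus $\beta\sqrt{\phi^\top A^{-1}\phi}$ is what keeps violations at zero with high probability: an action is played only once the data collected so far certifies it safe, which is why the warm-start phase is needed before any exploration can begin.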