Provably Efficient RL for Linear MDPs under Instantaneous Safety Constraints in Non-Convex Feature Spaces

📅 2025-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper studies reinforcement learning with instantaneous hard safety constraints, aiming for zero constraint violation and efficient policy learning in non-convex or non-star-convex decision spaces—critical for high-stakes applications like autonomous driving. For star-convex constraints, the authors propose Objective Constraint-Decomposition (OCD), which overcomes covering-number-based sample-complexity bottlenecks. For general non-star-convex constraints, they introduce the two-stage NCS-LSVI algorithm, achieving both zero safety violation with high probability and a regret bound of $\tilde{\mathcal{O}}\big((1+1/\tau)\sqrt{\log(1/\tau)\,d^3 H^4 K}\big)$. The approach builds on linear MDPs and combines safe policy warm-starting, least-squares value iteration, and a covering-number analysis of the value-function class. Empirical evaluation on autonomous-driving simulations validates both the safety guarantees and learning efficacy.

📝 Abstract
In Reinforcement Learning (RL), tasks with instantaneous hard constraints present significant challenges, particularly when the decision space is non-convex or non-star-convex. This issue is especially relevant in domains like autonomous vehicles and robotics, where constraints such as collision avoidance often take a non-convex form. In this paper, we establish a regret bound of $\tilde{\mathcal{O}}\bigl(\bigl(1 + \frac{1}{\tau}\bigr) \sqrt{\log(\frac{1}{\tau})\, d^3 H^4 K}\bigr)$, applicable to both star-convex and non-star-convex cases, where $d$ is the feature dimension, $H$ the episode length, $K$ the number of episodes, and $\tau$ the safety threshold. Moreover, the violation of safety constraints is zero with high probability throughout the learning process. A key technical challenge in these settings is bounding the covering number of the value-function class, which is essential for achieving value-aware uniform concentration in model-free function approximation. For the star-convex setting, we develop a novel technique called Objective Constraint-Decomposition (OCD) to properly bound the covering number. This result also resolves an error in a previous work on constrained RL. In non-star-convex scenarios, where the covering number can become infinitely large, we propose a two-phase algorithm, Non-Convex Safe Least Squares Value Iteration (NCS-LSVI), which first reduces uncertainty about the safe set by playing a known safe policy. After that, it carefully balances exploration and exploitation to achieve the regret bound. Finally, numerical simulations on an autonomous driving scenario demonstrate the effectiveness of NCS-LSVI.
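The two-phase structure described in the abstract can be illustrated with a heavily simplified sketch. This is not the paper's algorithm: it collapses the linear MDP to a toy linear-cost action-selection loop, and the function name `ncs_lsvi_sketch`, the phase lengths, and all parameters are illustrative assumptions. It only shows the core idea: play a known safe action until the safety estimate tightens, then act greedily among actions whose pessimistic cost estimate still clears the threshold $\tau$.

```python
import numpy as np

def ncs_lsvi_sketch(K=50, H=5, d=3, tau=0.5, K0=10, seed=0):
    """Toy two-phase sketch (illustrative, not the paper's NCS-LSVI).

    Phase 1 (episodes < K0): play a known safe action to shrink
    uncertainty about the unknown cost parameter.
    Phase 2: pick the highest-reward action among those whose
    pessimistic (upper-bound) cost estimate is still <= tau.
    """
    rng = np.random.default_rng(seed)
    theta_cost = rng.normal(size=d)               # unknown safety-cost parameter
    theta_cost /= np.linalg.norm(theta_cost)
    theta_rew = rng.normal(size=d)                # reward parameter (assumed known here)
    actions = rng.normal(size=(20, d))            # finite action features, unit norm
    actions /= np.linalg.norm(actions, axis=1, keepdims=True)

    safe_action = min(actions, key=lambda a: abs(a @ theta_cost))  # known safe policy
    Lambda = np.eye(d)                            # regularized Gram matrix
    b = np.zeros(d)                               # feature-weighted observed costs
    total_reward = 0.0

    for k in range(K):
        for h in range(H):
            if k < K0:
                phi = safe_action                 # Phase 1: pure safe play
            else:
                theta_hat = np.linalg.solve(Lambda, b)     # least-squares cost estimate
                Li = np.linalg.inv(Lambda)
                # Phase 2: keep actions certified safe under a pessimistic bonus
                certified = [a for a in actions
                             if abs(a @ theta_hat) + np.sqrt(a @ Li @ a) <= tau]
                pool = certified if certified else [safe_action]
                phi = max(pool, key=lambda a: a @ theta_rew)
            cost = phi @ theta_cost + 0.01 * rng.normal()  # noisy cost feedback
            Lambda += np.outer(phi, phi)
            b += phi * cost
            total_reward += phi @ theta_rew
    return total_reward
```

The pessimistic bonus `sqrt(a @ Li @ a)` plays the role of the confidence width in the paper's analysis: early on it is large, so only actions near the warm-start policy are certified, and it shrinks as the Gram matrix accumulates data.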
Problem

Research questions and friction points this paper is trying to address.

Efficient RL in non-convex feature spaces
Safety constraints in autonomous systems
Regret bound with zero constraint violation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Objective Constraint-Decomposition for star-convex
Non-Convex Safe Least Squares Value Iteration
Zero safety constraint violation with high probability
Amirhossein Roknilamouki
Department of Electrical and Computer Engineering, The Ohio State University
Arnob Ghosh
Assistant Professor of ECE at New Jersey Institute of Technology
Reinforcement Learning, Game Theory, Intelligent Transportation Systems, Computer Networks
Ming Shi
Assistant Professor, The State University of New York at Buffalo
Learning Theory, Online Optimization, Networking, Security
Fatemeh Nourzad
Department of Electrical and Computer Engineering, The Ohio State University
Eylem Ekici
Professor of Electrical and Computer Engineering, The Ohio State University
Wireless Networks, mmWave, V2X, Dynamic Spectrum Access
Ness B. Shroff
Department of Computer Science and Engineering, The Ohio State University