Provably Efficient RL under Episode-Wise Safety in Linear CMDPs

📅 2025-02-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper studies safe reinforcement learning in linear constrained Markov decision processes (linear CMDPs), aiming to maximize cumulative reward while strictly satisfying a total utility constraint within each episode. Addressing the lack of episode-wise zero-constraint-violation guarantees under function approximation in prior work, the authors propose the first online algorithm for this setting: a two-layer confidence set construction grounded in optimistic policy optimization, integrating linear function approximation, Lagrangian dual updates, and a safety-aware exploration mechanism. The algorithm achieves strict episode-wise zero constraint violation with polynomial time complexity and attains the optimal cumulative reward regret bound of $\widetilde{O}(\sqrt{K})$. Crucially, its computational cost is independent of the state space size, ensuring both theoretical optimality and practical scalability.
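The primal-dual structure described above (optimistic linear value estimates combined with Lagrangian dual updates on a safety multiplier) can be sketched as below. This is a hedged illustration of the generic LSVI-UCB-plus-dual-ascent pattern, not the paper's actual two-layer confidence set construction; the function names, bonus form, and step sizes are assumptions for illustration.

```python
import numpy as np

def optimistic_q(phi, w, Lambda_inv, beta):
    """LSVI-UCB-style optimistic Q-value: the linear estimate phi @ w
    plus an exploration bonus beta * sqrt(phi^T Lambda^{-1} phi),
    where Lambda is the regularized feature covariance matrix."""
    bonus = beta * np.sqrt(phi @ Lambda_inv @ phi)
    return phi @ w + bonus

def dual_ascent_step(lam, estimated_utility, threshold, eta, lam_max):
    """Projected gradient step on the Lagrange multiplier: lam grows
    when the estimated episode utility falls below the safety threshold,
    so later policies weight the utility signal more heavily."""
    lam += eta * (threshold - estimated_utility)
    return float(np.clip(lam, 0.0, lam_max))
```

Each episode, a primal-dual algorithm of this flavor would act greedily with respect to the combined optimistic value for reward plus `lam` times utility, then call `dual_ascent_step` with the episode's estimated utility; episode-wise zero violation additionally requires the safety-aware exploration mechanism the summary mentions, which is not captured by this sketch.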

📝 Abstract
We study the reinforcement learning (RL) problem in a constrained Markov decision process (CMDP), where an agent explores the environment to maximize the expected cumulative reward while satisfying a single constraint on the expected total utility value in every episode. While this problem is well understood in the tabular setting, theoretical results for function approximation remain scarce. This paper closes the gap by proposing an RL algorithm for linear CMDPs that achieves $\widetilde{\mathcal{O}}(\sqrt{K})$ regret with an episode-wise zero-violation guarantee. Furthermore, our method is computationally efficient, scaling polynomially with problem-dependent parameters while remaining independent of the state space size. Our results significantly improve upon recent linear CMDP algorithms, which either violate the constraint or incur exponential computational costs.
Problem

Research questions and friction points this paper is trying to address.

Guaranteeing episode-wise zero constraint violation in CMDPs under linear function approximation, where prior results cover only the tabular setting.
Avoiding the trade-off in existing linear CMDP algorithms, which either violate the constraint or incur exponential computational cost.
Keeping computation polynomial in problem-dependent parameters and independent of the state space size.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement Learning in Linear CMDPs
Episodic Safety Constraints
Polynomial Computational Efficiency