Safety-Aware Reinforcement Learning for Control via Risk-Sensitive Action-Value Iteration and Quantile Regression

📅 2025-06-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the suboptimal policies and safety-constraint violations that overestimation bias causes in approximate value-iteration RL under high-variance stochastic environments, this paper proposes a quantile-based action-value iteration method regularized by Conditional Value-at-Risk (CVaR). It is the first to explicitly incorporate CVaR into the quantile-regression RL framework, enforcing safety constraints with provable guarantees and without increasing neural-network complexity. The authors establish theoretically that the proposed risk-sensitive distributional Bellman operator is a contraction mapping under the Wasserstein metric and admits a unique fixed point. Empirical evaluation on a dynamic obstacle-avoidance-and-reach task shows that, compared to risk-neutral baselines, the method improves task success rate by 23% and reduces collision rate by 41%, markedly improving the safety–performance trade-off.
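The contraction claim in the summary typically takes the following form in distributional-RL notation (a sketch; the paper's exact risk-sensitive operator and constants are not reproduced here):

```latex
\bar{W}_p\!\left(\mathcal{T} Z_1, \mathcal{T} Z_2\right)
  \;\le\; \gamma \, \bar{W}_p\!\left(Z_1, Z_2\right),
\qquad \gamma \in [0, 1),
```

where $\mathcal{T}$ is the risk-sensitive distributional Bellman operator, $Z_1, Z_2$ are cost-to-go distributions, and $\bar{W}_p$ is the maximal $p$-Wasserstein metric over state–action pairs; by the Banach fixed-point theorem, such a contraction admits a unique fixed-point distribution.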

📝 Abstract
Mainstream approximate action-value iteration reinforcement learning (RL) algorithms suffer from overestimation bias, leading to suboptimal policies in high-variance stochastic environments. Quantile-based action-value iteration methods reduce this bias by learning a distribution of the expected cost-to-go using quantile regression. However, ensuring that the learned policy satisfies safety constraints remains a challenge when these constraints are not explicitly integrated into the RL framework. Existing methods often require complex neural architectures or manual trade-off tuning of combined cost functions. To address this, we propose a risk-regularized quantile-based algorithm integrating Conditional Value-at-Risk (CVaR) to enforce safety without complex architectures. We also provide theoretical guarantees on the contraction properties of the risk-sensitive distributional Bellman operator in Wasserstein space, ensuring convergence to a unique cost distribution. Simulations of a mobile robot in a dynamic reach-avoid task show that our approach leads to more goal successes, fewer collisions, and better safety–performance trade-offs compared to risk-neutral methods.
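The two ingredients the abstract combines can be sketched compactly: a pinball (quantile-regression) loss for fitting cost quantiles, and a CVaR estimate taken over the worst tail of those quantiles. This is a minimal illustration only; the function names are hypothetical and the paper's exact regularization scheme is not reproduced here.

```python
import numpy as np

def pinball_loss(quantile_preds, target_samples, taus):
    """Quantile-regression (pinball) loss used in distributional RL
    to fit a set of cost quantiles without assuming a density model.
    quantile_preds: (N,) predicted quantiles at levels taus: (N,).
    target_samples: (M,) sampled cost-to-go targets."""
    # Broadcast residuals to shape (N, M).
    u = target_samples[None, :] - quantile_preds[:, None]
    return np.mean(np.maximum(taus[:, None] * u, (taus[:, None] - 1.0) * u))

def cvar_from_quantiles(quantile_preds, alpha):
    """CVaR_alpha of a *cost* distribution approximated by equally
    weighted quantile atoms: the mean of the worst (highest-cost)
    alpha fraction of the atoms."""
    n = len(quantile_preds)
    k = max(1, int(np.ceil(alpha * n)))   # number of tail atoms
    worst = np.sort(quantile_preds)[-k:]  # highest-cost quantiles
    return float(np.mean(worst))
```

A CVaR term of this kind can be added to the quantile-regression objective as a penalty, steering the learned policy away from actions whose cost distributions have heavy upper tails.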
Problem

Research questions and friction points this paper is trying to address.

Overcoming overestimation bias in reinforcement learning algorithms
Ensuring safety constraints in learned policies without complex architectures
Improving safety-performance trade-offs in stochastic environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Risk-sensitive quantile regression for bias reduction
CVaR integration ensures safety constraints
Theoretical guarantees on distributional convergence
👥 Authors
Clinton Enwerem
Institute for Systems Research, University of Maryland, College Park, United States
Aniruddh Gopinath Puranic
Institute for Systems Research, University of Maryland, College Park, United States
John S. Baras
Professor of Electrical and Computer Engineering, University of Maryland
systems theory, control theory, communication networks, signal processing, optimization
C. Belta
Institute for Systems Research, University of Maryland, College Park, United States