🤖 AI Summary
Existing risk-sensitive reinforcement learning (RSRL) methods suffer from inherent bias when optimizing arbitrary risk measures and lack theoretical guarantees of risk-optimality or monotonic improvement. This work identifies a fundamental bias in the distributional Bellman operator under risk optimization and proposes Trajectory Q-Learning (TQL), the first RSRL framework supporting arbitrary differentiable risk measures with unbiased estimation and provable convergence. TQL operates on trajectory-level value estimation, integrating a distributional RL architecture with a plug-and-play risk measure module. The authors establish theoretical convergence to the risk-optimal policy under mild assumptions. Empirical evaluation across diverse risk-sensitive settings, including CVaR, entropy regularization, and spectral risk measures, demonstrates that TQL consistently outperforms state-of-the-art baselines in both policy performance and training stability.
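As a point of reference for the risk measures named above, here is a minimal sketch of CVaR (Conditional Value-at-Risk) over a sample of episode returns: the mean of the worst α-fraction of outcomes. The function name `cvar` and the sample-based estimator are illustrative assumptions, not part of the TQL implementation.

```python
import numpy as np

def cvar(returns, alpha=0.1):
    """Sample-based CVaR at level alpha: the mean of the worst
    alpha-fraction of returns (lower returns = worse outcomes).
    Illustrative helper, not the paper's risk measure module."""
    sorted_r = np.sort(np.asarray(returns, dtype=float))
    # Number of samples in the worst alpha-tail (at least one).
    k = max(1, int(np.ceil(alpha * len(sorted_r))))
    return sorted_r[:k].mean()

# A risk-sensitive objective prefers distributions with a heavier
# "good" tail: CVaR penalizes the worst outcomes, unlike the mean.
returns = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(cvar(returns, alpha=0.2))  # mean of the two worst returns: 1.5
```

A risk-neutral agent maximizing the mean would treat this distribution like any other with average 5.5; a CVaR(0.2)-maximizing agent instead scores it by its worst 20% of outcomes.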
📝 Abstract
Because learning policies for real-world applications often requires managing risk, risk-sensitive reinforcement learning (RSRL) has emerged as an important research direction. RSRL is typically achieved by optimizing risk-sensitive objectives characterized by various risk measures, under the framework of distributional reinforcement learning. However, it remains unclear whether the distributional Bellman operator properly optimizes the RSRL objective in the sense of risk measures. In this paper, we prove that existing RSRL methods do not achieve unbiased optimization and cannot guarantee optimality, or even improvement, with respect to risk measures over accumulated return distributions. To remedy this issue, we propose a novel algorithm, Trajectory Q-Learning (TQL), for RSRL problems with provable convergence to the optimal policy. Built on our new learning architecture, TQL admits a general and practical implementation for different risk measures, enabling the learning of disparate risk-sensitive policies. In experiments, we verify the learnability of our algorithm and show that it achieves better performance on risk-sensitive objectives.