Policy Gradient Methods for Risk-Sensitive Distributional Reinforcement Learning with Provable Convergence

📅 2024-05-23
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
To address the lack of convergence guarantees and unified modeling capabilities in policy gradient methods for risk-sensitive reinforcement learning, this paper proposes the first provably convergent risk-sensitive distributional policy gradient framework. Methodologically, the authors (1) derive the first analytical gradient expression of the return distribution with respect to the policy parameters; (2) design a novel algorithm, the Categorical Distributional Policy Gradient (CDPG), that achieves both finite-support optimality and finite-iteration convergence; and (3) ensure compatibility with a broad class of coherent risk measures. The theoretical analysis leverages tools from stochastic optimization to establish convergence and risk-sensitivity properties. Empirical evaluation on stochastic Cliffwalk and CartPole benchmarks demonstrates improvements in robustness, reliability, and risk mitigation compared to existing approaches.
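
To make the categorical construction concrete, here is a minimal sketch (illustrative only, not the authors' implementation; the choice of CVaR and all names such as `cvar_categorical` are assumptions) of evaluating a coherent risk measure on a cost distribution approximated by a categorical family over fixed support points:

```python
import numpy as np

def cvar_categorical(atoms, probs, alpha=0.1):
    """CVaR_alpha of a cost distribution supported on fixed atoms.

    atoms: support points z_1, ..., z_K (cumulative costs)
    probs: probabilities p_1, ..., p_K summing to 1
    alpha: tail level; CVaR averages the worst alpha-fraction of costs
    """
    order = np.argsort(atoms)  # sort atoms in ascending cost order
    z = np.asarray(atoms, dtype=float)[order]
    p = np.asarray(probs, dtype=float)[order]
    tail_mass, tail_cost = 0.0, 0.0
    # accumulate probability mass from the highest-cost atoms downward
    for zk, pk in zip(z[::-1], p[::-1]):
        take = min(pk, alpha - tail_mass)
        tail_cost += take * zk
        tail_mass += take
        if tail_mass >= alpha:
            break
    return tail_cost / alpha

# Example: a policy that usually incurs cost 0 but occasionally cost 10.
print(cvar_categorical([0.0, 1.0, 10.0], [0.8, 0.1, 0.1], alpha=0.1))  # 10.0
```

A risk-sensitive policy gradient method would then differentiate such a statistic through the atom probabilities produced by the parameterized policy; the paper's stated contribution is an analytical form of this gradient for any distribution, together with finite-support optimality and finite-iteration convergence guarantees.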

📝 Abstract
Risk-sensitive reinforcement learning (RL) is crucial for maintaining reliable performance in high-stakes applications. While traditional RL methods aim to learn a point estimate of the random cumulative cost, distributional RL (DRL) seeks to estimate the entire distribution of it, which leads to a unified framework for handling different risk measures. However, developing policy gradient methods for risk-sensitive DRL is inherently more complex as it involves finding the gradient of a probability measure. This paper introduces a new policy gradient method for risk-sensitive DRL with general coherent risk measures, where we provide an analytical form of the probability measure's gradient for any distribution. For practical use, we design a categorical distributional policy gradient algorithm (CDPG) that approximates any distribution by a categorical family supported on some fixed points. We further provide a finite-support optimality guarantee and a finite-iteration convergence guarantee under inexact policy evaluation and gradient estimation. Through experiments on stochastic Cliffwalk and CartPole environments, we illustrate the benefits of considering a risk-sensitive setting in DRL.
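
As standard background (not taken from this paper): the conditional value-at-risk, a canonical example of the coherent risk measures the abstract refers to, admits the Rockafellar-Uryasev variational form for a random cost $Z$ at tail level $\alpha \in (0, 1]$:

$$\mathrm{CVaR}_\alpha(Z) = \min_{t \in \mathbb{R}} \Big\{ t + \frac{1}{\alpha}\,\mathbb{E}\big[(Z - t)^{+}\big] \Big\},$$

which, on a categorical distribution with finitely many atoms, reduces to a finite minimization over the atom locations and is therefore cheap to evaluate inside a policy gradient loop.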
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning
Risk-sensitive Problems
Policy Gradient Methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Risk-sensitive DRL
Categorical Distributional Policy Gradient (CDPG)
Optimality and Convergence
Minheng Xiao
Department of Integrated Systems Engineering, The Ohio State University, Columbus, OH, USA
Xian Yu
Assistant Professor, The Ohio State University
Optimization under uncertainty, stochastic programming, distributionally robust optimization, integer programming
Lei Ying
Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI, USA