Policy Gradient Methods for Risk-Sensitive Distributional Reinforcement Learning with Provable Convergence

📅 2024-05-23
🏛️ arXiv.org
📈 Citations: 1
Influential: 0

🤖 AI Summary
To address the lack of convergence guarantees and of a unified treatment of risk measures in policy gradient methods for risk-sensitive reinforcement learning, this paper proposes a provably convergent policy gradient framework for risk-sensitive distributional RL. Methodologically, the paper (1) derives an analytical expression for the gradient of the return distribution (a probability measure) with respect to the policy parameters; (2) designs a categorical distributional policy gradient algorithm (CDPG), which approximates the return distribution by a categorical family on fixed support points and comes with a finite-support optimality guarantee and a finite-iteration convergence guarantee under inexact policy evaluation and gradient estimation; and (3) supports general coherent risk measures. The theoretical analysis draws on tools from stochastic optimization. Experiments on stochastic Cliffwalk and CartPole environments illustrate the benefits of the risk-sensitive setting.

📝 Abstract
Risk-sensitive reinforcement learning (RL) is crucial for maintaining reliable performance in high-stakes applications. While traditional RL methods aim to learn a point estimate of the random cumulative cost, distributional RL (DRL) seeks to estimate the entire distribution of it, which leads to a unified framework for handling different risk measures. However, developing policy gradient methods for risk-sensitive DRL is inherently more complex as it involves finding the gradient of a probability measure. This paper introduces a new policy gradient method for risk-sensitive DRL with general coherent risk measures, where we provide an analytical form of the probability measure's gradient for any distribution. For practical use, we design a categorical distributional policy gradient algorithm (CDPG) that approximates any distribution by a categorical family supported on some fixed points. We further provide a finite-support optimality guarantee and a finite-iteration convergence guarantee under inexact policy evaluation and gradient estimation. Through experiments on stochastic Cliffwalk and CartPole environments, we illustrate the benefits of considering a risk-sensitive setting in DRL.
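The categorical approximation and the risk evaluation mentioned in the abstract can be illustrated roughly as follows. This is a minimal sketch, not the paper's CDPG algorithm: the linear-interpolation projection is borrowed from the categorical distributional RL literature and is an assumption here, CVaR is picked as one example of a coherent risk measure, and the function names `project_to_support` and `cvar_cost` are hypothetical.

```python
from bisect import bisect_right


def project_to_support(samples, support):
    """Project sampled cumulative costs onto a fixed, sorted support.

    Each sample's mass is split linearly between its two neighboring
    atoms; samples outside the support are clipped to the nearest end.
    (Assumed projection scheme, not taken from the paper.)
    """
    n = len(samples)
    probs = [0.0] * len(support)
    for x in samples:
        x = min(max(x, support[0]), support[-1])  # clip to support range
        hi = bisect_right(support, x)
        if hi == len(support):  # x sits exactly on the last atom
            probs[-1] += 1.0 / n
            continue
        lo = hi - 1
        w_hi = (x - support[lo]) / (support[hi] - support[lo])
        probs[lo] += (1.0 - w_hi) / n
        probs[hi] += w_hi / n
    return probs


def cvar_cost(support, probs, alpha):
    """CVaR at level alpha (0 <= alpha < 1) of a categorical cost
    distribution: the mean cost over the worst (1 - alpha) tail."""
    tail = 1.0 - alpha
    remaining = tail
    total = 0.0
    # Walk from the highest cost downward, consuming tail mass.
    for z, p in sorted(zip(support, probs), reverse=True):
        take = min(p, remaining)
        total += take * z
        remaining -= take
        if remaining <= 0.0:
            break
    return total / tail


support = [0.0, 1.0, 2.0, 3.0]      # fixed atoms (illustrative values)
samples = [0.5, 1.0, 2.5, 3.0]      # made-up cumulative-cost samples
probs = project_to_support(samples, support)
risk_neutral = cvar_cost(support, probs, alpha=0.0)   # equals the mean
risk_averse = cvar_cost(support, probs, alpha=0.75)   # worst 25% of mass
```

With `alpha=0.0` the CVaR reduces to the expected cost, while larger `alpha` weights only the high-cost tail, which is the sense in which a single distributional estimate can serve different risk measures.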
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning
Risk-sensitive Problems
Policy Gradient Methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Risk-sensitive DRL
Categorical Distributional Policy Gradient (CDPG)
Optimality and Convergence
🔎 Similar Papers