On the Global Convergence of Risk-Averse Policy Gradient Methods with Expected Conditional Risk Measures

📅 2023-01-26
🏛️ International Conference on Machine Learning
📈 Citations: 5
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates the global convergence of policy gradient (PG) and natural policy gradient (NPG) methods for risk-sensitive reinforcement learning under Expected Conditional Risk Measures (ECRMs). For these time-consistent dynamic risk measures, we develop a unified algorithmic framework covering four settings: PG with constrained direct parameterization, PG with log-barrier-regularized softmax parameterization, NPG with entropy-regularized softmax parameterization, and approximate NPG with inexact policy evaluation. We establish, for the first time, rigorous global optimality guarantees and iteration complexity bounds for ECRM-based risk optimization, achieving $O(1/\varepsilon^2)$ for PG and $O(1/\varepsilon)$ for NPG, thereby filling a theoretical gap in globally convergent risk-sensitive RL. Empirical evaluation on a stochastic Cliffwalk environment demonstrates that the proposed algorithms effectively mitigate risk while maintaining stability and convergence.
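For orientation, dynamic time-consistent risk measures of this kind compose one-step conditional risk mappings recursively. A minimal sketch of such a nested objective, assuming illustrative notation (per-stage costs $c_t$, one-step conditional risk mappings $\rho_t$, discount $\gamma$) that may differ from the paper's exact ECRM definition:

```latex
% Hedged sketch of a nested, time-consistent risk-averse objective.
% ECRMs interleave expectations with one-step conditional risk measures
% inside each mapping rho_t; notation here is illustrative, not the paper's.
\min_{\pi}\; c_0 \;+\; \rho_1\!\Big( \gamma c_1 \;+\; \rho_2\big( \gamma^2 c_2 \;+\; \cdots \big) \Big)
```

Because the mappings are applied stage by stage, the objective admits a Bellman-style recursion, which is what makes PG/NPG analyses with $O(1/\varepsilon^2)$ and $O(1/\varepsilon)$ rates tractable.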
📝 Abstract
Risk-sensitive reinforcement learning (RL) has become a popular tool for controlling the risk of uncertain outcomes and ensuring reliable performance in highly stochastic sequential decision-making problems. While Policy Gradient (PG) methods have been developed for risk-sensitive RL, it remains unclear if these methods enjoy the same global convergence guarantees as in the risk-neutral case (Mei et al., 2020; Agarwal et al., 2021; Cen et al., 2022; Bhandari & Russo, 2024). In this paper, we consider a class of dynamic time-consistent risk measures, named Expected Conditional Risk Measures (ECRMs), and derive PG and Natural Policy Gradient (NPG) updates for ECRMs-based RL problems. We provide global optimality and iteration complexities of the proposed algorithms under the following four settings: (i) PG with constrained direct parameterization, (ii) PG with softmax parameterization and log barrier regularization, (iii) NPG with softmax parameterization and entropy regularization, and (iv) approximate NPG with inexact policy evaluation. Furthermore, we test a risk-averse REINFORCE algorithm (Williams, 1992) and a risk-averse NPG algorithm (Kakade, 2001) on a stochastic Cliffwalk environment to demonstrate the efficacy of our methods and the importance of risk control.
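To make the nested objective concrete, below is a minimal sketch of risk-averse policy evaluation in a tabular MDP, using a one-step mean-upper-semideviation mapping $\rho(Z) = \mathbb{E}[Z] + \beta\,\mathbb{E}[(Z-\mathbb{E}[Z])_+]$ as the conditional risk measure. The function name `risk_averse_eval`, the toy MDP, the choice of risk mapping, and the hyperparameters `gamma` and `beta` are all illustrative assumptions, not the paper's implementation; the paper's PG/NPG methods optimize the gradient of this kind of nested value.

```python
import numpy as np

def risk_averse_eval(P, c, pi, gamma=0.95, beta=0.5, iters=500):
    """Fixed-point iteration of a nested risk-averse Bellman operator.

    P: (S, A, S) transition tensor, c: (S, A) per-step costs,
    pi: (S, A) policy. Uses rho(Z) = E[Z] + beta * E[(Z - E[Z])_+],
    a coherent one-step risk mapping for beta in [0, 1].
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(iters):
        mean_next = P @ V                                   # (S, A): E[V(s')]
        dev = np.maximum(V[None, None, :] - mean_next[:, :, None], 0.0)
        semidev = np.einsum('sax,sax->sa', P, dev)          # E[(V - E[V])_+]
        Q = c + gamma * (mean_next + beta * semidev)        # risk-adjusted Q
        V = np.sum(pi * Q, axis=1)                          # average over actions
    return V

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A = 4, 2
    P = rng.dirichlet(np.ones(S), size=(S, A))   # random transition kernel
    c = rng.uniform(size=(S, A))                 # random per-step costs
    pi = np.full((S, A), 1.0 / A)                # uniform policy
    print(risk_averse_eval(P, c, pi))
```

With `beta = 0` this reduces to standard risk-neutral policy evaluation; larger `beta` penalizes upside deviations of the cost-to-go, which is the risk-control effect demonstrated on the Cliffwalk environment.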
Problem

Research questions and friction points this paper is trying to address.

Global convergence of risk-averse policy gradient methods
Risk-sensitive reinforcement learning with dynamic risk measures
Optimality and complexity of ECRM-based RL algorithms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Policy Gradient methods with Expected Conditional Risk Measures
Global convergence guarantees for risk-sensitive reinforcement learning
Natural Policy Gradient updates for dynamic time-consistent risk (sketched below)
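In the risk-neutral case, setting (iii) above, NPG with softmax parameterization and entropy regularization, admits a well-known closed-form multiplicative-weights update (Cen et al., 2022); the paper derives the ECRM analog. A hedged sketch of that risk-neutral template, with illustrative step size `eta` and regularization strength `tau`:

```python
import numpy as np

def npg_entropy_step(pi, Q, eta=0.1, tau=0.01):
    """One closed-form entropy-regularized NPG step under softmax parameterization.

    Risk-neutral template for reward maximization:
        pi' ∝ pi^(1 - eta*tau) * exp(eta * Q),
    where the paper's risk-averse (ECRM) version replaces Q with a
    risk-adjusted counterpart. pi, Q: (num_states, num_actions) arrays;
    eta and tau are illustrative values, not the paper's choices.
    """
    logits = (1.0 - eta * tau) * np.log(pi) + eta * Q
    logits -= logits.max(axis=1, keepdims=True)   # for numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)
```

The multiplicative form is what yields the fast $O(1/\varepsilon)$ iteration complexity: each step contracts toward the regularized optimal policy at a geometric rate for suitable `eta`.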
Xian Yu
Assistant Professor, The Ohio State University
Optimization under uncertainty · Stochastic programming · Distributionally robust optimization · Integer programming
Lei Ying
Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI, USA