On the Global Convergence of Risk-Averse Policy Gradient Methods with Expected Conditional Risk Measures

📅 2023-01-26
🏛️ International Conference on Machine Learning
📈 Citations: 5
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates the global convergence of policy gradient (PG) and natural policy gradient (NPG) methods for risk-sensitive reinforcement learning under Expected Conditional Risk Measures (ECRMs). For these time-consistent dynamic risk measures, we develop a unified algorithmic framework covering four settings: PG with constrained direct parameterization, PG with log-barrier-regularized softmax parameterization, NPG with entropy-regularized softmax parameterization, and approximate NPG with inexact policy evaluation. We establish, for the first time, rigorous global optimality guarantees and iteration complexity bounds for ECRM-based risk optimization, achieving $O(1/\varepsilon^2)$ for PG and $O(1/\varepsilon)$ for NPG, thereby filling a theoretical gap in globally convergent risk-sensitive RL. Empirical evaluation on a stochastic Cliffwalk environment demonstrates that the proposed algorithms effectively mitigate risk while maintaining stability and convergence.
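For orientation, dynamic time-consistent risk measures of this kind compose one-step conditional risk mappings recursively. A minimal sketch of such a nested objective, assuming illustrative notation (per-stage costs $c_t$, one-step conditional risk mappings $\rho_t$, discount $\gamma$) that may differ from the paper's exact ECRM definition:

```latex
% Hedged sketch of a nested, time-consistent risk-averse objective.
% ECRMs interleave expectations with one-step conditional risk measures
% inside each mapping rho_t; notation here is illustrative, not the paper's.
\min_{\pi}\; c_0 \;+\; \rho_1\!\Big( \gamma c_1 \;+\; \rho_2\big( \gamma^2 c_2 \;+\; \cdots \big) \Big)
```

Because the mappings are applied stage by stage, the objective admits a Bellman-style recursion, which is what makes PG/NPG analyses with $O(1/\varepsilon^2)$ and $O(1/\varepsilon)$ rates tractable.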
📝 Abstract
Risk-sensitive reinforcement learning (RL) has become a popular tool for controlling the risk of uncertain outcomes and ensuring reliable performance in highly stochastic sequential decision-making problems. While Policy Gradient (PG) methods have been developed for risk-sensitive RL, it remains unclear if these methods enjoy the same global convergence guarantees as in the risk-neutral case (Mei et al., 2020; Agarwal et al., 2021; Cen et al., 2022; Bhandari & Russo, 2024). In this paper, we consider a class of dynamic time-consistent risk measures, named Expected Conditional Risk Measures (ECRMs), and derive PG and Natural Policy Gradient (NPG) updates for ECRMs-based RL problems. We provide global optimality and iteration complexities of the proposed algorithms under the following four settings: (i) PG with constrained direct parameterization, (ii) PG with softmax parameterization and log barrier regularization, (iii) NPG with softmax parameterization and entropy regularization, and (iv) approximate NPG with inexact policy evaluation. Furthermore, we test a risk-averse REINFORCE algorithm (Williams, 1992) and a risk-averse NPG algorithm (Kakade, 2001) on a stochastic Cliffwalk environment to demonstrate the efficacy of our methods and the importance of risk control.
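To make the nested objective concrete, below is a minimal sketch of risk-averse policy evaluation in a tabular MDP, using a one-step mean-upper-semideviation mapping $\rho(Z) = \mathbb{E}[Z] + \beta\,\mathbb{E}[(Z-\mathbb{E}[Z])_+]$ as the conditional risk measure. The function name `risk_averse_eval`, the toy MDP, the choice of risk mapping, and the hyperparameters `gamma` and `beta` are all illustrative assumptions, not the paper's implementation; the paper's PG/NPG methods optimize the gradient of this kind of nested value.

```python
import numpy as np

def risk_averse_eval(P, c, pi, gamma=0.95, beta=0.5, iters=500):
    """Fixed-point iteration of a nested risk-averse Bellman operator.

    P: (S, A, S) transition tensor, c: (S, A) per-step costs,
    pi: (S, A) policy. Uses rho(Z) = E[Z] + beta * E[(Z - E[Z])_+],
    a coherent one-step risk mapping for beta in [0, 1].
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(iters):
        mean_next = P @ V                                   # (S, A): E[V(s')]
        dev = np.maximum(V[None, None, :] - mean_next[:, :, None], 0.0)
        semidev = np.einsum('sax,sax->sa', P, dev)          # E[(V - E[V])_+]
        Q = c + gamma * (mean_next + beta * semidev)        # risk-adjusted Q
        V = np.sum(pi * Q, axis=1)                          # average over actions
    return V

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A = 4, 2
    P = rng.dirichlet(np.ones(S), size=(S, A))   # random transition kernel
    c = rng.uniform(size=(S, A))                 # random per-step costs
    pi = np.full((S, A), 1.0 / A)                # uniform policy
    print(risk_averse_eval(P, c, pi))
```

With `beta = 0` this reduces to standard risk-neutral policy evaluation; larger `beta` penalizes upside deviations of the cost-to-go, which is the risk-control effect demonstrated on the Cliffwalk environment.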
Problem

Research questions and friction points this paper is trying to address.

Global convergence of risk-averse policy gradient methods
Risk-sensitive reinforcement learning with dynamic risk measures
Optimality and complexity of ECRM-based RL algorithms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Policy Gradient methods with Expected Conditional Risk Measures
Global convergence guarantees for risk-sensitive reinforcement learning
Natural Policy Gradient updates for dynamic time-consistent risk (sketched below)
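In the risk-neutral case, setting (iii) above, NPG with softmax parameterization and entropy regularization, admits a well-known closed-form multiplicative-weights update (Cen et al., 2022); the paper derives the ECRM analog. A hedged sketch of that risk-neutral template, with illustrative step size `eta` and regularization strength `tau`:

```python
import numpy as np

def npg_entropy_step(pi, Q, eta=0.1, tau=0.01):
    """One closed-form entropy-regularized NPG step under softmax parameterization.

    Risk-neutral template for reward maximization:
        pi' ∝ pi^(1 - eta*tau) * exp(eta * Q),
    where the paper's risk-averse (ECRM) version replaces Q with a
    risk-adjusted counterpart. pi, Q: (num_states, num_actions) arrays;
    eta and tau are illustrative values, not the paper's choices.
    """
    logits = (1.0 - eta * tau) * np.log(pi) + eta * Q
    logits -= logits.max(axis=1, keepdims=True)   # for numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)
```

The multiplicative form is what yields the fast $O(1/\varepsilon)$ iteration complexity: each step contracts toward the regularized optimal policy at a geometric rate for suitable `eta`.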
Xian Yu
Assistant Professor, The Ohio State University
Optimization under uncertainty · Stochastic programming · Distributionally robust optimization · Integer programming
Lei Ying
Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI, USA