🤖 AI Summary
This paper studies risk-sensitive Markov decision processes (MDPs) with unknown parameters and general convex loss functions. Because epistemic uncertainty is modeled via a Bayesian posterior, the loss structure becomes non-interchangeable, rendering classical Bellman equations inapplicable. To address this, we propose a novel policy gradient method: for the first time, we extend the envelope theorem to continuous risk-sensitive MDPs, integrating dual risk representations with Bayesian posterior inference to enable policy optimization under non-interchangeable losses. We establish global convergence of the algorithm in the episodic setting, with a convergence rate of $O(T^{-1/2} + r^{-1/2})$, and derive a tight upper bound on the number of iterations required to achieve $O(\varepsilon)$-accuracy. Our key contribution is overcoming the interchangeability restriction, yielding the first computationally tractable policy gradient framework for Bayesian risk-sensitive MDPs with rigorous theoretical guarantees.
📝 Abstract
Motivated by many application problems, we consider Markov decision processes (MDPs) with a general loss function and unknown parameters. To mitigate the epistemic uncertainty associated with the unknown parameters, we take a Bayesian approach to estimate the parameters from data and impose a coherent risk functional (with respect to the Bayesian posterior distribution) on the loss. Since this formulation usually does not satisfy the interchangeability principle, it does not admit Bellman equations and cannot be solved by approaches based on dynamic programming. Therefore, we propose a policy gradient optimization method, leveraging the dual representation of coherent risk measures and extending the envelope theorem to the continuous case. We then provide a stationarity analysis of the algorithm with a convergence rate of $O(T^{-1/2}+r^{-1/2})$, where $T$ is the number of policy gradient iterations and $r$ is the sample size of the gradient estimator. We further extend our algorithm to an episodic setting, establish the global convergence of the extended algorithm, and bound the number of iterations needed to achieve an error of $O(\varepsilon)$ in each episode.
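To make the abstract's recipe concrete, the following is a minimal, illustrative sketch (not the paper's algorithm or notation) of one risk-sensitive policy gradient iteration: draw $r$ parameter samples from the Bayesian posterior, evaluate the loss and its gradient under each sample, form the dual-representation weights of a coherent risk measure (CVaR is used here as a concrete example), and take a weighted gradient step. The functions `loss_fn`, `grad_fn`, and all parameter names are hypothetical placeholders.

```python
import numpy as np

def policy_gradient_step(theta, posterior_samples, loss_fn, grad_fn,
                         alpha=0.9, lr=0.01):
    """One illustrative risk-sensitive policy gradient step.

    Uses the dual representation of CVaR at level alpha: the risk
    functional equals a worst-case reweighting of the posterior, which
    here puts uniform weight on the (1 - alpha) fraction of posterior
    samples with the largest losses. All names are assumptions for
    illustration, not the paper's notation.
    """
    losses = np.array([loss_fn(theta, p) for p in posterior_samples])
    grads = np.array([grad_fn(theta, p) for p in posterior_samples])
    r = len(posterior_samples)
    k = max(1, int(np.ceil((1 - alpha) * r)))   # size of the loss tail
    weights = np.zeros(r)
    weights[np.argsort(losses)[-k:]] = 1.0 / k  # dual (tail) weights
    risk_grad = weights @ grads                  # risk-weighted gradient
    return theta - lr * risk_grad

# Toy usage: quadratic loss whose minimizer depends on the unknown
# parameter p, with a Gaussian stand-in for the Bayesian posterior.
rng = np.random.default_rng(0)
posterior = rng.normal(1.0, 0.5, size=200)       # r posterior samples
loss = lambda th, p: (th - p) ** 2
grad = lambda th, p: 2.0 * (th - p)

theta = 0.0
for _ in range(500):                             # T gradient iterations
    theta = policy_gradient_step(theta, posterior, loss, grad)
```

The dual weights are what replace the missing Bellman recursion: because the risk functional reweights the whole posterior based on the realized losses, the objective cannot be decomposed stage by stage, but its gradient can still be estimated by sampling, which is the envelope-theorem step the abstract refers to.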