Planning and Learning in Average Risk-aware MDPs

📅 2025-03-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses two challenges in average-risk-sensitive Markov decision processes (MDPs): modeling dynamic risk measures and the lack of convergence guarantees. We propose the first convergent off-policy Q-learning framework for utility-based shortfall risk measures. Methodologically, we introduce the first integration of relative value iteration (RVI) with multilevel Monte Carlo (MLMC) Q-learning, yielding a solution paradigm for average-risk-sensitive MDPs that supports continuous adjustment of risk preferences. We establish rigorous convergence proofs for both the RVI and MLMC Q-learning algorithms. Empirical evaluations confirm the algorithm's effectiveness in fine-grained risk control and near-optimal policy performance. Our main contributions are: (i) extending average-cost MDPs to dynamic risk-measure settings; (ii) developing the first off-policy risk-sensitive Q-learning algorithm with provable convergence guarantees; and (iii) providing a new tool for robust sequential decision-making in continuing tasks.
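The off-policy algorithm targets utility-based shortfall risk (UBSR) measures. As a rough illustration of the risk measure itself (not the paper's learning rule), a UBSR value can be computed from cost samples by bisection; the loss function, acceptance level, and bracketing interval below are illustrative assumptions:

```python
import math

def ubsr(samples, loss, lam, lo=-100.0, hi=100.0, tol=1e-10):
    """Utility-based shortfall risk of a cost distribution, from samples:

        SR(X) = inf{ t : E[loss(X - t)] <= lam }.

    Since loss is increasing, E[loss(X - t)] is nonincreasing in t,
    so the threshold can be found by bisection on t.
    """
    def excess(t):
        return sum(loss(x - t) for x in samples) / len(samples) - lam

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if excess(mid) <= 0.0:
            hi = mid   # t = mid is acceptable; the infimum is at or below it
        else:
            lo = mid
    return hi

# Example: exponential loss and lam = 1 give the entropic risk
# t* = log(E[exp(X)]), which the bisection recovers numerically.
samples = [0.0, 1.0, 2.0]
print(ubsr(samples, math.exp, 1.0))
```

With the exponential loss the closed form t* = log(mean(exp(X))) is available, which makes this a convenient sanity check; in general any increasing convex loss can be plugged in.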

📝 Abstract
For continuing tasks, average-cost Markov decision processes have well-documented value and can be solved with efficient algorithms. However, this formulation explicitly assumes that the agent is risk-neutral. In this work, we extend risk-neutral algorithms to accommodate the more general class of dynamic risk measures. Specifically, we propose a relative value iteration (RVI) algorithm for planning and design two model-free Q-learning algorithms: a generic algorithm based on the multi-level Monte Carlo (MLMC) method, and an off-policy algorithm dedicated to utility-based shortfall risk measures. Both the RVI and MLMC-based Q-learning algorithms are proven to converge to optimality. Numerical experiments validate our analysis, confirm empirically the convergence of the off-policy algorithm, and demonstrate that our approach enables the identification of policies finely tuned to the intricate risk awareness of the agents they serve.
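As a rough illustration of the planning half, the sketch below runs relative value iteration with a one-step risk mapping in place of the usual expectation. CVaR over the next-state distribution serves as a stand-in for the paper's dynamic risk measures; the toy MDP shape, the CVaR implementation, and the stopping rule are all illustrative assumptions, not the paper's exact operator:

```python
import numpy as np

def cvar(values, probs, alpha=0.9):
    """One-step CVaR_alpha of a discrete cost distribution (worst tail).

    Stand-in for a dynamic one-step risk mapping; alpha = 0 recovers
    the plain expectation, i.e. the risk-neutral case.
    """
    order = np.argsort(values)[::-1]        # sort costs worst-first
    v, p = values[order], probs[order]
    tail = 1.0 - alpha                      # probability mass of the tail
    acc, total = 0.0, 0.0
    for vi, pi in zip(v, p):
        take = min(pi, tail - acc)          # consume mass up to the tail
        total += take * vi
        acc += take
        if acc >= tail - 1e-12:
            break
    return total / tail

def risk_aware_rvi(P, c, alpha=0.9, ref_state=0, tol=1e-8, max_iter=10_000):
    """Relative value iteration with a one-step risk measure.

    P: (S, A, S) transition probabilities; c: (S, A) one-step costs.
    Returns (gain, relative values, greedy policy). Hypothetical sketch.
    """
    S, A, _ = P.shape
    h = np.zeros(S)
    for _ in range(max_iter):
        q = np.empty((S, A))
        for s in range(S):
            for a in range(A):
                q[s, a] = c[s, a] + cvar(h, P[s, a], alpha)
        h_new = q.min(axis=1)
        gain = h_new[ref_state]             # estimate of the average (risk-aware) cost
        h_new = h_new - gain                # anchor at the reference state
        if np.max(np.abs(h_new - h)) < tol:
            h = h_new
            break
        h = h_new
    return gain, h, q.argmin(axis=1)
```

Subtracting the reference-state value each sweep keeps the iterates bounded, which is the usual RVI device; the risk mapping only changes how the next-state values are aggregated.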
Problem

Research questions and friction points this paper is trying to address.

Extend risk-neutral algorithms to dynamic risk measures
Develop Q-learning algorithms for risk-aware MDPs
Prove convergence and validate risk-aware policy identification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends risk-neutral algorithms to dynamic risk measures
Proposes relative value iteration for risk-aware planning
Develops model-free Q-learning with multi-level Monte Carlo
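The multi-level Monte Carlo ingredient can be sketched generically. The building block is an antithetic correction term that vanishes for linear functionals, so only the nonlinearity of the risk measure incurs variance; combining it with a randomized level gives an unbiased estimate of g(E[X]) for a nonlinear g. This is a hedged Blanchet-Glynn-style sketch under illustrative assumptions, not the paper's exact Q-learning update:

```python
import random

def mlmc_delta(g, samples):
    """Antithetic MLMC correction:
    g(full-batch mean) - average of g over the odd/even half means.
    For a linear g this is exactly zero, so the correction only
    pays for the nonlinearity of g.
    """
    n = len(samples)
    full = sum(samples) / n
    odd = sum(samples[0::2]) / (n // 2)
    even = sum(samples[1::2]) / (n // 2)
    return g(full) - 0.5 * (g(odd) + g(even))

def mlmc_estimate(g, draw, r=0.6, rng=None):
    """Single unbiased estimate of g(E[X]) via a randomized level.

    draw(k) returns k i.i.d. samples of X; the level N is geometric,
    and dividing the correction by P(N = n) makes the telescoping
    sum unbiased in expectation.
    """
    rng = rng or random.Random()
    level = 0
    while rng.random() > r:                 # N ~ Geometric(r)
        level += 1
    p_level = r * (1.0 - r) ** level        # P(N = level)
    batch = draw(2 ** (level + 1))
    return g(draw(1)[0]) + mlmc_delta(g, batch) / p_level
```

In the Q-learning context, g would play the role of the risk measure applied to bootstrapped next-state values; the point of the construction is that the update remains unbiased despite that nonlinearity.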