Quantile Markov Decision Process

📅 2017-11-15
🏛️ arXiv.org
📈 Citations: 5
Influential: 1
🤖 AI Summary
Traditional Markov decision processes (MDPs) optimize the expected cumulative reward, which is insufficient for risk-sensitive decision-making where controlling tail-risk—e.g., guaranteeing high probability of achieving a minimum reward—is critical. Method: This paper introduces the Quantile MDP (QMDP) framework, the first systematic formulation for optimizing quantiles (e.g., 5%-ile) of the cumulative reward distribution. It establishes rigorous theoretical foundations, proving existence and characterizing the structure of optimal policies—which are inherently non-Markovian and stochastic. The proposed algorithm integrates ordinal optimization, extended dynamic programming, and quantile-specific recursive computation, combining policy iteration with Monte Carlo estimation for efficiency. Contribution/Results: Evaluated on grid-world navigation and HIV treatment simulation, QMDP significantly improves low-quantile (e.g., 5%) cumulative rewards compared to expectation-based baselines, yielding more robust and risk-controllable policies without compromising average performance.
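The summary above contrasts the usual expected-reward objective with a low-quantile objective on the cumulative-reward distribution. A minimal sketch of that distinction, using plain Monte Carlo estimation on a hypothetical two-policy toy problem (not the paper's grid-world or HIV environments, and not its actual algorithm):

```python
import random

def rollout_safe(rng):
    # Safe policy: deterministic reward of 1.0 per step over 10 steps.
    return sum(1.0 for _ in range(10))

def rollout_risky(rng):
    # Risky policy: each step pays 2.0 with probability 0.6, else 0.0.
    return sum(2.0 if rng.random() < 0.6 else 0.0 for _ in range(10))

def quantile(samples, tau):
    # Empirical tau-quantile (lower / inverse-CDF convention).
    s = sorted(samples)
    idx = max(0, min(len(s) - 1, int(tau * len(s))))
    return s[idx]

rng = random.Random(0)
safe = [rollout_safe(rng) for _ in range(10000)]
risky = [rollout_risky(rng) for _ in range(10000)]

# The risky policy wins on expectation (about 12 vs 10) but loses at the
# 5% quantile, which is what a quantile-based objective controls.
print(sum(risky) / len(risky), quantile(risky, 0.05), quantile(safe, 0.05))
```

Under an expectation criterion the risky policy looks strictly better; under a 5%-quantile criterion the safe policy does, which is the kind of tail-risk trade-off the QMDP framework is built to optimize.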
📝 Abstract
In this paper, we consider the problem of optimizing the quantiles of the cumulative rewards of a Markov Decision Process (MDP), which we refer to as a Quantile Markov Decision Process (QMDP). Traditionally, the goal of an MDP is to maximize the expected cumulative reward over a defined (possibly infinite) horizon. In many applications, however, a decision maker may be interested in optimizing a specific quantile of the cumulative reward instead of its expectation. We provide analytical results characterizing the optimal QMDP solution and present algorithms for solving the QMDP. We illustrate the model with two experiments: a grid game and an HIV optimal treatment experiment.
Problem

Research questions and friction points this paper is trying to address.

Optimizing quantiles of cumulative rewards in Markov decision processes
Developing dynamic programming algorithms for optimal QMDP policies
Applying quantile optimization to HIV treatment risk-benefit analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimizes quantiles of cumulative MDP rewards
Uses dynamic programming for optimal policy solution
Extends algorithm to CVaR objective in MDPs
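The last point notes an extension from the quantile objective to CVaR. A hedged sketch of how the two are related on sampled cumulative rewards (function names and data are illustrative, not from the paper):

```python
def var_cvar(samples, tau):
    # For rewards (higher is better), the empirical tau-quantile (VaR)
    # is the tau-th order statistic, and CVaR at level tau is the mean
    # of the worst tau-fraction of outcomes.
    s = sorted(samples)                # ascending: worst outcomes first
    k = max(1, int(tau * len(s)))      # number of tail samples
    var = s[k - 1]                     # empirical tau-quantile (VaR)
    cvar = sum(s[:k]) / k              # mean of the lower tail
    return var, cvar

returns = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
var, cvar = var_cvar(returns, 0.2)
print(var, cvar)  # -> 1.0 0.5
```

CVaR averages over the entire lower tail rather than picking a single order statistic, so it is sensitive to how bad the worst outcomes are, not just where the tail begins.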
Xiaocheng Li
Imperial College Business School, Imperial College London
Machine learning, operations research
Huaiyang Zhong
Assistant Professor, Virginia Tech
Margaret L. Brandeau
Department of Management Science and Engineering, Stanford University, Stanford, CA, 94305