🤖 AI Summary
Traditional Markov decision processes (MDPs) optimize the expected cumulative reward, which is insufficient for risk-sensitive decision-making where controlling tail-risk—e.g., guaranteeing high probability of achieving a minimum reward—is critical. Method: This paper introduces the Quantile MDP (QMDP) framework, the first systematic formulation for optimizing quantiles (e.g., 5%-ile) of the cumulative reward distribution. It establishes rigorous theoretical foundations, proving existence and characterizing the structure of optimal policies—which are inherently non-Markovian and stochastic. The proposed algorithm integrates ordinal optimization, extended dynamic programming, and quantile-specific recursive computation, combining policy iteration with Monte Carlo estimation for efficiency. Contribution/Results: Evaluated on grid-world navigation and HIV treatment simulation, QMDP significantly improves low-quantile (e.g., 5%) cumulative rewards compared to expectation-based baselines, yielding more robust and risk-controllable policies without compromising average performance.
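The core quantity the method optimizes, the low quantile (e.g., 5%) of a policy's cumulative reward, can be estimated by simple Monte Carlo rollouts. The sketch below is illustrative only and is not the paper's algorithm; `env_step` and `policy` are hypothetical callables standing in for an arbitrary MDP and policy:

```python
import numpy as np

def estimate_return_quantile(env_step, policy, horizon,
                             n_rollouts=2000, q=0.05, seed=0):
    """Monte Carlo estimate of the q-quantile of cumulative reward.

    env_step(state, action, rng) -> (next_state, reward)
    policy(state, t, rng)        -> action
    These interfaces are assumptions for illustration, not the paper's API.
    """
    rng = np.random.default_rng(seed)
    returns = np.empty(n_rollouts)
    for i in range(n_rollouts):
        state, total = 0, 0.0
        for t in range(horizon):
            action = policy(state, t, rng)
            state, reward = env_step(state, action, rng)
            total += reward
        returns[i] = total
    # Empirical q-quantile of the return distribution under this policy.
    return float(np.quantile(returns, q))
```

A risk-sensitive comparison of two policies would then rank them by this estimated quantile rather than by the mean return, which is what distinguishes the QMDP objective from the classical expected-reward objective.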
📝 Abstract
In this paper, we consider the problem of optimizing the quantiles of the cumulative rewards of a Markov Decision Process (MDP), which we refer to as a Quantile Markov Decision Process (QMDP). Traditionally, the goal of an MDP is to maximize the expected cumulative reward over a defined (possibly infinite) horizon. In many applications, however, a decision maker may be interested in optimizing a specific quantile of the cumulative reward instead of its expectation. We provide analytical results characterizing the optimal QMDP solution and present algorithms for solving the QMDP. We illustrate the model with two experiments: a grid game and an HIV optimal treatment experiment.