🤖 AI Summary
Traditional Markov decision processes (MDPs) optimize the expected cumulative reward, which is insufficient for risk-sensitive decision-making where controlling tail-risk—e.g., guaranteeing high probability of achieving a minimum reward—is critical. Method: This paper introduces the Quantile MDP (QMDP) framework, the first systematic formulation for optimizing quantiles (e.g., 5%-ile) of the cumulative reward distribution. It establishes rigorous theoretical foundations, proving existence and characterizing the structure of optimal policies—which are inherently non-Markovian and stochastic. The proposed algorithm integrates ordinal optimization, extended dynamic programming, and quantile-specific recursive computation, combining policy iteration with Monte Carlo estimation for efficiency. Contribution/Results: Evaluated on grid-world navigation and HIV treatment simulation, QMDP significantly improves low-quantile (e.g., 5%) cumulative rewards compared to expectation-based baselines, yielding more robust and risk-controllable policies without compromising average performance.
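The core quantity the method optimizes, the low quantile (e.g., 5%) of a policy's cumulative reward, can be estimated by simple Monte Carlo rollouts. The sketch below is illustrative only and is not the paper's algorithm; `env_step` and `policy` are hypothetical callables standing in for an arbitrary MDP and policy:

```python
import numpy as np

def estimate_return_quantile(env_step, policy, horizon,
                             n_rollouts=2000, q=0.05, seed=0):
    """Monte Carlo estimate of the q-quantile of cumulative reward.

    env_step(state, action, rng) -> (next_state, reward)
    policy(state, t, rng)        -> action
    These interfaces are assumptions for illustration, not the paper's API.
    """
    rng = np.random.default_rng(seed)
    returns = np.empty(n_rollouts)
    for i in range(n_rollouts):
        state, total = 0, 0.0
        for t in range(horizon):
            action = policy(state, t, rng)
            state, reward = env_step(state, action, rng)
            total += reward
        returns[i] = total
    # Empirical q-quantile of the return distribution under this policy.
    return float(np.quantile(returns, q))
```

A risk-sensitive comparison of two policies would then rank them by this estimated quantile rather than by the mean return, which is what distinguishes the QMDP objective from the classical expected-reward objective.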
📝 Abstract
In this paper, we consider the problem of optimizing the quantiles of the cumulative rewards of a Markov Decision Process (MDP), which we refer to as a Quantile Markov Decision Process (QMDP). Traditionally, the goal of an MDP is to maximize the expected cumulative reward over a defined (possibly infinite) horizon. In many applications, however, a decision maker may be interested in optimizing a specific quantile of the cumulative reward instead of its expectation. We provide analytical results characterizing the optimal QMDP solution and present algorithms for solving the QMDP. We illustrate the model with two experiments: a grid game and an HIV optimal treatment experiment.