🤖 AI Summary
This work addresses the absence of quantile-targeted optimization mechanisms in risk-sensitive reinforcement learning. We propose UCB-QRL, the first optimistic algorithm explicitly designed for quantile objectives in finite-horizon episodic MDPs. Methodologically, we introduce the optimism principle to quantile RL by constructing upper confidence bounds (UCBs) on the quantile value function over a confidence ball of transition models, integrating online transition estimation with dynamic programming. Theoretically, we establish a high-probability regret upper bound of $\mathcal{O}\big((2/\kappa)^{H+1} H \sqrt{SATH\log(2SATH/\delta)}\big)$, where $\kappa>0$ is a problem-dependent constant capturing the sensitivity of the MDP's quantile value. This is the first result to explicitly characterize how risk sensitivity governs sample complexity in quantile RL, thereby providing both a novel theoretical foundation and a practical algorithm for risk-aware decision-making.
📝 Abstract
Reinforcement Learning (RL) has achieved tremendous success in recent years. However, the classical foundations of RL do not account for the risk sensitivity of the objective function, which is critical in various fields, including healthcare and finance. A popular approach to incorporating risk sensitivity is to optimize a specific quantile of the cumulative reward distribution. In this paper, we develop UCB-QRL, an optimistic learning algorithm for the $\tau$-quantile objective in finite-horizon Markov decision processes (MDPs). UCB-QRL is an iterative algorithm in which, at each iteration, we first estimate the underlying transition probability and then optimize the quantile value function over a confidence ball around this estimate. We show that UCB-QRL yields a high-probability regret bound $\mathcal{O}\left((2/\kappa)^{H+1}H\sqrt{SATH\log(2SATH/\delta)}\right)$ in the episodic setting with $S$ states, $A$ actions, $T$ episodes, and horizon $H$. Here, $\kappa>0$ is a problem-dependent constant that captures the sensitivity of the underlying MDP's quantile value.
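The abstract's "optimize over a confidence ball around the estimate" step can be illustrated with the inner maximization commonly used in optimistic RL: given an empirical transition vector and an $L_1$ confidence radius, pick the distribution inside the ball that maximizes the next-step value. This is a minimal sketch of one ingredient only, not the paper's full quantile dynamic program; the function name `optimistic_transition`, the greedy mass-shifting scheme, and the radius value are illustrative assumptions.

```python
import numpy as np

def optimistic_transition(p_hat, radius, v):
    """Return the transition vector p inside the L1 confidence ball
    {p : ||p - p_hat||_1 <= radius} that (greedily) maximizes p @ v.

    Standard trick: move as much probability mass as the ball allows
    onto the highest-value next state, taking it from low-value states.
    """
    p = p_hat.copy()
    best = np.argmax(v)
    # Add up to radius/2 extra mass to the best next state (capped at 1).
    p[best] = min(1.0, p_hat[best] + radius / 2)
    # Remove the resulting excess mass from the lowest-value states first.
    excess = p.sum() - 1.0
    for s in np.argsort(v):
        if excess <= 0:
            break
        if s == best:
            continue
        take = min(p[s], excess)
        p[s] -= take
        excess -= take
    return p

# Toy check: two states, the second is more valuable.
p_hat = np.array([0.5, 0.5])
v = np.array([0.0, 1.0])
p_opt = optimistic_transition(p_hat, 0.4, v)  # shifts mass toward state 1
```

Inside UCB-QRL this kind of optimistic choice would be applied at every stage of the backward recursion on the quantile value function, after the empirical model is refreshed from episode counts.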