Optimistic Reinforcement Learning with Quantile Objectives

📅 2025-11-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the absence of quantile-targeted optimization mechanisms in risk-sensitive reinforcement learning. We propose UCB-QRL, the first optimistic algorithm explicitly designed for quantile objectives in finite-horizon episodic MDPs. Methodologically, we introduce the optimism principle to quantile RL by constructing upper confidence bounds (UCBs) on the quantile value function over a confidence ball of transition models, integrating online transition estimation with dynamic programming. Theoretically, we establish a high-probability regret upper bound of $\mathcal{O}\big((2/\kappa)^{H+1} H \sqrt{SATH\log(2SATH/\delta)}\big)$, where $\kappa$ is the risk-level parameter. This is the first result to explicitly characterize how risk sensitivity governs sample complexity in quantile RL, thereby providing both a novel theoretical foundation and a practical algorithm for risk-aware decision-making.

📝 Abstract
Reinforcement Learning (RL) has achieved tremendous success in recent years. However, the classical foundations of RL do not account for the risk sensitivity of the objective function, which is critical in various fields, including healthcare and finance. A popular approach to incorporate risk sensitivity is to optimize a specific quantile of the cumulative reward distribution. In this paper, we develop UCB-QRL, an optimistic learning algorithm for the $\tau$-quantile objective in finite-horizon Markov decision processes (MDPs). UCB-QRL is an iterative algorithm in which, at each iteration, we first estimate the underlying transition probability and then optimize the quantile value function over a confidence ball around this estimate. We show that UCB-QRL yields a high-probability regret bound $\mathcal{O}\left((2/\kappa)^{H+1}H\sqrt{SATH\log(2SATH/\delta)}\right)$ in the episodic setting with $S$ states, $A$ actions, $T$ episodes, and horizon $H$. Here, $\kappa>0$ is a problem-dependent constant that captures the sensitivity of the underlying MDP's quantile value.
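The iterative scheme the abstract describes — estimate the transition model from data, then be optimistic over a confidence ball around that estimate — can be sketched as follows. This is a minimal illustrative skeleton, not the authors' implementation: the function name, the Hoeffding-style radius, and all constants are assumptions, and the quantile dynamic program at the core of UCB-QRL is omitted.

```python
import numpy as np

# Illustrative skeleton (hypothetical, not the paper's code): maintain
# visit counts per (s, a, s'), form the empirical transition model, and
# attach a Hoeffding-style radius that defines the confidence ball over
# which UCB-QRL would run its optimistic quantile dynamic program.

S, A, H, T, delta = 3, 2, 4, 100, 0.05
rng = np.random.default_rng(0)

counts = np.zeros((S, A, S))  # N(s, a, s') visit counts

def transition_estimate_and_radius(counts, delta):
    """Empirical transitions P_hat and a per-(s, a) confidence radius
    (assumed Hoeffding-style form; the paper's exact radius may differ)."""
    n_sa = counts.sum(axis=-1)                          # N(s, a)
    p_hat = counts / np.maximum(n_sa, 1)[..., None]
    p_hat[n_sa == 0] = 1.0 / counts.shape[-1]           # uniform fallback
    radius = np.sqrt(2 * np.log(2 * S * A * T * H / delta)
                     / np.maximum(n_sa, 1))
    return p_hat, radius

# Simulate some visitation data in place of real episode rollouts.
for _ in range(200):
    s, a = rng.integers(S), rng.integers(A)
    counts[s, a, rng.integers(S)] += 1

p_hat, radius = transition_estimate_and_radius(counts, delta)
assert np.allclose(p_hat.sum(axis=-1), 1.0)  # each row is a distribution
```

The radius shrinks at the usual $1/\sqrt{N(s,a)}$ rate, which is what drives the $\sqrt{T}$ dependence in the regret bound above; the $(2/\kappa)^{H+1}$ factor comes from the quantile-specific analysis and is not visible in this sketch.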
Problem

Research questions and friction points this paper is trying to address.

Optimistic algorithm for quantile objectives in reinforcement learning
Addresses risk sensitivity in Markov decision processes
Provides high-probability regret bound for episodic settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimistic algorithm for quantile objective optimization
Estimates transitions and optimizes over confidence ball
Provides high-probability regret bound for MDPs
Mohammad Alipour-Vaezi
Grado Department of Industrial & Systems Engineering, Virginia Tech, Blacksburg, VA 24061, USA
Huaiyang Zhong
Assistant Professor, Virginia Tech
Kwok-Leung Tsui
Department of Industrial, Manufacturing, and Systems Engineering, University of Texas at Arlington, Arlington, TX 76019, USA
S. Khodadadian
Grado Department of Industrial & Systems Engineering, Virginia Tech, Blacksburg, VA 24061, USA