🤖 AI Summary
This work addresses the absence of quantile-targeted optimization mechanisms in risk-sensitive reinforcement learning. We propose UCB-QRL, the first optimistic algorithm explicitly designed for quantile objectives in finite-horizon episodic MDPs. Methodologically, we introduce the optimism principle to quantile RL by constructing upper confidence bounds (UCBs) on the quantile value function over a confidence ball of transition models, integrating online transition estimation with dynamic programming. Theoretically, we establish a high-probability regret upper bound of $\mathcal{O}\big((2/\kappa)^{H+1} H \sqrt{SATH\log(2SATH/\delta)}\big)$, where $\kappa>0$ is a problem-dependent constant capturing the sensitivity of the MDP's quantile value. This is the first result to explicitly characterize how risk sensitivity governs sample complexity in quantile RL, thereby providing both a novel theoretical foundation and a practical algorithm for risk-aware decision-making.
📝 Abstract
Reinforcement Learning (RL) has achieved tremendous success in recent years. However, the classical foundations of RL do not account for the risk sensitivity of the objective function, which is critical in various fields, including healthcare and finance. A popular approach to incorporating risk sensitivity is to optimize a specific quantile of the cumulative reward distribution. In this paper, we develop UCB-QRL, an optimistic learning algorithm for the $\tau$-quantile objective in finite-horizon Markov decision processes (MDPs). UCB-QRL is an iterative algorithm in which, at each iteration, we first estimate the underlying transition probability and then optimize the quantile value function over a confidence ball around this estimate. We show that UCB-QRL yields a high-probability regret bound $\mathcal{O}\left((2/\kappa)^{H+1}H\sqrt{SATH\log(2SATH/\delta)}\right)$ in the episodic setting with $S$ states, $A$ actions, $T$ episodes, and horizon $H$. Here, $\kappa>0$ is a problem-dependent constant that captures the sensitivity of the underlying MDP's quantile value.
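The abstract's "optimize over a confidence ball around the estimate" step can be illustrated with the inner maximization commonly used in optimistic RL: given an empirical transition vector and an $L_1$ confidence radius, pick the distribution inside the ball that maximizes the next-step value. This is a minimal sketch of one ingredient only, not the paper's full quantile dynamic program; the function name `optimistic_transition`, the greedy mass-shifting scheme, and the radius value are illustrative assumptions.

```python
import numpy as np

def optimistic_transition(p_hat, radius, v):
    """Return the transition vector p inside the L1 confidence ball
    {p : ||p - p_hat||_1 <= radius} that (greedily) maximizes p @ v.

    Standard trick: move as much probability mass as the ball allows
    onto the highest-value next state, taking it from low-value states.
    """
    p = p_hat.copy()
    best = np.argmax(v)
    # Add up to radius/2 extra mass to the best next state (capped at 1).
    p[best] = min(1.0, p_hat[best] + radius / 2)
    # Remove the resulting excess mass from the lowest-value states first.
    excess = p.sum() - 1.0
    for s in np.argsort(v):
        if excess <= 0:
            break
        if s == best:
            continue
        take = min(p[s], excess)
        p[s] -= take
        excess -= take
    return p

# Toy check: two states, the second is more valuable.
p_hat = np.array([0.5, 0.5])
v = np.array([0.0, 1.0])
p_opt = optimistic_transition(p_hat, 0.4, v)  # shifts mass toward state 1
```

Inside UCB-QRL this kind of optimistic choice would be applied at every stage of the backward recursion on the quantile value function, after the empirical model is refreshed from episode counts.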