🤖 AI Summary
This work investigates the **full tail distribution** of the cumulative regret $R_K$ for optimistic reinforcement learning (e.g., UCBVI) in finite-horizon tabular MDPs with unknown transition dynamics, going beyond standard expected-regret or single-point high-probability bounds. We propose two exploration-bonus schedules: a global schedule that depends on the total episode count $K$, and a local schedule that depends only on the current episode index. Coupled with a refined probabilistic analysis, these yield what is, to the authors' knowledge, the first **instance-dependent, two-phase tail bound** for a standard optimistic algorithm: sub-Gaussian decay for moderate deviations $x$, transitioning to sub-Weibull decay for large $x$, revealing an intrinsic piecewise structure in the regret distribution. We further derive corresponding instance-dependent expected-regret bounds and introduce a tunable parameter $\alpha$ that explicitly trades off expected regret against the range over which the tail remains sub-Gaussian. These results are among the most fine-grained distributional guarantees available for optimistic RL regret.
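Schematically, the two-phase bound has the shape sketched below. The constants $c_1$, $c_2$, the transition threshold $x_K^{*}$, and the sub-Weibull exponent $\theta$ are illustrative placeholders standing in for the paper's instance-dependent quantities, not its exact expressions:

```latex
% Schematic two-phase tail bound (standalone; compile with pdflatex).
% c_1, c_2, the threshold x_K^*, and the exponent theta are
% illustrative placeholders, not the paper's exact quantities.
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
\[
\Pr\bigl(R_K \ge m_K + x\bigr) \;\le\;
\begin{cases}
  \exp\bigl(-c_1 x^2\bigr), & 0 \le x \le x_K^{*}
    \quad \text{(sub-Gaussian regime)},\\[4pt]
  \exp\bigl(-c_2\, x^{\theta}\bigr), & x > x_K^{*}
    \quad \text{(sub-Weibull regime, } 0 < \theta < 2\text{)}.
\end{cases}
\]
\end{document}
```

The piecewise structure is the key qualitative message: near its typical scale $m_K$ the regret concentrates as fast as a Gaussian, while extreme deviations decay at the slower sub-Weibull rate.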
📝 Abstract
We derive instance-dependent tail bounds for the regret of optimism-based reinforcement learning in finite-horizon tabular Markov decision processes with unknown transition dynamics. Focusing on a UCBVI-type algorithm, we characterize the tail distribution of the cumulative regret $R_K$ over $K$ episodes, rather than only its expectation or a single high-probability quantile. We analyze two natural exploration-bonus schedules: (i) a $K$-dependent scheme that explicitly incorporates the total number of episodes $K$, and (ii) a $K$-independent scheme that depends only on the current episode index. For both settings, we obtain an upper bound on $\Pr(R_K \ge x)$ that exhibits a distinctive two-regime structure: a sub-Gaussian tail starting from an instance-dependent scale $m_K$ up to a transition threshold, followed by a sub-Weibull tail beyond that point. We further derive corresponding instance-dependent bounds on the expected regret $\mathbb{E}[R_K]$. The proposed algorithm depends on a tuning parameter $\alpha$, which balances the expected regret and the range over which the regret exhibits a sub-Gaussian tail. To the best of our knowledge, our results provide one of the first comprehensive tail-regret guarantees for a standard optimistic algorithm in episodic reinforcement learning.
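To make the contrast between the two schedules concrete, here is a minimal Python sketch using a Hoeffding-style $1/\sqrt{n}$ bonus of the kind common in UCBVI-type analyses. The function names, the exact bonus form, and the way $\alpha$ enters the confidence width are assumptions made for illustration, not the paper's exact construction:

```python
import math


def bonus_k_dependent(n_visits: int, K: int, H: int, alpha: float) -> float:
    """Global, K-dependent schedule: the bonus is scaled by log(K),
    so the total number of episodes K must be known in advance.
    Hypothetical Hoeffding-style form; alpha widens or narrows the
    confidence radius, trading expected regret against the range of
    sub-Gaussian tail behavior."""
    n = max(n_visits, 1)  # avoid division by zero before the first visit
    return H * math.sqrt(alpha * math.log(K) / n)


def bonus_k_independent(n_visits: int, k: int, H: int, alpha: float) -> float:
    """Local, K-independent schedule: the bonus depends only on the
    current episode index k, so the algorithm is anytime and never
    needs K up front. Same hypothetical Hoeffding-style form."""
    n = max(n_visits, 1)
    return H * math.sqrt(alpha * math.log(max(k, 2)) / n)


# Example: bonuses for a state-action pair visited 50 times, with
# horizon H = 10 and alpha = 2.0, under each schedule.
print(bonus_k_dependent(n_visits=50, K=10_000, H=10, alpha=2.0))
print(bonus_k_independent(n_visits=50, k=300, H=10, alpha=2.0))
```

In this sketch the global schedule pays a uniformly larger bonus (via $\log K$) in exchange for a confidence level fixed across all episodes, while the local schedule starts with smaller bonuses that grow with $k$; larger $\alpha$ inflates both, which matches the paper's described trade-off between expected regret and the extent of the sub-Gaussian tail region.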