🤖 AI Summary
This work investigates the **full tail distribution** of the cumulative regret $R_K$ for optimistic reinforcement learning (e.g., UCBVI) in finite-horizon tabular MDPs with unknown transition dynamics, going beyond standard expected-regret or single-point high-probability bounds. We propose two exploration-bonus schedules: a global schedule that depends on the total episode count $K$, and a local schedule that depends only on the current episode index. Coupled with a refined probabilistic analysis, these yield what is, to the authors' knowledge, the first **instance-dependent, two-phase tail bound** for a standard optimistic algorithm: sub-Gaussian decay for moderate deviations $x$, transitioning to sub-Weibull decay for large $x$, revealing an intrinsic piecewise structure in the regret distribution. We further derive corresponding instance-dependent expected-regret bounds and introduce a tunable parameter $\alpha$ that explicitly trades off expected regret against the range over which the tail remains sub-Gaussian. These results are among the most fine-grained distributional guarantees available for optimistic RL regret.
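Schematically, the two-phase bound has the shape sketched below. The constants $c_1$, $c_2$, the transition threshold $x_K^{*}$, and the sub-Weibull exponent $\theta$ are illustrative placeholders standing in for the paper's instance-dependent quantities, not its exact expressions:

```latex
% Schematic two-phase tail bound (standalone; compile with pdflatex).
% c_1, c_2, the threshold x_K^*, and the exponent theta are
% illustrative placeholders, not the paper's exact quantities.
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
\[
\Pr\bigl(R_K \ge m_K + x\bigr) \;\le\;
\begin{cases}
  \exp\bigl(-c_1 x^2\bigr), & 0 \le x \le x_K^{*}
    \quad \text{(sub-Gaussian regime)},\\[4pt]
  \exp\bigl(-c_2\, x^{\theta}\bigr), & x > x_K^{*}
    \quad \text{(sub-Weibull regime, } 0 < \theta < 2\text{)}.
\end{cases}
\]
\end{document}
```

The piecewise structure is the key qualitative message: near its typical scale $m_K$ the regret concentrates as fast as a Gaussian, while extreme deviations decay at the slower sub-Weibull rate.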
📝 Abstract
We derive instance-dependent tail bounds for the regret of optimism-based reinforcement learning in finite-horizon tabular Markov decision processes with unknown transition dynamics. Focusing on a UCBVI-type algorithm, we characterize the tail distribution of the cumulative regret $R_K$ over $K$ episodes, rather than only its expectation or a single high-probability quantile. We analyze two natural exploration-bonus schedules: (i) a $K$-dependent scheme that explicitly incorporates the total number of episodes $K$, and (ii) a $K$-independent scheme that depends only on the current episode index. For both settings, we obtain an upper bound on $\Pr(R_K \ge x)$ that exhibits a distinctive two-regime structure: a sub-Gaussian tail starting from an instance-dependent scale $m_K$ up to a transition threshold, followed by a sub-Weibull tail beyond that point. We further derive corresponding instance-dependent bounds on the expected regret $\mathbb{E}[R_K]$. The proposed algorithm depends on a tuning parameter $\alpha$, which balances the expected regret and the range over which the regret exhibits a sub-Gaussian tail. To the best of our knowledge, our results provide one of the first comprehensive tail-regret guarantees for a standard optimistic algorithm in episodic reinforcement learning.
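To make the contrast between the two schedules concrete, here is a minimal Python sketch using a Hoeffding-style $1/\sqrt{n}$ bonus of the kind common in UCBVI-type analyses. The function names, the exact bonus form, and the way $\alpha$ enters the confidence width are assumptions made for illustration, not the paper's exact construction:

```python
import math


def bonus_k_dependent(n_visits: int, K: int, H: int, alpha: float) -> float:
    """Global, K-dependent schedule: the bonus is scaled by log(K),
    so the total number of episodes K must be known in advance.
    Hypothetical Hoeffding-style form; alpha widens or narrows the
    confidence radius, trading expected regret against the range of
    sub-Gaussian tail behavior."""
    n = max(n_visits, 1)  # avoid division by zero before the first visit
    return H * math.sqrt(alpha * math.log(K) / n)


def bonus_k_independent(n_visits: int, k: int, H: int, alpha: float) -> float:
    """Local, K-independent schedule: the bonus depends only on the
    current episode index k, so the algorithm is anytime and never
    needs K up front. Same hypothetical Hoeffding-style form."""
    n = max(n_visits, 1)
    return H * math.sqrt(alpha * math.log(max(k, 2)) / n)


# Example: bonuses for a state-action pair visited 50 times, with
# horizon H = 10 and alpha = 2.0, under each schedule.
print(bonus_k_dependent(n_visits=50, K=10_000, H=10, alpha=2.0))
print(bonus_k_independent(n_visits=50, k=300, H=10, alpha=2.0))
```

In this sketch the global schedule pays a uniformly larger bonus (via $\log K$) in exchange for a confidence level fixed across all episodes, while the local schedule starts with smaller bonuses that grow with $k$; larger $\alpha$ inflates both, which matches the paper's described trade-off between expected regret and the extent of the sub-Gaussian tail region.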