Sharp Gap-Dependent Variance-Aware Regret Bounds for Tabular MDPs

📅 2025-06-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper investigates gap-dependent regret bounds for finite-horizon tabular Markov decision processes (MDPs), focusing on how the suboptimality gaps $\Delta$ at state-action pairs and the conditional variances affect learning difficulty. To capture environmental stochasticity, it introduces the *maximum conditional total variance* $\mathrm{Var}_{\max}^{c}$ and establishes a lower bound incorporating this quantity, which shows that the dependence is necessary. It analyzes the *Monotonic Value Propagation* (MVP) algorithm through a weighted suboptimality-gap framework with explicit conditional-variance modeling to obtain variance-aware regret control. The analysis yields an upper bound whose leading term is $\tilde{O}\big(\sum_{\Delta > 0} \frac{H^2 \log K \wedge \mathrm{Var}_{\max}^{c}}{\Delta}\big)$, together with a lower bound of $\Omega\big(\sum_{\Delta > 0} \frac{H^2 \wedge \mathrm{Var}_{\max}^{c}}{\Delta} \cdot \log K\big)$. These results refine existing gap-dependent regret analyses by explicitly characterizing the interplay between horizon, variance, and suboptimality structure.

📝 Abstract

We consider gap-dependent regret bounds for episodic MDPs. We show that the Monotonic Value Propagation (MVP) algorithm achieves a variance-aware gap-dependent regret bound of $$\tilde{O}\left(\left(\sum_{\Delta_h(s,a)>0} \frac{H^2 \log K \land \mathtt{Var}_{\max}^{\text{c}}}{\Delta_h(s,a)} + \sum_{\Delta_h(s,a)=0}\frac{H^2 \land \mathtt{Var}_{\max}^{\text{c}}}{\Delta_{\mathrm{min}}} + SAH^4 (S \lor H)\right) \log K\right),$$ where $H$ is the planning horizon, $S$ is the number of states, $A$ is the number of actions, and $K$ is the number of episodes. Here, $\Delta_h(s,a) = V_h^*(s) - Q_h^*(s,a)$ represents the suboptimality gap and $\Delta_{\mathrm{min}} := \min_{\Delta_h(s,a)>0} \Delta_h(s,a)$. The term $\mathtt{Var}_{\max}^{\text{c}}$ denotes the maximum conditional total variance, calculated as the maximum over all $(\pi, h, s)$ tuples of the expected total variance under policy $\pi$ conditioned on trajectories visiting state $s$ at step $h$. $\mathtt{Var}_{\max}^{\text{c}}$ characterizes the maximum randomness encountered when learning any $(h, s)$ pair. Our result stems from a novel analysis of the weighted sum of the suboptimality gaps and can potentially be adapted to other algorithms. To complement the study, we establish a lower bound of $$\Omega\left(\sum_{\Delta_h(s,a)>0} \frac{H^2 \land \mathtt{Var}_{\max}^{\text{c}}}{\Delta_h(s,a)} \cdot \log K\right),$$ demonstrating the necessity of the dependence on $\mathtt{Var}_{\max}^{\text{c}}$ even when the maximum unconditional total variance (without conditioning on $(h, s)$) approaches zero.
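
To make the gap quantities concrete, here is a minimal sketch of how $\Delta_h(s,a) = V_h^*(s) - Q_h^*(s,a)$ and $\Delta_{\mathrm{min}}$ could be computed for a known tabular MDP by backward induction. This is not the MVP algorithm, which has to learn from samples; the function names, the array layout (`P[h, s, a, s']`, `R[h, s, a]`), and the two-state toy MDP are all assumptions made for illustration.

```python
import math
import numpy as np

def optimal_values(P, R):
    """Backward induction for a known finite-horizon tabular MDP.

    P[h, s, a, s'] holds transition probabilities and R[h, s, a] expected
    rewards (hypothetical layout). Returns Q* of shape (H, S, A) and
    V* of shape (H + 1, S), with V*[H] = 0.
    """
    H, S, A, _ = P.shape
    V = np.zeros((H + 1, S))
    Q = np.zeros((H, S, A))
    for h in range(H - 1, -1, -1):
        Q[h] = R[h] + P[h] @ V[h + 1]   # expected reward plus expected next value
        V[h] = Q[h].max(axis=1)         # act greedily at step h
    return Q, V

def suboptimality_gaps(Q, V):
    """Delta_h(s, a) = V*_h(s) - Q*_h(s, a), shape (H, S, A)."""
    return V[:-1, :, None] - Q

def min_positive_gap(gaps, tol=1e-12):
    """Delta_min: the smallest strictly positive gap (inf if there is none)."""
    positive = gaps[gaps > tol]
    return float(positive.min()) if positive.size else math.inf

# Toy MDP (illustrative only): H = 2 steps, 2 states, 2 actions.
H, S, A = 2, 2, 2
P = np.zeros((H, S, A, S))
R = np.zeros((H, S, A))
for h in range(H):
    P[h, 0, 0, 0] = 1.0   # state 0, action 0: stay in state 0
    P[h, 0, 1, 1] = 1.0   # state 0, action 1: move to state 1
    P[h, 1, :, 1] = 1.0   # state 1 is absorbing
    R[h, 0, 0] = 1.0
    R[h, 1, 0] = 0.5

Q, V = optimal_values(P, R)
gaps = suboptimality_gaps(Q, V)
delta_min = min_positive_gap(gaps)
```

In this toy instance the optimal action in every state has gap zero, and every other pair $(h, s, a)$ contributes a $1/\Delta_h(s,a)$ term to the first sum in the bound. The pairs with $\Delta_h(s,a) = 0$ are the ones charged $1/\Delta_{\mathrm{min}}$ in the second sum.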

Problem

Research questions and friction points this paper is trying to address.

Analyzes gap-dependent regret bounds for episodic MDPs

Proves that the MVP algorithm achieves a variance-aware gap-dependent regret bound

Establishes a lower bound showing that dependence on the maximum conditional total variance is necessary

Innovation

Methods, ideas, or system contributions that make the work stand out.

Monotonic Value Propagation algorithm

Variance-aware gap-dependent regret

Novel weighted suboptimality gap analysis

🔎 Similar Papers

Gap-Dependent Bounds for Q-Learning using Reference-Advantage Decomposition