Optimal Variance-Dependent Regret Bounds for Infinite-Horizon MDPs

📅 2026-03-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high burn-in cost and limited instance adaptivity of online reinforcement learning in infinite-horizon Markov decision processes (MDPs). The authors propose a unified UCB-type algorithm that, for the first time, handles both the average-reward and γ-discounted regret objectives simultaneously while achieving optimal variance-dependent regret bounds. The analysis precisely characterizes the dependence on the optimal bias span in both the leading and lower-order terms: the leading term scales as Õ(√(SA·Var)), which is minimax optimal in the worst case and yields nearly constant regret in deterministic MDPs; the lower-order term is optimal when the bias span is known in advance and nearly matches the corresponding lower bound when it is not, revealing the critical role of prior knowledge in achievable performance.
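To make the UCB-style idea concrete, the sketch below shows how a variance-dependent (Bernstein-style) exploration bonus typically enters optimistic value iteration for a tabular discounted MDP. This is a generic illustration under standard assumptions, not the paper's actual algorithm; the function names and bonus constants are hypothetical.

```python
import numpy as np

def bernstein_bonus(var, n, span, delta=0.05):
    # Variance-dependent (Bernstein-style) bonus: the leading term scales
    # as sqrt(Var/n), the lower-order term as span/n -- mirroring the
    # leading/lower-order split discussed above.
    log_term = np.log(1.0 / delta)
    n = np.maximum(n, 1)
    return np.sqrt(2.0 * var * log_term / n) + span * log_term / n

def optimistic_q(P_hat, R, counts, span, gamma=0.9, iters=500):
    # Optimistic value iteration on an empirical model.
    # P_hat: empirical transition kernel, shape (S, A, S)
    # R: reward table, shape (S, A); counts: visit counts, shape (S, A)
    S, A, _ = P_hat.shape
    V = np.zeros(S)
    for _ in range(iters):
        EV = P_hat @ V                           # (S, A) expected next value
        var = np.maximum(P_hat @ (V ** 2) - EV ** 2, 0.0)  # Var of V under P_hat
        bonus = bernstein_bonus(var, counts, span)
        Q = R + gamma * (EV + bonus)             # optimistic Bellman backup
        V = np.clip(Q.max(axis=1), 0.0, span)    # keep values within the span
    return Q
```

An agent would act greedily with respect to the returned optimistic `Q`; the bonus shrinks with visit counts, so exploration concentrates where the empirical variance of the value estimate is still large.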

📝 Abstract
Online reinforcement learning in infinite-horizon Markov decision processes (MDPs) remains less theoretically and algorithmically developed than its episodic counterpart, with many algorithms suffering from high "burn-in" costs and failing to adapt to benign instance-specific complexity. In this work, we address these shortcomings for two infinite-horizon objectives: the classical average-reward regret and the $γ$-regret. We develop a single tractable UCB-style algorithm applicable to both settings, which achieves the first optimal variance-dependent regret guarantees. Our regret bounds in both settings take the form $\tilde{O}( \sqrt{SA\,\text{Var}} + \text{lower-order terms})$, where $S,A$ are the state and action space sizes, and $\text{Var}$ captures cumulative transition variance. This implies minimax-optimal average-reward and $γ$-regret bounds in the worst case but also adapts to easier problem instances, for example yielding nearly constant regret in deterministic MDPs. Furthermore, our algorithm enjoys significantly improved lower-order terms for the average-reward setting. With prior knowledge of the optimal bias span $\Vert h^\star\Vert_\text{sp}$, our algorithm obtains lower-order terms scaling as $\Vert h^\star\Vert_\text{sp} S^2 A$, which we prove is optimal in both $\Vert h^\star\Vert_\text{sp}$ and $A$. Without prior knowledge, we prove that no algorithm can have lower-order terms smaller than $\Vert h^\star \Vert_\text{sp}^2 S A$, and we provide a prior-free algorithm whose lower-order terms scale as $\Vert h^\star\Vert_\text{sp}^2 S^3 A$, nearly matching this lower bound. Taken together, these results completely characterize the optimal dependence on $\Vert h^\star\Vert_\text{sp}$ in both leading and lower-order terms, and reveal a fundamental gap in what is achievable with and without prior knowledge.
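The bounds stated in the abstract can be collected side by side (a restatement only, using the abstract's own symbols; $\text{Var}$ denotes the cumulative transition variance):

```latex
\[
\text{Regret} \;=\;
\tilde{O}\!\Big(\sqrt{SA\,\text{Var}}\Big)
\;+\;
\begin{cases}
\tilde{O}\!\big(\Vert h^\star\Vert_\text{sp}\, S^2 A\big) & \text{span known (optimal in } \Vert h^\star\Vert_\text{sp}, A\text{)},\\[2pt]
\tilde{O}\!\big(\Vert h^\star\Vert_\text{sp}^2\, S^3 A\big) & \text{prior-free},
\end{cases}
\]
\[
\text{with the prior-free lower bound } \;\Omega\!\big(\Vert h^\star\Vert_\text{sp}^2\, S A\big)
\text{ on the lower-order terms.}
\]
```

The gap between the two cases in the lower-order terms is exactly the prior-knowledge gap the abstract highlights.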
Problem

Research questions and friction points this paper is trying to address.

infinite-horizon MDPs
regret bounds
variance-dependent
average-reward
γ-regret
Innovation

Methods, ideas, or system contributions that make the work stand out.

variance-dependent regret
infinite-horizon MDPs
average-reward regret
bias span
instance-adaptive reinforcement learning