Data- and Variance-dependent Regret Bounds for Online Tabular MDPs

📅 2026-02-02
🤖 AI Summary
This work proposes a "best-of-both-worlds" algorithm for online tabular Markov Decision Processes (MDPs) with known transition probabilities, capable of adapting simultaneously to adversarial and stochastic environments. The approach introduces data-dependent complexity measures, such as second-order variation and path length, and incorporates a variance-aware mechanism within a novel Q-function estimator built on an optimistic Follow-the-Regularized-Leader framework with log-barrier regularization. By combining global optimization and policy optimization techniques, the algorithm achieves first-order, second-order, and path-length-dependent regret bounds in adversarial settings. In stochastic environments, it attains gap-dependent and gap-independent regret bounds that scale with the variance, with the former exhibiting only polylogarithmic dependence on the number of episodes. Information-theoretic lower bounds show that the proposed algorithm is nearly optimal across multiple regimes.

📝 Abstract
This work studies online episodic tabular Markov decision processes (MDPs) with known transitions and develops best-of-both-worlds algorithms that achieve refined data-dependent regret bounds in the adversarial regime and variance-dependent regret bounds in the stochastic regime. We quantify MDP complexity using a first-order quantity and several new data-dependent measures for the adversarial regime, including a second-order quantity and a path-length measure, as well as variance-based measures for the stochastic regime. To adapt to these measures, we develop algorithms based on global optimization and policy optimization, both built on optimistic follow-the-regularized-leader with log-barrier regularization. For global optimization, our algorithms achieve first-order, second-order, and path-length regret bounds in the adversarial regime, and in the stochastic regime, they achieve a variance-aware gap-independent bound and a variance-aware gap-dependent bound that is polylogarithmic in the number of episodes. For policy optimization, our algorithms achieve the same data- and variance-dependent adaptivity, up to a factor of the episode horizon, by exploiting a new optimistic $Q$-function estimator. Finally, we establish regret lower bounds in terms of data-dependent complexity measures for the adversarial regime and a variance measure for the stochastic regime, implying that the regret upper bounds achieved by the global-optimization approach are nearly optimal.
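To make the core algorithmic ingredient concrete, the following is a minimal sketch of one update step of optimistic FTRL with log-barrier regularization on a probability simplex (e.g., over actions or deterministic policies). This is an illustrative simplification, not the paper's full MDP algorithm: the names `log_barrier_ftrl`, `cum_loss`, and `optimistic_pred` are ours, and the fixed learning rate `eta` stands in for the paper's adaptive tuning. The update minimizes the sum of (predicted) cumulative losses plus the barrier `-(1/eta) * sum_i log p_i`; the KKT conditions give `p_i = 1 / (eta * (z_i + lam))`, where `lam` is a normalizing multiplier found by bisection.

```python
import numpy as np

def log_barrier_ftrl(cum_loss, optimistic_pred, eta=0.5, iters=100):
    """One optimistic-FTRL step with log-barrier regularization.

    Solves  argmin_p  <L + m, p> - (1/eta) * sum_i log p_i
    over the simplex, where L = cum_loss and m = optimistic_pred.
    The stationarity condition yields p_i = 1 / (eta * (z_i + lam)),
    with lam chosen so the probabilities sum to one.
    """
    z = np.asarray(cum_loss, dtype=float) + np.asarray(optimistic_pred, dtype=float)
    n = len(z)
    # Feasibility requires lam > -min(z); sum_i p_i decreases in lam,
    # so bisect between a lower bound (sum > 1) and upper bound (sum <= 1).
    lo = -z.min() + 1e-12
    hi = n / eta - z.min()
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        if np.sum(1.0 / (eta * (z + lam))) > 1.0:
            lo = lam
        else:
            hi = lam
    p = 1.0 / (eta * (z + lam))
    return p / p.sum()  # renormalize away residual bisection error
```

With zero losses the update returns the uniform distribution, and arms with smaller (predicted) cumulative loss receive more mass; the log-barrier keeps every coordinate strictly positive, which is what drives the data-dependent bounds in the FTRL analysis.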
Problem

Research questions and friction points this paper is trying to address.

online tabular MDPs
regret bounds
data-dependent
variance-dependent
best-of-both-worlds
Innovation

Methods, ideas, or system contributions that make the work stand out.

best-of-both-worlds
data-dependent regret
variance-dependent regret
optimistic FTRL
log-barrier regularization
Mingyi Li
The University of Tokyo
Taira Tsuchiya
The University of Tokyo and RIKEN
Kenji Yamanishi
The University of Tokyo
data mining · data science · learning theory · information theory