Data- and Variance-dependent Regret Bounds for Online Tabular MDPs

📅 2026-02-02
🤖 AI Summary
This work proposes a "best-of-both-worlds" algorithm for online tabular Markov Decision Processes (MDPs) with known transition probabilities, capable of adapting simultaneously to adversarial and stochastic environments. The approach introduces data-dependent complexity measures, such as second-order variation and path length, and incorporates a variance-aware mechanism within a novel Q-function estimator built on an optimistic Follow-the-Regularized-Leader framework with log-barrier regularization. By combining global optimization and policy optimization techniques, the algorithm achieves first-order, second-order, and path-length-dependent regret bounds in adversarial settings. In stochastic environments, it attains gap-dependent and gap-independent regret bounds that scale with the variance, with the former exhibiting only polylogarithmic dependence on the number of episodes. Information-theoretic lower bounds show that the proposed algorithm is nearly optimal across multiple regimes.

📝 Abstract
This work studies online episodic tabular Markov decision processes (MDPs) with known transitions and develops best-of-both-worlds algorithms that achieve refined data-dependent regret bounds in the adversarial regime and variance-dependent regret bounds in the stochastic regime. We quantify MDP complexity using a first-order quantity and several new data-dependent measures for the adversarial regime, including a second-order quantity and a path-length measure, as well as variance-based measures for the stochastic regime. To adapt to these measures, we develop algorithms based on global optimization and policy optimization, both built on optimistic follow-the-regularized-leader with log-barrier regularization. For global optimization, our algorithms achieve first-order, second-order, and path-length regret bounds in the adversarial regime, and in the stochastic regime, they achieve a variance-aware gap-independent bound and a variance-aware gap-dependent bound that is polylogarithmic in the number of episodes. For policy optimization, our algorithms achieve the same data- and variance-dependent adaptivity, up to a factor of the episode horizon, by exploiting a new optimistic $Q$-function estimator. Finally, we establish regret lower bounds in terms of data-dependent complexity measures for the adversarial regime and a variance measure for the stochastic regime, implying that the regret upper bounds achieved by the global-optimization approach are nearly optimal.
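To make the core algorithmic ingredient concrete, the following is a minimal sketch of one update step of optimistic FTRL with log-barrier regularization on a probability simplex (e.g., over actions or deterministic policies). This is an illustrative simplification, not the paper's full MDP algorithm: the names `log_barrier_ftrl`, `cum_loss`, and `optimistic_pred` are ours, and the fixed learning rate `eta` stands in for the paper's adaptive tuning. The update minimizes the sum of (predicted) cumulative losses plus the barrier `-(1/eta) * sum_i log p_i`; the KKT conditions give `p_i = 1 / (eta * (z_i + lam))`, where `lam` is a normalizing multiplier found by bisection.

```python
import numpy as np

def log_barrier_ftrl(cum_loss, optimistic_pred, eta=0.5, iters=100):
    """One optimistic-FTRL step with log-barrier regularization.

    Solves  argmin_p  <L + m, p> - (1/eta) * sum_i log p_i
    over the simplex, where L = cum_loss and m = optimistic_pred.
    The stationarity condition yields p_i = 1 / (eta * (z_i + lam)),
    with lam chosen so the probabilities sum to one.
    """
    z = np.asarray(cum_loss, dtype=float) + np.asarray(optimistic_pred, dtype=float)
    n = len(z)
    # Feasibility requires lam > -min(z); sum_i p_i decreases in lam,
    # so bisect between a lower bound (sum > 1) and upper bound (sum <= 1).
    lo = -z.min() + 1e-12
    hi = n / eta - z.min()
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        if np.sum(1.0 / (eta * (z + lam))) > 1.0:
            lo = lam
        else:
            hi = lam
    p = 1.0 / (eta * (z + lam))
    return p / p.sum()  # renormalize away residual bisection error
```

With zero losses the update returns the uniform distribution, and arms with smaller (predicted) cumulative loss receive more mass; the log-barrier keeps every coordinate strictly positive, which is what drives the data-dependent bounds in the FTRL analysis.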
Problem

Research questions and friction points this paper is trying to address.

online tabular MDPs
regret bounds
data-dependent
variance-dependent
best-of-both-worlds
Innovation

Methods, ideas, or system contributions that make the work stand out.

best-of-both-worlds
data-dependent regret
variance-dependent regret
optimistic FTRL
log-barrier regularization
Mingyi Li
The University of Tokyo
Taira Tsuchiya
The University of Tokyo and RIKEN
Kenji Yamanishi
The University of Tokyo
data mining · data science · learning theory · information theory