Regret and Sample Complexity of Online Q-Learning via Concentration of Stochastic Approximation with Time-Inhomogeneous Markov Chains

📅 2026-02-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the absence of high-probability regret bounds for online Q-learning in infinite-horizon discounted MDPs that do not rely on optimistic mechanisms and are robust to suboptimality gaps. The paper establishes, for the first time, a high-probability regret bound for classical online Q-learning without optimism by introducing a smoothed exploration strategy that combines εₙ-greedy with Boltzmann exploration. To achieve this, the authors develop a high-probability concentration bound for stochastic approximation tailored to time-varying, non-homogeneous Markov chains and incorporate a mixing-time-dependent contraction factor in their analysis. They prove that Boltzmann Q-learning achieves sublinear regret under large suboptimality gaps, while the proposed smoothed strategy attains a gap-robust regret bound of near-Õ(N^{9/10}).
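The smoothed exploration rule summarized above can be sketched in code. This is a minimal illustration under one plausible reading of "combines εₙ-greedy with Boltzmann exploration" (act greedily with probability 1 − εₙ, otherwise sample from a softmax over the Q-estimates); the schedules and the exact combination are assumptions for illustration, not the paper's precise definition.

```python
import numpy as np

def smoothed_eps_greedy(q_row, eps_n, tau_n, rng):
    """Hypothetical Smoothed eps_n-Greedy rule: with probability
    1 - eps_n take the greedy action; otherwise sample from a
    Boltzmann (softmax) distribution at temperature tau_n."""
    if rng.random() < 1.0 - eps_n:
        return int(np.argmax(q_row))
    # Boltzmann exploration: softmax over Q-values at temperature tau_n
    logits = q_row / tau_n
    logits -= logits.max()          # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return int(rng.choice(len(q_row), p=probs))

rng = np.random.default_rng(0)
q_row = np.array([0.1, 0.9, 0.2])
actions = [smoothed_eps_greedy(q_row, eps_n=0.3, tau_n=0.5, rng=rng)
           for _ in range(1000)]
print(max(set(actions), key=actions.count))  # → 1 (the greedy action)
```

Because the softmax assigns every action positive probability, the exploration component is "smoothed" relative to plain εₙ-greedy, which is what makes the resulting trajectory a time-inhomogeneous Markov chain amenable to the paper's concentration analysis.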

📝 Abstract
We present the first high-probability regret bound for classical online Q-learning in infinite-horizon discounted Markov decision processes, without relying on optimism or bonus terms. We first analyze Boltzmann Q-learning with decaying temperature and show that its regret depends critically on the suboptimality gap of the MDP: for sufficiently large gaps, the regret is sublinear, while for small gaps it deteriorates and can approach linear growth. To address this limitation, we study a Smoothed $ε_n$-Greedy exploration scheme that combines $ε_n$-greedy and Boltzmann exploration, for which we prove a gap-robust regret bound of near-$\tilde{O}(N^{9/10})$. To analyze these algorithms, we develop a high-probability concentration bound for contractive Markovian stochastic approximation with iterate- and time-dependent transition dynamics. This bound may be of independent interest as the contraction factor in our bound is governed by the mixing time and is allowed to converge to one asymptotically.
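To make the Boltzmann Q-learning scheme from the abstract concrete, here is a toy sketch on a two-state deterministic MDP. The MDP, the step-size schedule, and the temperature schedule `tau_n = 1/log(n+1)` are illustrative assumptions, not the paper's choices; the sketch only shows the mechanics of softmax action selection with a decaying temperature driving an asynchronous Q-learning update.

```python
import numpy as np

GAMMA = 0.9  # discount factor

# Toy deterministic MDP (illustrative assumption, not the paper's setup):
# action 0 always moves to state 0 with reward 0,
# action 1 always moves to state 1 with reward 1.
NEXT = np.array([[0, 1], [0, 1]])        # NEXT[s, a] -> next state
REWARD = np.array([[0.0, 1.0], [0.0, 1.0]])

def boltzmann(q_row, tau, rng):
    """Sample an action from the softmax distribution at temperature tau."""
    logits = q_row / tau
    logits -= logits.max()               # numerical stability
    p = np.exp(logits)
    return int(rng.choice(len(q_row), p=p / p.sum()))

rng = np.random.default_rng(1)
Q = np.zeros((2, 2))
s = 0
for n in range(1, 20001):
    tau_n = max(1.0 / np.log(n + 1), 0.05)   # decaying temperature
    a = boltzmann(Q[s], tau_n, rng)
    s_next, r = NEXT[s, a], REWARD[s, a]
    alpha_n = 1.0 / (1 + n // 100)           # decaying step size
    Q[s, a] += alpha_n * (r + GAMMA * Q[s_next].max() - Q[s, a])
    s = s_next

print(Q.argmax(axis=1))  # learned greedy policy, here [1 1]
```

As the temperature decays, the behavior policy sharpens toward the greedy one, so the state-action process is a Markov chain whose transition kernel changes with both time and the current iterate — exactly the time-inhomogeneous setting the paper's concentration bound is built for.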
Problem

Research questions and friction points this paper is trying to address.

regret
online Q-learning
Markov decision processes
suboptimality gap
sample complexity
Innovation

Methods, ideas, or system contributions that make the work stand out.

online Q-learning
regret bound
Markovian stochastic approximation
gap-robust exploration
concentration inequality
Rahul Singh
Department of Electrical Engineering, Stanford University, Stanford, California 94305, USA
Siddharth Chandak
Stanford University
Multi-Agent Learning · Reinforcement Learning · Game Theory · Stochastic Approximation
Eric Moulines
Professor, Ecole Polytechnique; Member of the Académie des Sciences
Statistics · Machine Learning · Signal Processing
Vivek S. Borkar
Department of Electrical Engineering, Indian Institute of Technology Bombay, Mumbai 400076, India
Nicholas Bambos
Department of Electrical Engineering, Stanford University, Stanford, California 94305, USA