Achieving $\varepsilon^{-2}$ Dependence for Average-Reward Q-Learning with a New Contraction Principle

📅 2026-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of establishing non-asymptotic convergence guarantees for average-reward Q-learning, where the lack of contractivity of the Bellman operator is the fundamental obstacle. Existing approaches either rely on strong assumptions to enforce contraction or yield suboptimal sample complexity. Under a standard reachability assumption, the authors introduce a variant of Q-learning that samples from lazified dynamics and construct a novel instance-dependent seminorm under which the Bellman operator becomes a one-step contraction. Leveraging this contraction property, they establish, under significantly weaker assumptions than prior work, the optimal sample complexity of $\widetilde{O}(\varepsilon^{-2})$ (up to logarithmic factors) for both synchronous and asynchronous Q-learning, markedly improving upon prior results.
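
To make the lazy transformation concrete, the following is a minimal sketch in LaTeX. The labels are ours, not the paper's: $q$ for the fixed stay-in-place probability, $P$ for the original transition kernel, $T_q$ for the lazified Bellman operator, and $\gamma$ for the contraction modulus; the paper's exact seminorm construction is not reproduced here.

```latex
% Lazy transformation of the transition kernel (sketch; q is our label
% for the fixed stay-in-place probability mentioned in the abstract):
\[
  P_q(s' \mid s, a) \;=\; q\,\mathbf{1}\{s' = s\} \;+\; (1 - q)\,P(s' \mid s, a),
  \qquad q \in (0, 1).
\]
% The claimed contraction principle then reads, for an instance-dependent
% seminorm \(\|\cdot\|\) constructed in the paper,
\[
  \| T_q Q_1 - T_q Q_2 \| \;\le\; \gamma\,\| Q_1 - Q_2 \|,
  \qquad \gamma < 1,
\]
% where \(T_q\) is the average-reward Bellman operator of the lazified MDP.
```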

📝 Abstract
We present the convergence rates of synchronous and asynchronous Q-learning for average-reward Markov decision processes, where the absence of contraction poses a fundamental challenge. Existing non-asymptotic results overcome this challenge by either imposing strong assumptions to enforce seminorm contraction or relying on discounted or episodic Markov decision processes as successive approximations, which either require unknown parameters or result in suboptimal sample complexity. In this work, under a reachability assumption, we establish optimal $\widetilde{O}(\varepsilon^{-2})$ sample complexity guarantees (up to logarithmic factors) for a simple variant of synchronous and asynchronous Q-learning that samples from the lazified dynamics, where the system remains in the current state with some fixed probability. At the core of our analysis is the construction of an instance-dependent seminorm and showing that, after a lazy transformation of the Markov decision process, the Bellman operator becomes one-step contractive under this seminorm.
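
As a concrete illustration of the sampling scheme, here is a minimal synchronous sketch in Python. It follows the standard RVI-style average-reward Q-learning template; the laziness parameter $q$, the step size, and the reference-state offset are illustrative assumptions on our part, not the paper's exact algorithm or constants.

```python
import numpy as np

def lazified_q_learning(P, r, q=0.5, alpha=0.1, n_iters=100_000, seed=0):
    """Synchronous average-reward Q-learning on lazified dynamics (sketch).

    P : (S, A, S) transition tensor; r : (S, A) reward matrix.
    q : probability of remaining in the current state (laziness level).
    The RVI-style offset f(Q) = max_a Q(s_ref, a) is one common choice;
    the paper's exact reference function and step sizes may differ.
    """
    rng = np.random.default_rng(seed)
    S, A, _ = P.shape
    Q = np.zeros((S, A))
    s_ref = 0  # hypothetical reference state for the RVI offset
    for _ in range(n_iters):
        f = Q[s_ref].max()  # running estimate of the optimal average reward
        for s in range(S):
            for a in range(A):
                # Sample from the lazified kernel: stay put with prob. q,
                # otherwise draw the next state from P(. | s, a).
                if rng.random() < q:
                    s_next = s
                else:
                    s_next = rng.choice(S, p=P[s, a])
                td = r[s, a] - f + Q[s_next].max() - Q[s, a]
                Q[s, a] += alpha * td
    return Q
```

The lazification step is the only change relative to vanilla synchronous Q-learning: mixing the kernel toward the identity is what, per the abstract, renders the Bellman operator one-step contractive under the instance-dependent seminorm.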
Problem

Research questions and friction points this paper is trying to address.

average-reward MDP
Q-learning
contraction
sample complexity
non-asymptotic convergence
Innovation

Methods, ideas, or system contributions that make the work stand out.

average-reward Q-learning
contraction principle
lazified dynamics
instance-dependent seminorm
sample complexity