Non-Asymptotic Guarantees for Average-Reward Q-Learning with Adaptive Stepsizes

📅 2025-04-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper studies the finite-time, last-iterate convergence of asynchronous average-reward Q-learning. To address the lack of theoretical guarantees for this setting, the authors analyze an adaptive stepsize scheme based on local clocks for each state-action pair, revealing its implicit importance-sampling effect and showing that it is necessary for convergence to the correct target. A novel time-inhomogeneous Markovian reformulation transforms the resulting non-Markovian stochastic approximation into an analyzable time-varying Markov framework, combined with conditioning arguments and concentration inequalities for Markov chains. On this basis, the paper establishes, for the first time, an $O(1/k)$ mean-square convergence bound, measured in the span seminorm, for the last iterate toward the optimal relative Q-function. Furthermore, by incorporating a centering step, the algorithm achieves pointwise mean-square convergence to a centered optimal relative Q-function. These results provide the first rigorous finite-time last-iterate guarantee for asynchronous average-reward Q-learning.

📝 Abstract
This work presents the first finite-time analysis for the last-iterate convergence of average-reward Q-learning with an asynchronous implementation. A key feature of the algorithm we study is the use of adaptive stepsizes, which serve as local clocks for each state-action pair. We show that the iterates generated by this Q-learning algorithm converge at a rate of $O(1/k)$ (in the mean-square sense) to the optimal relative Q-function in the span seminorm. Moreover, by adding a centering step to the algorithm, we further establish pointwise mean-square convergence to a centered optimal relative Q-function, also at a rate of $O(1/k)$. To prove these results, we show that adaptive stepsizes are necessary, as without them, the algorithm fails to converge to the correct target. In addition, adaptive stepsizes can be interpreted as a form of implicit importance sampling that counteracts the effects of asynchronous updates. Technically, the use of adaptive stepsizes makes each Q-learning update depend on the entire sample history, introducing strong correlations and making the algorithm a non-Markovian stochastic approximation (SA) scheme. Our approach to overcoming this challenge involves (1) a time-inhomogeneous Markovian reformulation of non-Markovian SA, and (2) a combination of almost-sure time-varying bounds, conditioning arguments, and Markov chain concentration inequalities to break the strong correlations between the adaptive stepsizes and the iterates. The tools developed in this work are likely to be broadly applicable to the analysis of general SA algorithms with adaptive stepsizes.
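To make the adaptive-stepsize idea concrete, here is a minimal sketch of tabular average-reward Q-learning in which each state-action pair keeps its own visit counter (the "local clock") and is updated with stepsize $1/N(s,a)$. This is illustrative only: the RVI-style relative update with $\max Q$ as the gain proxy and the final mean-centering are standard stand-ins, not the paper's exact algorithm or centering step, and the function name is hypothetical.

```python
import numpy as np

def avg_reward_q_learning(P, R, num_steps=50_000, seed=0):
    """Average-reward Q-learning with local-clock adaptive stepsizes (sketch).

    P: transition probabilities, shape (S, A, S); R: rewards, shape (S, A).
    Each (s, a) pair has its own counter; the stepsize for an update to
    Q[s, a] is 1 / count[s, a], i.e. a local rather than a global clock.
    """
    rng = np.random.default_rng(seed)
    S, A, _ = P.shape
    Q = np.zeros((S, A))
    count = np.zeros((S, A), dtype=int)
    s = 0
    for _ in range(num_steps):
        a = int(rng.integers(A))            # uniform behavior policy: updates are asynchronous
        s_next = rng.choice(S, p=P[s, a])
        count[s, a] += 1
        alpha = 1.0 / count[s, a]           # adaptive stepsize: local clock for this (s, a)
        # RVI-style relative TD error: subtract max Q as a proxy for the average reward (gain)
        td = R[s, a] + Q[s_next].max() - Q.max() - Q[s, a]
        Q[s, a] += alpha * td
        s = s_next
    # illustrative centering: return a mean-zero Q, which is invariant in span seminorm
    return Q - Q.mean()
```

On a toy two-state MDP (action 0 stays, action 1 switches states, with reward 1 only for staying in state 1), the greedy policy extracted from the returned Q recovers the optimal behavior: switch in state 0, stay in state 1.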
Problem

Research questions and friction points this paper is trying to address.

Analyzes finite-time convergence of average-reward Q-learning with adaptive stepsizes
Proves O(1/k) convergence rate for optimal relative Q-function in span seminorm
Demonstrates the necessity of adaptive stepsizes for convergence to the correct target
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive stepsizes for asynchronous Q-learning convergence
Centering step ensures pointwise mean-square convergence
Non-Markovian SA analysis with time-inhomogeneous reformulation