🤖 AI Summary
This work establishes the almost-sure convergence of asynchronous Relative Value Iteration (RVI) Q-learning for average-reward reinforcement learning in weakly communicating semi-Markov decision processes (SMDPs). To this end, we extend the Borkar–Meyn stochastic approximation stability framework: (i) we establish a shadowing property under asynchronous updates; (ii) we introduce new monotonicity conditions for estimating the optimal average-reward rate; and (iii) we relax the usual noise assumptions to accommodate general non-i.i.d. disturbances. We prove that the algorithm converges almost surely in finite-space, weakly communicating SMDPs. This result broadens both the applicability and the theoretical guarantees of RVI-type algorithms, particularly under asynchrony and non-stationary dynamics, and provides a rigorous foundation for online policy optimization in complex, time-varying environments.
📝 Abstract
This paper studies asynchronous stochastic approximation (SA) algorithms and their theoretical application to reinforcement learning in semi-Markov decision processes (SMDPs) with an average-reward criterion. We first extend Borkar and Meyn's stability proof method to accommodate more general noise conditions, yielding broader convergence guarantees for asynchronous SA. To sharpen the convergence analysis, we further examine shadowing properties in the asynchronous setting, building on a dynamical-systems approach of Hirsch and Benaïm. Leveraging these SA results, we establish the convergence of an asynchronous SA analogue of Schweitzer's classical relative value iteration algorithm, RVI Q-learning, for finite-space, weakly communicating SMDPs. Moreover, to make full use of these SA results in this application, we introduce new monotonicity conditions for estimating the optimal reward rate in RVI Q-learning. These conditions substantially expand the previously considered algorithmic framework, and we address them with novel arguments in the stability and convergence analysis of RVI Q-learning.
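To make the algorithmic object concrete, here is a minimal tabular sketch of an asynchronous RVI Q-learning update for an average-reward SMDP. This is an illustrative sketch, not the paper's exact algorithm or conditions: the environment interface `sample`, the toy SMDP, and the particular reward-rate estimator `f(Q)` (the mean of the per-state maxima, one choice satisfying the shift property f(Q + c) = f(Q) + c) are all assumptions made for this example.

```python
import numpy as np

def rvi_q_learning_smdp(sample, n_states, n_actions,
                        n_steps=50_000, alpha=0.05, seed=0):
    """Tabular RVI Q-learning sketch for an average-reward SMDP.

    `sample(s, a, rng) -> (reward, holding_time, next_state)` is a
    hypothetical environment interface. The reward-rate estimate
    f(Q) = mean of per-state maxima is one illustrative choice.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    f = lambda q: q.max(axis=1).mean()  # reward-rate estimator
    s = 0
    for _ in range(n_steps):
        a = int(rng.integers(n_actions))  # uniform exploration (asynchronous:
        r, tau, s2 = sample(s, a, rng)    # only the visited pair is updated)
        # Relative-value update: reward minus rate estimate * holding time
        Q[s, a] += alpha * (r - f(Q) * tau + Q[s2].max() - Q[s, a])
        s = s2
    return Q, f(Q)

# Toy deterministic 2-state, 2-action SMDP with unit holding times
# (a plain MDP as a special case); its optimal reward rate is 1.0.
def toy_sample(s, a, rng):
    table = {(0, 0): (0.0, 1.0, 0), (0, 1): (0.0, 1.0, 1),
             (1, 0): (2.0, 1.0, 0), (1, 1): (1.0, 1.0, 1)}
    return table[(s, a)]

Q, rate = rvi_q_learning_smdp(toy_sample, n_states=2, n_actions=2)
```

On the toy SMDP the rate estimate settles near the optimal value 1.0. The paper's contribution is proving almost-sure convergence of updates of this kind under asynchrony, much weaker noise assumptions, and a broader class of rate estimators f than previously covered.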