🤖 AI Summary
This paper addresses the convergence of average-reward reinforcement learning in weakly communicating semi-Markov decision processes (SMDPs). We analyze RVI Q-learning, an asynchronous stochastic-approximation analogue of Schweitzer's classical relative value iteration algorithm. Methodologically, we introduce novel monotonicity conditions on the function used to estimate the optimal reward rate, substantially expanding the previously considered algorithmic framework; the analysis combines the authors' recent asynchronous stochastic-approximation results in the Borkar–Meyn framework with new stability and convergence arguments tailored to RVI Q-learning. The main result is an almost-sure convergence proof for this asynchronous algorithm on finite state spaces: the iterates converge to a compact, connected subset of solutions to the average-reward optimality equation, and under additional step-size and asynchrony conditions, to a unique solution that depends on the sample path. These results extend Schweitzer's classical RVI theory to the stochastic, asynchronous setting of weakly communicating SMDPs.
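To make the update concrete, below is a minimal tabular sketch of the RVI Q-learning recursion for SMDPs, in the spirit of the algorithm analyzed here. The reward-rate estimator `f`, the reference-state choice, and the function names are illustrative assumptions, not the paper's exact construction; the paper's monotonicity conditions admit a much broader class of estimators `f`.

```python
import numpy as np

def f(Q, ref_state=0):
    """Reward-rate estimator applied to the Q-table.

    The classical choice shown here, max_a Q(ref_state, a), is one
    member of the broader monotone class of estimators the paper allows.
    """
    return np.max(Q[ref_state])

def rvi_q_update(Q, s, a, reward, tau, s_next, alpha):
    """One asynchronous RVI Q-learning update for an SMDP transition.

    Only the visited component (s, a) is revised (asynchrony).
    `reward` is the reward accumulated over the transition, `tau` its
    (random) holding time, and `alpha` the step size.
    """
    target = reward - f(Q) * tau + np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Illustrative usage on a toy 3-state, 2-action table.
Q = np.zeros((3, 2))
Q = rvi_q_update(Q, s=0, a=1, reward=1.0, tau=2.0, s_next=2, alpha=0.1)
```

The fixed-reference-state choice of `f` above is only the classical example; the new monotonicity conditions are precisely about which functions `f` may take its place.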
📝 Abstract
This paper applies the authors' recent results on asynchronous stochastic approximation (SA) in the Borkar–Meyn framework to reinforcement learning in average-reward semi-Markov decision processes (SMDPs). We establish the convergence of an asynchronous SA analogue of Schweitzer's classical relative value iteration algorithm, RVI Q-learning, for finite-space, weakly communicating SMDPs. In particular, we show that the algorithm converges almost surely to a compact, connected subset of solutions to the average-reward optimality equation, with convergence to a unique, sample-path-dependent solution under additional stepsize and asynchrony conditions. Moreover, to make full use of the SA framework, we introduce new monotonicity conditions for estimating the optimal reward rate in RVI Q-learning. These conditions substantially expand the previously considered algorithmic framework and are addressed through novel arguments in the stability and convergence analysis of RVI Q-learning.
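For reference, the average-reward optimality equation that the iterates are shown to solve can be written in Q-factor form as follows. This is a standard formulation, not quoted from the paper; the notation for the expected transition reward, expected holding time, and optimal reward rate is assumed here for illustration.

```latex
% Average-reward optimality equation for an SMDP in Q-factor form.
% \bar{r}(s,a): expected reward accumulated over a transition from (s,a);
% \bar{\tau}(s,a): expected holding time; \rho^{*}: optimal reward rate.
\[
  q(s,a) \;=\; \bar{r}(s,a) \;-\; \rho^{*}\,\bar{\tau}(s,a)
  \;+\; \sum_{s'} p(s' \mid s,a)\,\max_{a'} q(s',a'),
  \qquad \text{for all } (s,a).
\]
```

In the weakly communicating case this equation need not have a solution that is unique up to an additive constant, which is why the convergence statement is to a compact, connected solution set rather than to a single point.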