🤖 AI Summary
This work establishes the almost-sure convergence of asynchronous Relative Value Iteration (RVI) Q-learning for average-reward reinforcement learning in weakly communicating semi-Markov decision processes (SMDPs). To this end, we extend the Borkar–Meyn stochastic approximation stability framework: (i) we establish a shadowing property under asynchronous updates; (ii) we introduce new monotonicity conditions for estimating the optimal average-reward rate; and (iii) we relax the usual noise assumptions to accommodate general non-i.i.d. disturbances. We prove that the algorithm converges almost surely in finite-space, weakly communicating SMDPs. This result broadens both the applicability and the theoretical guarantees of RVI-type algorithms, particularly under asynchrony and non-stationary dynamics, and provides a rigorous foundation for online policy optimization in complex, time-varying environments.
📝 Abstract
This paper studies asynchronous stochastic approximation (SA) algorithms and their theoretical application to reinforcement learning in semi-Markov decision processes (SMDPs) with an average-reward criterion. We first extend Borkar and Meyn's stability proof method to accommodate more general noise conditions, yielding broader convergence guarantees for asynchronous SA. To sharpen the convergence analysis, we further examine shadowing properties in the asynchronous setting, building on a dynamical-systems approach of Hirsch and Benaïm. Leveraging these SA results, we establish the convergence of an asynchronous SA analogue of Schweitzer's classical relative value iteration algorithm, RVI Q-learning, for finite-space, weakly communicating SMDPs. Moreover, to make full use of these SA results in this application, we introduce new monotonicity conditions for estimating the optimal reward rate in RVI Q-learning. These conditions substantially expand the previously considered algorithmic framework, and we address them with novel arguments in the stability and convergence analysis of RVI Q-learning.
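To make the algorithmic object concrete, here is a minimal tabular sketch of an asynchronous RVI Q-learning update for an average-reward SMDP. This is an illustrative sketch, not the paper's exact algorithm or conditions: the environment interface `sample`, the toy SMDP, and the particular reward-rate estimator `f(Q)` (the mean of the per-state maxima, one choice satisfying the shift property f(Q + c) = f(Q) + c) are all assumptions made for this example.

```python
import numpy as np

def rvi_q_learning_smdp(sample, n_states, n_actions,
                        n_steps=50_000, alpha=0.05, seed=0):
    """Tabular RVI Q-learning sketch for an average-reward SMDP.

    `sample(s, a, rng) -> (reward, holding_time, next_state)` is a
    hypothetical environment interface. The reward-rate estimate
    f(Q) = mean of per-state maxima is one illustrative choice.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    f = lambda q: q.max(axis=1).mean()  # reward-rate estimator
    s = 0
    for _ in range(n_steps):
        a = int(rng.integers(n_actions))  # uniform exploration (asynchronous:
        r, tau, s2 = sample(s, a, rng)    # only the visited pair is updated)
        # Relative-value update: reward minus rate estimate * holding time
        Q[s, a] += alpha * (r - f(Q) * tau + Q[s2].max() - Q[s, a])
        s = s2
    return Q, f(Q)

# Toy deterministic 2-state, 2-action SMDP with unit holding times
# (a plain MDP as a special case); its optimal reward rate is 1.0.
def toy_sample(s, a, rng):
    table = {(0, 0): (0.0, 1.0, 0), (0, 1): (0.0, 1.0, 1),
             (1, 0): (2.0, 1.0, 0), (1, 1): (1.0, 1.0, 1)}
    return table[(s, a)]

Q, rate = rvi_q_learning_smdp(toy_sample, n_states=2, n_actions=2)
```

On the toy SMDP the rate estimate settles near the optimal value 1.0. The paper's contribution is proving almost-sure convergence of updates of this kind under asynchrony, much weaker noise assumptions, and a broader class of rate estimators f than previously covered.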