🤖 AI Summary
This paper addresses the convergence of average-reward reinforcement learning in weakly communicating semi-Markov decision processes (SMDPs). We analyze RVI Q-learning, an asynchronous stochastic-approximation analogue of Schweitzer's classical relative value iteration algorithm. Methodologically, we introduce novel monotonicity conditions on the function used to estimate the optimal reward rate, substantially expanding the previously considered algorithmic framework; the analysis combines the authors' recent asynchronous stochastic-approximation results in the Borkar–Meyn framework with new stability and convergence arguments tailored to RVI Q-learning. The main result is an almost-sure convergence proof for this asynchronous algorithm on finite state spaces: the iterates converge to a compact, connected subset of solutions to the average-reward optimality equation, and under additional step-size and asynchrony conditions, to a unique solution that depends on the sample path. These results extend Schweitzer's classical RVI theory to the stochastic, asynchronous setting of weakly communicating SMDPs.
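To make the update concrete, below is a minimal tabular sketch of the RVI Q-learning recursion for SMDPs, in the spirit of the algorithm analyzed here. The reward-rate estimator `f`, the reference-state choice, and the function names are illustrative assumptions, not the paper's exact construction; the paper's monotonicity conditions admit a much broader class of estimators `f`.

```python
import numpy as np

def f(Q, ref_state=0):
    """Reward-rate estimator applied to the Q-table.

    The classical choice shown here, max_a Q(ref_state, a), is one
    member of the broader monotone class of estimators the paper allows.
    """
    return np.max(Q[ref_state])

def rvi_q_update(Q, s, a, reward, tau, s_next, alpha):
    """One asynchronous RVI Q-learning update for an SMDP transition.

    Only the visited component (s, a) is revised (asynchrony).
    `reward` is the reward accumulated over the transition, `tau` its
    (random) holding time, and `alpha` the step size.
    """
    target = reward - f(Q) * tau + np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Illustrative usage on a toy 3-state, 2-action table.
Q = np.zeros((3, 2))
Q = rvi_q_update(Q, s=0, a=1, reward=1.0, tau=2.0, s_next=2, alpha=0.1)
```

The fixed-reference-state choice of `f` above is only the classical example; the new monotonicity conditions are precisely about which functions `f` may take its place.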
📝 Abstract
This paper applies the authors' recent results on asynchronous stochastic approximation (SA) in the Borkar–Meyn framework to reinforcement learning in average-reward semi-Markov decision processes (SMDPs). We establish the convergence of an asynchronous SA analogue of Schweitzer's classical relative value iteration algorithm, RVI Q-learning, for finite-space, weakly communicating SMDPs. In particular, we show that the algorithm converges almost surely to a compact, connected subset of solutions to the average-reward optimality equation, with convergence to a unique, sample-path-dependent solution under additional stepsize and asynchrony conditions. Moreover, to make full use of the SA framework, we introduce new monotonicity conditions for estimating the optimal reward rate in RVI Q-learning. These conditions substantially expand the previously considered algorithmic framework and are addressed through novel arguments in the stability and convergence analysis of RVI Q-learning.
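For reference, the average-reward optimality equation that the iterates are shown to solve can be written in Q-factor form as follows. This is a standard formulation, not quoted from the paper; the notation for the expected transition reward, expected holding time, and optimal reward rate is assumed here for illustration.

```latex
% Average-reward optimality equation for an SMDP in Q-factor form.
% \bar{r}(s,a): expected reward accumulated over a transition from (s,a);
% \bar{\tau}(s,a): expected holding time; \rho^{*}: optimal reward rate.
\[
  q(s,a) \;=\; \bar{r}(s,a) \;-\; \rho^{*}\,\bar{\tau}(s,a)
  \;+\; \sum_{s'} p(s' \mid s,a)\,\max_{a'} q(s',a'),
  \qquad \text{for all } (s,a).
\]
```

In the weakly communicating case this equation need not have a solution that is unique up to an additive constant, which is why the convergence statement is to a compact, connected solution set rather than to a single point.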