AI Summary
Existing average-reward reinforcement learning algorithms suffer from suboptimal convergence, reliance on mixing-time or hitting-time priors, high iteration complexity, and poor scalability to large or infinite state spaces. To address these issues, we propose a novel natural Actor-Critic framework that integrates multilevel Monte Carlo (MLMC) gradient estimation with variance reduction techniques, under general policy parameterization. Our approach eliminates the need for mixing- or hitting-time assumptions and establishes the first global convergence guarantee for average-reward RL without such priors. We achieve the optimal-rate convergence bound of $\tilde{O}(1/\sqrt{T})$, where the rate is independent of the state-space size, enabling scalability to large or infinite domains. Empirical evaluations on average-reward MDPs demonstrate significant reductions in computational overhead and iteration complexity, while improving both scalability and practical applicability.
Abstract
This work examines average-reward reinforcement learning with general policy parameterization. Existing state-of-the-art (SOTA) guarantees for this problem are either suboptimal or hindered by several challenges, including poor scalability with respect to the size of the state-action space, high iteration complexity, and dependence on knowledge of mixing times and hitting times. To address these limitations, we propose a Multi-level Monte Carlo-based Natural Actor-Critic (MLMC-NAC) algorithm. Our work is the first to achieve a global convergence rate of $\tilde{\mathcal{O}}(1/\sqrt{T})$ for average-reward Markov Decision Processes (MDPs), where $T$ is the horizon length, without requiring knowledge of mixing and hitting times. Moreover, the convergence rate does not scale with the size of the state space, making it applicable even to infinite state spaces.
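To make the MLMC gradient-estimation idea concrete, here is a minimal sketch of the standard multilevel Monte Carlo estimator construction (a randomized level with a telescoping correction), not the paper's exact algorithm. The callable `single_grad` and the cap `j_max` are illustrative assumptions: `single_grad` stands in for one stochastic gradient sample drawn along a Markovian trajectory.

```python
import numpy as np

def mlmc_gradient(single_grad, rng, j_max=10):
    """Sketch of a multilevel Monte Carlo (MLMC) gradient estimator.

    single_grad() is a hypothetical callable returning one stochastic
    gradient sample (e.g., from one transition of the Markov chain).
    A random level J is drawn with P(J = j) = 2^{-j}; the telescoping
    correction makes the estimator's bias match that of averaging
    2^{j_max} samples while the expected sample cost stays O(j_max).
    """
    J = rng.geometric(p=0.5)      # random level, J in {1, 2, ...}
    base = single_grad()          # base level: a single sample
    if J > j_max:                 # truncate very deep (rare) levels
        return base
    samples = np.array([single_grad() for _ in range(2 ** J)])
    g_J = samples.mean(axis=0)                     # mean of 2^J samples
    g_Jm1 = samples[: 2 ** (J - 1)].mean(axis=0)   # mean of first 2^{J-1}
    # Correction term is reweighted by 2^J to offset P(J = j) = 2^{-j}.
    return base + (2 ** J) * (g_J - g_Jm1)
```

With a deterministic gradient the correction term vanishes and the estimator returns the true gradient exactly, which is a quick sanity check of the construction.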