A Sharper Global Convergence Analysis for Average Reward Reinforcement Learning via an Actor-Critic Approach

๐Ÿ“… 2024-07-26
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Existing average-reward reinforcement learning algorithms suffer from suboptimal convergence, reliance on mixing or hitting time priors, high iteration complexity, and poor scalability to large or infinite state spaces. To address these issues, we propose a novel natural Actor-Critic framework that integrates multilevel Monte Carlo (MLMC) gradient estimation with variance reduction techniques, under general policy parameterization. Our approach eliminates the need for mixing or hitting time assumptions and establishes the first global convergence guarantee for average-reward RL without such priors. We achieve the optimal-rate convergence bound of $\tilde{O}(1/\sqrt{T})$, where the rate is independent of the state-space size, enabling scalability to large or infinite domains. Empirical evaluations on average-reward MDPs demonstrate significant reductions in computational overhead and iteration complexity, while improving both scalability and practical applicability.
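The MLMC gradient-estimation idea referenced above can be illustrated with a generic sketch. The construction below is an assumption about the standard multilevel Monte Carlo estimator used in this line of work (it is not taken from the paper itself): draw a random level $J \sim \mathrm{Geom}(1/2)$, average $2^J$ samples, and reweight the telescoping correction by $2^J$, which trades a larger variance for an estimate whose bias shrinks without needing the mixing time in advance.

```python
import numpy as np

def mlmc_mean(sample_fn, j_max=10, rng=None):
    """Hypothetical sketch of an MLMC mean estimator.

    sample_fn: callable returning one (possibly correlated) scalar sample,
               e.g. a single-step policy-gradient term.
    j_max:     truncation level; at most 2**j_max samples are drawn.
    """
    if rng is None:
        rng = np.random.default_rng()
    # J ~ Geometric(1/2) (values >= 1), truncated at j_max levels.
    j = int(min(rng.geometric(0.5), j_max))
    n = 2 ** j
    x = np.array([sample_fn() for _ in range(n)])
    # Level-0 estimate plus the reweighted telescoping correction:
    # base + 2^J * (mean of 2^J samples - mean of first 2^(J-1) samples).
    base = x[0]
    correction = (2 ** j) * (x.mean() - x[: n // 2].mean())
    return base + correction
```

For i.i.d. samples the correction has zero mean, so averaging many such estimates recovers the true mean; the payoff of the construction appears with Markovian (correlated) samples, where it sidesteps choosing a burn-in from a known mixing time.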

๐Ÿ“ Abstract
This work examines average-reward reinforcement learning with general policy parametrization. Existing state-of-the-art (SOTA) guarantees for this problem are either suboptimal or hindered by several challenges, including poor scalability with respect to the size of the state-action space, high iteration complexity, and dependence on knowledge of mixing times and hitting times. To address these limitations, we propose a Multi-level Monte Carlo-based Natural Actor-Critic (MLMC-NAC) algorithm. Our work is the first to achieve a global convergence rate of $\tilde{\mathcal{O}}(1/\sqrt{T})$ for average-reward Markov Decision Processes (MDPs) (where $T$ is the horizon length), without requiring the knowledge of mixing and hitting times. Moreover, the convergence rate does not scale with the size of the state space, therefore even being applicable to infinite state spaces.
Problem

Research questions and friction points this paper is trying to address.

Address suboptimal guarantees in average-reward reinforcement learning
Overcome scalability and complexity issues in state-action spaces
Achieve global convergence without mixing or hitting time knowledge
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-level Monte Carlo-based Natural Actor-Critic algorithm
Global convergence rate without mixing times knowledge
Scalable to infinite state spaces
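The "Natural Actor-Critic" part of the contribution refers to preconditioning the policy gradient by the inverse Fisher information matrix. As a hedged, generic sketch (the update rule, damping, and parameter names here are illustrative assumptions, not the paper's exact algorithm), one natural-gradient step looks like:

```python
import numpy as np

def natural_gradient_step(theta, grad, fisher, lr=0.1, damping=1e-3):
    """Illustrative natural policy gradient update (not the paper's
    exact procedure): solve (F + damping*I) step = grad, then move
    the policy parameters along the preconditioned direction."""
    d = theta.size
    step = np.linalg.solve(fisher + damping * np.eye(d), grad)
    return theta + lr * step
```

In an actor-critic scheme, `grad` would be built from critic estimates of the advantage, and in the MLMC setting the gradient and Fisher estimates would come from multilevel samples rather than fixed-length trajectories.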
๐Ÿ”Ž Similar Papers
No similar papers found.