AI Summary
Existing average-reward reinforcement learning algorithms suffer from suboptimal convergence, reliance on mixing-time or hitting-time priors, high iteration complexity, and poor scalability to large or infinite state spaces. To address these issues, we propose a novel natural Actor-Critic framework that integrates multilevel Monte Carlo (MLMC) gradient estimation with variance reduction techniques, under general policy parameterization. Our approach eliminates the need for mixing- or hitting-time assumptions and establishes the first global convergence guarantee for average-reward RL without such priors. We achieve the optimal-rate convergence bound of $\tilde{O}(1/\sqrt{T})$, where the rate is independent of the state-space size, enabling scalability to large or infinite domains. Empirical evaluations on average-reward MDPs demonstrate significant reductions in computational overhead and iteration complexity, while improving both scalability and practical applicability.
Abstract
This work examines average-reward reinforcement learning with general policy parameterization. Existing state-of-the-art (SOTA) guarantees for this problem are either suboptimal or hindered by several challenges, including poor scalability with respect to the size of the state-action space, high iteration complexity, and dependence on knowledge of mixing times and hitting times. To address these limitations, we propose a Multi-level Monte Carlo-based Natural Actor-Critic (MLMC-NAC) algorithm. Our work is the first to achieve a global convergence rate of $\tilde{\mathcal{O}}(1/\sqrt{T})$ for average-reward Markov Decision Processes (MDPs), where $T$ is the horizon length, without requiring knowledge of mixing and hitting times. Moreover, the convergence rate does not scale with the size of the state space, making it applicable even to infinite state spaces.
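To make the MLMC gradient-estimation idea concrete, here is a minimal sketch of the standard multilevel Monte Carlo estimator construction (a randomized level with a telescoping correction), not the paper's exact algorithm. The callable `single_grad` and the cap `j_max` are illustrative assumptions: `single_grad` stands in for one stochastic gradient sample drawn along a Markovian trajectory.

```python
import numpy as np

def mlmc_gradient(single_grad, rng, j_max=10):
    """Sketch of a multilevel Monte Carlo (MLMC) gradient estimator.

    single_grad() is a hypothetical callable returning one stochastic
    gradient sample (e.g., from one transition of the Markov chain).
    A random level J is drawn with P(J = j) = 2^{-j}; the telescoping
    correction makes the estimator's bias match that of averaging
    2^{j_max} samples while the expected sample cost stays O(j_max).
    """
    J = rng.geometric(p=0.5)      # random level, J in {1, 2, ...}
    base = single_grad()          # base level: a single sample
    if J > j_max:                 # truncate very deep (rare) levels
        return base
    samples = np.array([single_grad() for _ in range(2 ** J)])
    g_J = samples.mean(axis=0)                     # mean of 2^J samples
    g_Jm1 = samples[: 2 ** (J - 1)].mean(axis=0)   # mean of first 2^{J-1}
    # Correction term is reweighted by 2^J to offset P(J = j) = 2^{-j}.
    return base + (2 ** J) * (g_J - g_Jm1)
```

With a deterministic gradient the correction term vanishes and the estimator returns the true gradient exactly, which is a quick sanity check of the construction.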