Regret Analysis of Average-Reward Unichain MDPs via an Actor-Critic Approach

📅 2025-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper studies online learning in infinite-horizon average-reward Markov decision processes (MDPs) under the unichain assumption, which is strictly weaker than conventional ergodicity and permits both transient states and periodicity. The authors propose NAC-B, a Natural Actor-Critic with Batching that combines natural policy gradients, function approximation for both the actor and the critic, batched updates, and a refined Markov chain mixing analysis. Crucially, they introduce two mixing-time constants, $C_{\text{hit}}$ and $C_{\text{tar}}$, that quantify how quickly empirical averages over Markovian samples converge, revealing how batching mitigates the bias and variance induced by periodicity. Theoretically, NAC-B achieves an order-optimal $\tilde{O}(\sqrt{T})$ regret bound, the first such result under the unichain assumption, thereby relaxing ergodicity requirements while preserving rigorous guarantees and scalability to large state-action spaces.

📝 Abstract
Actor-Critic methods are widely used for their scalability, yet existing theoretical guarantees for infinite-horizon average-reward Markov Decision Processes (MDPs) often rely on restrictive ergodicity assumptions. We propose NAC-B, a Natural Actor-Critic with Batching, that achieves order-optimal regret of $\tilde{O}(\sqrt{T})$ in infinite-horizon average-reward MDPs under the unichain assumption, which permits both transient states and periodicity. This assumption is among the weakest under which the classic policy gradient theorem remains valid for average-reward settings. NAC-B employs function approximation for both the actor and the critic, enabling scalability to problems with large state and action spaces. The use of batching in our algorithm helps mitigate potential periodicity in the MDP and reduces stochasticity in gradient estimates, and our analysis formalizes these benefits through the introduction of the constants $C_{\text{hit}}$ and $C_{\text{tar}}$, which characterize the rate at which empirical averages over Markovian samples converge to the stationary distribution.
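To make the abstract's ingredients concrete, here is a minimal tabular sketch of a batched natural actor-critic loop in the average-reward setting. This is an illustrative toy, not the paper's NAC-B: the MDP, step sizes, and batch size are invented, the critic is a differential SARSA-style TD learner rather than a neural network, and the natural-gradient step uses the known fact that for a tabular softmax policy the natural policy gradient moves logits along the advantage function.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-state, 2-action unichain MDP (illustrative only, not from the paper).
# P[s, a] = next-state distribution, R[s, a] = deterministic reward.
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.8, 0.2], [0.2, 0.8]]])
R = np.array([[1.0, 0.0],
              [0.0, 0.5]])

theta = np.zeros((2, 2))   # actor: softmax policy logits
q = np.zeros((2, 2))       # critic: differential action-value estimates
rho = 0.0                  # running estimate of the average reward

def policy(s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

s = 0
B = 64  # batch size: gradients are averaged over B Markovian samples
for it in range(300):
    grad = np.zeros_like(theta)
    for _ in range(B):
        a = rng.choice(2, p=policy(s))
        s2 = rng.choice(2, p=P[s, a])
        r = R[s, a]
        rho += 0.01 * (r - rho)                 # track the average reward
        a2 = rng.choice(2, p=policy(s2))
        td = r - rho + q[s2, a2] - q[s, a]      # differential TD error
        q[s, a] += 0.05 * td                    # critic update
        adv = q[s] - policy(s) @ q[s]           # advantage at current state
        grad[s] += adv / B                      # NPG for tabular softmax:
        s = s2                                  # logits move along advantage
    theta += 0.5 * grad                         # one actor step per batch
```

After training, the running average-reward estimate `rho` approaches the best achievable average reward for this toy chain; the batched actor step is what the paper's analysis studies at scale with function approximation.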
Problem

Research questions and friction points this paper is trying to address.

Analyzes regret in average-reward MDPs under unichain assumption
Proposes scalable actor-critic method with function approximation
Addresses periodicity and transient states via batching technique
Innovation

Methods, ideas, or system contributions that make the work stand out.

Natural Actor-Critic with Batching (NAC-B)
Function approximation for scalability
Batching reduces gradient estimate stochasticity
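A tiny numerical illustration of why batching helps under periodicity (the phenomenon the constants $C_{\text{hit}}$ and $C_{\text{tar}}$ quantify in general): in a period-2 chain, per-step samples oscillate forever and never approach the stationary distribution, while averages over batches spanning a full period hit the stationary mean exactly. The chain and function below are invented for illustration, not taken from the paper.

```python
import numpy as np

# Deterministic period-2 chain: the state alternates 0, 1, 0, 1, ...
# Its stationary distribution is (1/2, 1/2), so f(s) = s has stationary mean 0.5.
T = 1000
states = np.array([t % 2 for t in range(T)], dtype=float)

# Per-step samples oscillate between 0 and 1: no per-step convergence
# to stationarity, because the chain is periodic.
single = states

# Averages over batches of length B = 2 (one full period) are exact.
B = 2
batched = states.reshape(-1, B).mean(axis=1)

print(single[:4])   # [0. 1. 0. 1.]
print(batched[:4])  # [0.5 0.5 0.5 0.5]
```

The same effect, suitably generalized, is why batched gradient estimates in NAC-B remain well-behaved without assuming aperiodicity.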