🤖 AI Summary
This paper addresses the global convergence of infinite-horizon average-reward constrained Markov decision processes (CMDPs). To tackle the challenges posed by general policy parameterization and an unknown mixing time ($\tau_{\text{mix}}$) of the environment, we propose the first natural-policy-gradient primal-dual actor-critic algorithm with provable global convergence guarantees. Our method simultaneously achieves policy optimization and constraint satisfaction without prior knowledge of $\tau_{\text{mix}}$. When $\tau_{\text{mix}}$ is known, it attains an $\tilde{O}(1/\sqrt{T})$ convergence rate for both the optimality gap and the constraint violation, matching the fundamental lower bound for unconstrained MDPs. When $\tau_{\text{mix}}$ is unknown, it achieves an $\tilde{O}(1/T^{0.5-\epsilon})$ rate provided $T \geq \tilde{O}(\tau_{\text{mix}}^{2/\epsilon})$. To our knowledge, this is the first globally convergent algorithm for average-reward CMDPs that simultaneously achieves tight convergence rates, vanishing constraint violation, and robustness to an unknown mixing time.
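As a rough illustration of the primal-dual structure described above (a sketch, not the paper's exact algorithm), the snippet below shows one plausible shape of the update loop: a natural-policy-gradient ascent step on the Lagrangian in the primal variables and a projected descent step on the Lagrange multiplier. All concrete choices here (toy tabular softmax policy, random placeholder critic estimates, step sizes, threshold `b`) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A = 5, 3                # toy state/action space sizes (illustrative)
theta = np.zeros((S, A))   # softmax policy parameters
lam = 0.0                  # Lagrange multiplier (dual variable)
alpha, beta = 0.1, 0.05    # primal/dual step sizes (illustrative)
b = 0.2                    # constraint threshold: require J_c(theta) >= b

def policy(theta):
    """Softmax policy over actions for each state (used to sample actions)."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

for t in range(1000):
    # Critic placeholders: in an actor-critic algorithm these would be
    # TD-style estimates of the reward/cost advantages computed from sampled
    # trajectories, with lengths tuned to the (possibly unknown) tau_mix.
    adv_r = rng.normal(size=(S, A))   # stand-in reward advantage
    adv_c = rng.normal(size=(S, A))   # stand-in cost advantage
    J_c_hat = rng.normal(scale=0.1)   # stand-in average-cost estimate

    # Primal step: natural policy gradient ascent on the Lagrangian
    # L(theta, lam) = J_r(theta) + lam * (J_c(theta) - b).
    # For softmax parametrization the NPG direction reduces to the advantage
    # function itself, so no explicit Fisher-matrix inverse appears.
    theta += alpha * (adv_r + lam * adv_c)

    # Dual step: projected gradient descent on lam, keeping lam >= 0.
    lam = max(0.0, lam - beta * (J_c_hat - b))

print("final lambda:", lam)
print("final policy for state 0:", policy(theta)[0])
```

With real critic estimates in place of the random stand-ins, the dual variable grows while the cost constraint is violated (penalizing cost in the primal step) and decays toward zero once the constraint is satisfied.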
📝 Abstract
This paper investigates infinite-horizon average-reward Constrained Markov Decision Processes (CMDPs) with general parametrization. We propose a Primal-Dual Natural Actor-Critic algorithm that adeptly manages constraints while ensuring a high convergence rate. In particular, our algorithm achieves global convergence and constraint violation rates of $\tilde{\mathcal{O}}(1/\sqrt{T})$ over a horizon of length $T$ when the mixing time, $\tau_{\mathrm{mix}}$, is known to the learner. In the absence of knowledge of $\tau_{\mathrm{mix}}$, the achievable rates change to $\tilde{\mathcal{O}}(1/T^{0.5-\epsilon})$ provided that $T \geq \tilde{\mathcal{O}}\left(\tau_{\mathrm{mix}}^{2/\epsilon}\right)$. Our results match the theoretical lower bound for Markov Decision Processes and establish a new benchmark in the theoretical exploration of average-reward CMDPs.