Global Convergence for Average Reward Constrained MDPs with Primal-Dual Actor Critic Algorithm

📅 2025-05-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the global convergence of infinite-horizon average-reward constrained Markov decision processes (CMDPs). To tackle the challenges posed by general policy parameterization and unknown mixing time ($\tau_{\text{mix}}$) of the environment, we propose the first natural-policy-gradient primal-dual actor-critic algorithm with provable global convergence guarantees. Our method simultaneously achieves policy optimization and constraint satisfaction without prior knowledge of $\tau_{\text{mix}}$. When $\tau_{\text{mix}}$ is known, it attains an $\tilde{O}(1/\sqrt{T})$ convergence rate for both optimality gap and constraint violation, matching the fundamental lower bound for unconstrained MDPs. When $\tau_{\text{mix}}$ is unknown, it achieves $\tilde{O}(1/T^{0.5-\varepsilon})$ convergence provided $T \geq \tilde{O}(\tau_{\text{mix}}^{2/\varepsilon})$. To our knowledge, this is the first globally convergent algorithm for average-reward CMDPs that simultaneously achieves tight convergence rates, strict feasibility guarantees, and robustness to unknown mixing time.
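To make the primal-dual structure concrete, below is a minimal, hypothetical sketch of a primal-dual actor-critic loop on a toy tabular CMDP. It is not the paper's algorithm: the toy MDP (`P`, `r`, `c`), the constraint threshold `b`, all step sizes, and the dual cap `LAM_MAX` are invented for illustration, and the actor uses a plain policy-gradient step on the Lagrangian rather than the natural-gradient (Fisher-preconditioned) direction the paper analyzes. The average-reward TD critic and the projected dual ascent on the constraint are the structural ingredients the summary describes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-state, 2-action CMDP (all numbers hypothetical, for illustration)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.3, 0.7], [0.6, 0.4]]])   # P[s, a, s'] transition kernel
r = np.array([[1.0, 0.0], [0.0, 1.0]])     # reward r[s, a]
c = np.array([[0.0, 1.0], [1.0, 0.0]])     # cost   c[s, a]
b = 0.5                                    # constraint: long-run avg cost <= b
LAM_MAX = 10.0                             # projection bound for the dual variable

n_states, n_actions = r.shape
theta = np.zeros((n_states, n_actions))    # softmax policy parameters (primal)
lam = 0.0                                  # Lagrange multiplier (dual)
V = np.zeros(n_states)                     # critic: differential value of Lagrangian
rho = 0.0                                  # running estimate of avg Lagrangian reward
avg_cost = 0.0                             # running estimate of avg cost

alpha, beta, eta = 0.05, 0.05, 0.01        # critic, dual, actor step sizes
s = 0
for t in range(1, 20001):
    # sample an action from the softmax policy at state s
    logits = theta[s] - theta[s].max()
    pi = np.exp(logits) / np.exp(logits).sum()
    a = rng.choice(n_actions, p=pi)
    s_next = rng.choice(n_states, p=P[s, a])

    # Lagrangian reward: reward minus lam times cost
    L = r[s, a] - lam * c[s, a]

    # average-reward TD(0) critic update on the Lagrangian
    delta = L - rho + V[s_next] - V[s]
    V[s] += alpha * delta
    rho += alpha * delta

    # actor: plain policy-gradient ascent step (the paper's method
    # would precondition this direction with the Fisher information)
    grad_log = -pi.copy()
    grad_log[a] += 1.0                     # grad of log pi(a|s) for softmax
    theta[s] += eta * delta * grad_log

    # dual: projected gradient ascent on the constraint violation
    avg_cost += (c[s, a] - avg_cost) / t
    lam = np.clip(lam + beta * (avg_cost - b), 0.0, LAM_MAX)

    s = s_next
```

The dual variable `lam` rises while the empirical average cost exceeds `b` and decays toward zero once the constraint is satisfied, which is how the single loop trades off optimality gap against constraint violation.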

📝 Abstract
This paper investigates infinite-horizon average-reward Constrained Markov Decision Processes (CMDPs) with general parametrization. We propose a Primal-Dual Natural Actor-Critic algorithm that adeptly manages constraints while ensuring a high convergence rate. In particular, our algorithm achieves global convergence and constraint violation rates of $\tilde{\mathcal{O}}(1/\sqrt{T})$ over a horizon of length $T$ when the mixing time, $\tau_{\mathrm{mix}}$, is known to the learner. In the absence of knowledge of $\tau_{\mathrm{mix}}$, the achievable rates change to $\tilde{\mathcal{O}}(1/T^{0.5-\epsilon})$ provided that $T \geq \tilde{\mathcal{O}}\left(\tau_{\mathrm{mix}}^{2/\epsilon}\right)$. Our results match the theoretical lower bound for Markov Decision Processes and establish a new benchmark in the theoretical exploration of average-reward CMDPs.
Problem

Research questions and friction points this paper is trying to address.

Investigates infinite-horizon average-reward Constrained MDPs with general policy parametrization
Proposes a Primal-Dual Natural Actor-Critic algorithm that manages constraints while maintaining a fast convergence rate
Establishes global convergence and constraint-violation rates under both known and unknown mixing times
Innovation

Methods, ideas, or system contributions that make the work stand out.

Primal-Dual Natural Actor-Critic algorithm with general parametrization
Global convergence guarantees together with constraint-violation rate bounds
Handles unknown mixing time, with correspondingly adjusted rates