🤖 AI Summary
This paper addresses the global convergence of infinite-horizon average-reward constrained Markov decision processes (CMDPs). To tackle the challenges posed by general policy parameterization and an unknown mixing time ($\tau_{\text{mix}}$) of the environment, we propose the first natural-policy-gradient primal-dual actor-critic algorithm with provable global convergence guarantees. Our method simultaneously achieves policy optimization and constraint satisfaction without prior knowledge of $\tau_{\text{mix}}$. When $\tau_{\text{mix}}$ is known, it attains an $\tilde{O}(1/\sqrt{T})$ convergence rate for both the optimality gap and the constraint violation, matching the fundamental lower bound for unconstrained MDPs. When $\tau_{\text{mix}}$ is unknown, it achieves an $\tilde{O}(1/T^{0.5-\epsilon})$ rate provided $T \geq \tilde{O}(\tau_{\text{mix}}^{2/\epsilon})$. To our knowledge, this is the first globally convergent algorithm for average-reward CMDPs that simultaneously achieves tight convergence rates, vanishing constraint violation, and robustness to an unknown mixing time.
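As a rough illustration of the primal-dual structure described above (a sketch, not the paper's exact algorithm), the snippet below shows one plausible shape of the update loop: a natural-policy-gradient ascent step on the Lagrangian in the primal variables and a projected descent step on the Lagrange multiplier. All concrete choices here (toy tabular softmax policy, random placeholder critic estimates, step sizes, threshold `b`) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A = 5, 3                # toy state/action space sizes (illustrative)
theta = np.zeros((S, A))   # softmax policy parameters
lam = 0.0                  # Lagrange multiplier (dual variable)
alpha, beta = 0.1, 0.05    # primal/dual step sizes (illustrative)
b = 0.2                    # constraint threshold: require J_c(theta) >= b

def policy(theta):
    """Softmax policy over actions for each state (used to sample actions)."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

for t in range(1000):
    # Critic placeholders: in an actor-critic algorithm these would be
    # TD-style estimates of the reward/cost advantages computed from sampled
    # trajectories, with lengths tuned to the (possibly unknown) tau_mix.
    adv_r = rng.normal(size=(S, A))   # stand-in reward advantage
    adv_c = rng.normal(size=(S, A))   # stand-in cost advantage
    J_c_hat = rng.normal(scale=0.1)   # stand-in average-cost estimate

    # Primal step: natural policy gradient ascent on the Lagrangian
    # L(theta, lam) = J_r(theta) + lam * (J_c(theta) - b).
    # For softmax parametrization the NPG direction reduces to the advantage
    # function itself, so no explicit Fisher-matrix inverse appears.
    theta += alpha * (adv_r + lam * adv_c)

    # Dual step: projected gradient descent on lam, keeping lam >= 0.
    lam = max(0.0, lam - beta * (J_c_hat - b))

print("final lambda:", lam)
print("final policy for state 0:", policy(theta)[0])
```

With real critic estimates in place of the random stand-ins, the dual variable grows while the cost constraint is violated (penalizing cost in the primal step) and decays toward zero once the constraint is satisfied.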
📝 Abstract
This paper investigates infinite-horizon average-reward Constrained Markov Decision Processes (CMDPs) with general parametrization. We propose a Primal-Dual Natural Actor-Critic algorithm that adeptly manages constraints while ensuring a high convergence rate. In particular, our algorithm achieves global convergence and constraint violation rates of $\tilde{\mathcal{O}}(1/\sqrt{T})$ over a horizon of length $T$ when the mixing time, $\tau_{\mathrm{mix}}$, is known to the learner. In the absence of knowledge of $\tau_{\mathrm{mix}}$, the achievable rates change to $\tilde{\mathcal{O}}(1/T^{0.5-\epsilon})$ provided that $T \geq \tilde{\mathcal{O}}\left(\tau_{\mathrm{mix}}^{2/\epsilon}\right)$. Our results match the theoretical lower bound for Markov Decision Processes and establish a new benchmark in the theoretical exploration of average-reward CMDPs.