Convergence and sample complexity of natural policy gradient primal-dual methods for constrained MDPs

📅 2022-06-06
🏛️ arXiv.org
📈 Citations: 22
Influential: 0
📄 PDF
🤖 AI Summary
This paper studies the optimal control of discounted constrained Markov decision processes (CMDPs) with long-term utility constraints, aiming to maximize the expected cumulative reward subject to an average utility constraint. We propose the natural policy gradient primal–dual (NPG-PD) algorithm, which jointly performs natural policy gradient updates and projected subgradient updates of the Lagrange multiplier. We establish, for the first time, global convergence of NPG-PD under softmax policies with dimension-free rates. The analysis is extended to log-linear and general smooth policy classes, yielding convergence rates that explicitly account for function approximation error. Moreover, we provide the first sample complexity upper bound for CMDPs under natural policy gradients. Theoretically, both the optimality gap and constraint violation decay at $O(1/sqrt{T})$. Empirical results corroborate the algorithm’s effectiveness and robustness across benchmark domains.
📝 Abstract
We study sequential decision making problems aimed at maximizing the expected total reward while satisfying a constraint on the expected total utility. We employ the natural policy gradient method to solve the discounted infinite-horizon optimal control problem for Constrained Markov Decision Processes (constrained MDPs). Specifically, we propose a new Natural Policy Gradient Primal-Dual (NPG-PD) method that updates the primal variable via natural policy gradient ascent and the dual variable via projected sub-gradient descent. Although the underlying maximization involves a nonconcave objective function and a nonconvex constraint set, under the softmax policy parametrization we prove that our method achieves global convergence with sublinear rates regarding both the optimality gap and the constraint violation. Such convergence is independent of the size of the state-action space, i.e., it is~dimension-free. Furthermore, for log-linear and general smooth policy parametrizations, we establish sublinear convergence rates up to a function approximation error caused by restricted policy parametrization. We also provide convergence and finite-sample complexity guarantees for two sample-based NPG-PD algorithms. Finally, we use computational experiments to showcase the merits and the effectiveness of our approach.
Problem

Research questions and friction points this paper is trying to address.

Maximizing expected total reward under utility constraints in MDPs
Analyzing convergence rates for optimality gap and constraint violations
Establishing dimension-free sample complexity for policy gradient methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Natural Policy Gradient Primal-Dual method for constrained MDPs
Global convergence with sublinear rates for optimality gap
Dimension-free convergence independent of state-action space
🔎 Similar Papers
No similar papers found.