Augmented Lagrangian Method for Last-Iterate Convergence for Constrained MDPs

📅 2026-05-12

🤖 AI Summary
This work addresses the challenge of last-iterate convergence in policy optimization for constrained Markov decision processes (CMDPs) by proposing a general framework based on the classical inexact augmented Lagrangian method. In the tabular setting, the framework solves each augmented Lagrangian subproblem with Projected Q-Ascent (PQA), which avoids the computational and memory overhead of deploying a mixture policy. It provides global last-iterate convergence guarantees for tabular and log-linear policy classes, and the framework is shown to scale to nonlinear policies. Empirical results indicate that the method converges on par with existing algorithms across both discrete and continuous control tasks while supporting complex policy representations.
📝 Abstract
We study policy optimization for infinite-horizon, discounted constrained Markov decision processes (CMDPs). While existing theoretical guarantees typically hold for the mixture policy, deploying such a policy is computationally and memory intensive. This leads to a practical mismatch where a single (last-iterate) policy must be deployed. Recent theoretical works have thus focused on proving last-iterate convergence, but are largely limited to the tabular setting or to algorithmic variants that are rarely used in practice. To address this, we use the classic inexact augmented Lagrangian ($\texttt{AL}$) method from constrained optimization, and propose a general framework with provable last-iterate convergence for CMDPs. We first focus on the tabular setting and propose to solve the $\texttt{AL}$ sub-problem with projected Q-ascent ($\texttt{PQA}$). Combining the theoretical guarantees of $\texttt{PQA}$ and the standard $\texttt{AL}$ analysis enables us to establish global last-iterate convergence. We generalize these results to handle log-linear policies, and demonstrate that an efficient, projected variant of $\texttt{PQA}$ can achieve last-iterate convergence with guarantees comparable to prior work. Finally, we demonstrate that our framework scales to complex non-linear policies, and evaluate it on continuous control tasks.
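To make the outer/inner structure described in the abstract concrete, here is a minimal sketch of an inexact augmented Lagrangian loop on a hypothetical one-state, two-action problem. It is not the paper's algorithm. The toy rewards, the constraint threshold, the penalty `rho`, the step size `eta`, and the function names are all assumptions made for illustration, and a plain projected gradient step on the probability simplex stands in for the PQA sub-problem solver. The augmented Lagrangian used is the standard one for an inequality constraint g(pi) >= 0.

```python
def project_simplex(v):
    """Euclidean projection of a vector onto the probability simplex."""
    u = sorted(v, reverse=True)
    css, tau = 0.0, 0.0
    for k, u_k in enumerate(u, start=1):
        css += u_k
        t = (css - 1.0) / k
        if u_k - t > 0:          # condition holds for a prefix of k; keep the last
            tau = t
    return [max(x - tau, 0.0) for x in v]


def solve_toy_cmdp(rho=5.0, eta=0.1, outer_iters=20, inner_iters=200):
    """Inexact augmented Lagrangian on a hypothetical one-state problem.

    maximize   reward . pi
    subject to utility . pi >= threshold,  pi in the simplex
    """
    reward = [1.0, 0.0]      # assumed toy rewards
    utility = [0.0, 1.0]     # assumed toy constraint utilities
    threshold = 0.5          # assumed constraint level
    pi = [0.5, 0.5]          # current (last-iterate) policy
    lam = 0.0                # dual variable

    def slack(p):
        return sum(u * x for u, x in zip(utility, p)) - threshold

    for _ in range(outer_iters):
        # Inner loop: approximately maximize the augmented Lagrangian in pi.
        for _ in range(inner_iters):
            mult = max(0.0, lam - rho * slack(pi))
            grad = [r + mult * u for r, u in zip(reward, utility)]
            pi = project_simplex([x + eta * g for x, g in zip(pi, grad)])
        # Outer step: multiplier update using the current policy only.
        lam = max(0.0, lam - rho * slack(pi))
    return pi, lam


if __name__ == "__main__":
    policy, multiplier = solve_toy_cmdp()
    print(policy, multiplier)
```

On this toy problem the constrained optimum splits probability evenly between the two actions, so the returned policy should approach that split. Note that only the final policy is returned: no average or mixture over earlier iterates is stored, which is the deployment setting the abstract is concerned with.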
Problem

Research questions and friction points this paper is trying to address.

constrained MDPs
last-iterate convergence
policy optimization
augmented Lagrangian
reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Augmented Lagrangian
last-iterate convergence
constrained MDPs
policy optimization
projected Q-ascent