Augmented Lagrangian Method for Last-Iterate Convergence for Constrained MDPs

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses last-iterate convergence in policy optimization for constrained Markov decision processes (CMDPs) by proposing a general framework based on the classical inexact augmented Lagrangian (AL) method. It is the first to systematically apply the classical augmented Lagrangian approach to CMDPs. In the tabular setting, the framework solves the AL subproblem with projected Q-ascent (PQA), so a single last-iterate policy can be deployed without the computational and storage overhead of a mixture policy. It provides global last-iterate convergence guarantees for tabular policies and, through an efficient projected variant of PQA, for log-linear policies; for complex nonlinear policies the framework is shown to scale empirically. Empirical results on discrete and continuous control tasks show convergence on par with existing algorithms while supporting complex policy representations.
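
For orientation, the textbook augmented Lagrangian for a single inequality constraint can be written for a CMDP as below. This is a sketch of the classical form only; the notation ($J_r$ for the expected discounted reward, $J_c$ for the constrained return, threshold $b$, penalty parameter $\rho$, multiplier $\lambda$) is assumed here and may differ from the paper's.

$$
\max_{\pi} \; J_r(\pi) \quad \text{s.t.} \quad J_c(\pi) \ge b
$$

$$
\mathcal{L}_\rho(\pi, \lambda) = J_r(\pi) - \frac{1}{2\rho}\Big( \max\{0,\; \lambda + \rho\,(b - J_c(\pi))\}^2 - \lambda^2 \Big)
$$

$$
\pi_{k+1} \approx \arg\max_{\pi} \mathcal{L}_\rho(\pi, \lambda_k), \qquad \lambda_{k+1} = \max\{0,\; \lambda_k + \rho\,(b - J_c(\pi_{k+1}))\}
$$

The "inexact" qualifier refers to solving the inner maximization only approximately; the policy returned after the final outer step is the last iterate.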

📝 Abstract

We study policy optimization for infinite-horizon, discounted constrained Markov decision processes (CMDPs). While existing theoretical guarantees typically hold for the mixture policy, deploying such a policy is computationally and memory intensive. This leads to a practical mismatch where a single (last-iterate) policy must be deployed. Recent theoretical works have thus focused on proving last-iterate convergence, but are largely limited to the tabular setting or to algorithmic variants that are rarely used in practice. To address this, we use the classic inexact augmented Lagrangian ($\texttt{AL}$) method from constrained optimization, and propose a general framework with provable last-iterate convergence for CMDPs. We first focus on the tabular setting and propose to solve the $\texttt{AL}$ sub-problem with projected Q-ascent ($\texttt{PQA}$). Combining the theoretical guarantees of $\texttt{PQA}$ with the standard $\texttt{AL}$ analysis enables us to establish global last-iterate convergence. We generalize these results to handle log-linear policies, and demonstrate that an efficient, projected variant of $\texttt{PQA}$ can achieve last-iterate convergence with guarantees comparable to prior work. Finally, we demonstrate that our framework scales to complex non-linear policies, and evaluate it on continuous control tasks.
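
The tabular recipe described in the abstract (an outer augmented Lagrangian loop with an inner projected Q-ascent solver) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the paper's algorithm: the function names (`project_simplex`, `evaluate`, `al_pqa`), the exact policy evaluation by linear solve, the textbook multiplier update, the use of `Q_r + mult * Q_c` as the ascent direction, and the hyperparameters `rho`, `eta`, `outer`, `inner`. The constraint is taken to be `J_c(pi) >= b`.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of a vector onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(v) + 1)
    idx = np.nonzero(u - css / k > 0)[0][-1]
    tau = css[idx] / (idx + 1)
    return np.maximum(v - tau, 0.0)

def evaluate(pi, P, reward, gamma, mu0):
    """Exact policy evaluation for one signal: returns (J, Q)."""
    n_states = P.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state kernel under pi
    r_pi = np.sum(pi * reward, axis=1)      # expected one-step signal
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    Q = reward + gamma * P @ V              # shape (states, actions)
    return mu0 @ V, Q

def al_pqa(P, r, c, b, gamma, mu0, rho=1.0, eta=0.1, outer=20, inner=50):
    """Hypothetical sketch: inexact AL outer loop, projected Q-ascent inner loop."""
    n_states, n_actions = r.shape
    pi = np.full((n_states, n_actions), 1.0 / n_actions)
    lam = 0.0
    for _ in range(outer):
        for _ in range(inner):
            _, Q_r = evaluate(pi, P, r, gamma, mu0)
            J_c, Q_c = evaluate(pi, P, c, gamma, mu0)
            # Effective multiplier from differentiating the AL (assumed form).
            mult = max(0.0, lam + rho * (b - J_c))
            Q_al = Q_r + mult * Q_c
            # Per-state ascent step on Q, then projection onto the simplex.
            pi = np.array([project_simplex(pi[s] + eta * Q_al[s])
                           for s in range(n_states)])
        J_c, _ = evaluate(pi, P, c, gamma, mu0)
        lam = max(0.0, lam + rho * (b - J_c))  # multiplier update
    return pi, lam

# Toy one-state CMDP: action 0 earns reward, action 1 earns constraint signal.
P = np.ones((1, 2, 1))
r = np.array([[1.0, 0.0]])
c = np.array([[0.0, 1.0]])
pi, lam = al_pqa(P, r, c, b=5.0, gamma=0.9, mu0=np.array([1.0]))
```

In this toy problem the constrained return is `10 * pi[0, 1]`, so the constraint `J_c >= 5` is tight at the uniform policy; the returned single policy `pi` approaches `[[0.5, 0.5]]` and `lam` approaches `1.0`, with no averaging over past iterates.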

Problem

Research questions and friction points this paper is trying to address.

constrained MDPs

last-iterate convergence

policy optimization

augmented Lagrangian

reinforcement learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

augmented Lagrangian

last-iterate convergence

constrained MDPs