Near-Constant Strong Violation and Last-Iterate Convergence for Online CMDPs via Decaying Safety Margins

πŸ“… 2026-02-11
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses a central challenge in online constrained Markov decision processes (CMDPs): existing methods struggle to achieve sublinear reward regret and bounded constraint violation simultaneously. The authors propose FlexDOME, an algorithm that integrates time-varying safety margins and margin-based regularization into a primal-dual framework. FlexDOME is the first to guarantee non-asymptotic last-iterate convergence while attaining near-constant ($\tilde{O}(1)$) cumulative constraint violation and sublinear reward regret. The theoretical analysis combines a policy-dual Lyapunov function with asymptotic dominance techniques to establish these guarantees. Empirical evaluations further demonstrate the algorithm's effectiveness in practice.
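The summary describes the mechanism only at a high level. As a concrete illustration, the toy Python sketch below runs a primal-dual update against a constraint threshold tightened by a decaying margin and accumulates the non-cancelling ("strong") violation. Everything concrete here (the scalar policy parameter `theta`, the toy reward and cost functions, both step sizes, and the $t^{-1/4}$ margin schedule) is an assumption for illustration, not FlexDOME's actual update rules.

```python
# Hypothetical sketch of a primal-dual update with a decaying safety
# margin, in the spirit of the approach summarized above. All concrete
# choices below are illustrative assumptions, not the paper's algorithm.

T = 10_000          # horizon
b = 0.5             # constraint budget: we want cost(theta) <= b
eta_theta = 0.05    # primal step size (assumed)
eta_lam = 0.05      # dual step size (assumed)

def reward(theta):
    # Toy concave reward standing in for the true objective.
    return 1.0 - (theta - 1.0) ** 2

def cost(theta):
    # Toy convex cost; the safe region is where cost(theta) <= b.
    return theta ** 2

def grad(f, theta, h=1e-5):
    # Central-difference gradient, purely for illustration.
    return (f(theta + h) - f(theta - h)) / (2 * h)

theta, lam = 0.0, 0.0
strong_violation = 0.0
for t in range(1, T + 1):
    # Decaying safety margin: tighten the constraint to b - eps_t so
    # early iterates stay conservative; eps_t -> 0 restores optimality.
    # The paper schedules this margin to dominate the decay rates of
    # the optimization and statistical errors; t**(-0.25) is a guess.
    eps_t = t ** (-0.25)
    # Gradient ascent on the Lagrangian reward(theta) - lam * cost(theta).
    g = grad(reward, theta) - lam * grad(cost, theta)
    theta += eta_theta * g
    # Projected dual ascent against the margin-tightened threshold.
    lam = max(0.0, lam + eta_lam * (cost(theta) - (b - eps_t)))
    # Strong (non-cancelling) violation counts only positive excesses.
    strong_violation += max(0.0, cost(theta) - b)

print(f"cumulative strong violation: {strong_violation:.3f}")
```

Tightening the dual update with the margin, rather than the true budget $b$, is what keeps the iterates on the feasible side while the margin decays; counting only positive excesses matches the strong-violation metric the paper targets.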

πŸ“ Abstract
We study safe online reinforcement learning in Constrained Markov Decision Processes (CMDPs) under strong regret and violation metrics, which forbid error cancellation over time. Existing primal-dual methods that achieve sublinear strong reward regret inevitably incur growing strong constraint violation or are restricted to average-iterate convergence due to inherent oscillations. To address these limitations, we propose the Flexible safety Domain Optimization via Margin-regularized Exploration (FlexDOME) algorithm, the first to provably achieve near-constant $\tilde{O}(1)$ strong constraint violation alongside sublinear strong regret and non-asymptotic last-iterate convergence. FlexDOME incorporates time-varying safety margins and regularization terms into the primal-dual framework. Our theoretical analysis relies on a novel term-wise asymptotic dominance strategy, where the safety margin is rigorously scheduled to asymptotically majorize the functional decay rates of the optimization and statistical errors, thereby clamping cumulative violations to a near-constant level. Furthermore, we establish non-asymptotic last-iterate convergence guarantees via a policy-dual Lyapunov argument. Experiments corroborate our theoretical findings.
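The abstract does not restate the strong metrics it refers to. One standard formalization from the CMDP literature, included here as an assumption about what "forbid error cancellation over time" means, takes positive parts inside the sums:

```latex
% A common formalization of the strong metrics (an assumption; the
% paper's exact definitions may differ in detail). The positive part
% [x]_+ = max(x, 0) prevents surpluses at some rounds from cancelling
% deficits at others.
\[
  \mathrm{Reg}^{+}(T) = \sum_{t=1}^{T} \bigl[ V^{r}(\pi^{*}) - V^{r}(\pi_{t}) \bigr]_{+},
  \qquad
  \mathrm{Vio}^{+}(T) = \sum_{t=1}^{T} \bigl[ V^{c}(\pi_{t}) - b \bigr]_{+},
\]
% where \pi^* is the best policy satisfying the constraint, V^r and V^c
% are the reward and cost value functions, and b is the cost budget.
```

Under such metrics, a slightly infeasible iterate can never be "paid back" by an overly conservative one later, which is why bounding $\mathrm{Vio}^{+}(T)$ by $\tilde{O}(1)$ is a strictly stronger guarantee than bounding the signed cumulative violation.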
Problem

Research questions and friction points this paper is trying to address.

Constrained Markov Decision Processes
online reinforcement learning
strong regret
strong constraint violation
last-iterate convergence
Innovation

Methods, ideas, or system contributions that make the work stand out.

decaying safety margins
strong constraint violation
last-iterate convergence
primal-dual reinforcement learning
non-asymptotic convergence
πŸ”Ž Similar Papers