🤖 AI Summary
This paper addresses the fundamental problem of safe control under uncertainty by studying Robust Constrained Markov Decision Processes (RCMDPs), which seek near-optimal policies that satisfy safety constraints and minimize cumulative cost under worst-case model uncertainty. To overcome key limitations of conventional Lagrangian-based approaches, namely susceptibility to local optima and conflicts between the gradients of the objective and the constraints, the authors propose a novel epigraph reformulation of the RCMDP and design a hybrid solution framework that integrates robust policy gradients with binary search. Theoretically, they prove that the algorithm converges to an ε-optimal feasible policy within Õ(ε⁻⁴) robust policy evaluations, the first result to provide both an explicit convergence rate and a provable near-optimality guarantee for RCMDPs.
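To make the reformulation concrete, here is a generic sketch of the epigraph form; the notation is ours for illustration and may differ from the paper's ($J_0$ for the robust objective return, $J_i \le b_i$ for the robust constraints, $\mathcal{P}$ for the uncertainty set):

$$
\begin{aligned}
&\textbf{RCMDP:} && \min_{\pi}\ \max_{P \in \mathcal{P}} J_0(\pi, P)
\quad \text{s.t.}\quad \max_{P \in \mathcal{P}} J_i(\pi, P) \le b_i,\ \ i = 1,\dots,m,\\
&\textbf{Epigraph form:} && \min_{b \in \mathbb{R}}\ b
\quad \text{s.t.}\quad \min_{\pi}\ \max\Big\{ \max_{P \in \mathcal{P}} J_0(\pi, P) - b,\ \max_{i} \big(\max_{P \in \mathcal{P}} J_i(\pi, P) - b_i\big) \Big\} \le 0.
\end{aligned}
$$

A subgradient of the inner pointwise maximum is the gradient of whichever single term is active, so the inner policy-gradient loop never sums conflicting objective and constraint gradients; and because achievability of a target level $b$ is monotone in $b$, the outer scalar problem can be solved by bisection.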
📝 Abstract
Designing a safe policy for uncertain environments is crucial in real-world control systems. However, this challenge remains inadequately addressed within the Markov decision process (MDP) framework. This paper presents the first algorithm guaranteed to identify a near-optimal policy in a robust constrained MDP (RCMDP), where an optimal policy minimizes cumulative cost while satisfying constraints in the worst-case scenario across a set of environments. We first prove that the conventional policy gradient approach to the Lagrangian max-min formulation can become trapped in suboptimal solutions. This occurs when its inner minimization encounters a sum of conflicting gradients from the objective and constraint functions. To address this, we leverage the epigraph form of the RCMDP problem, which resolves the conflict by selecting a single gradient from either the objective or the constraints. Building on the epigraph form, we propose a bisection search algorithm with a policy gradient subroutine and prove that it identifies an $\varepsilon$-optimal policy in an RCMDP with $\widetilde{\mathcal{O}}(\varepsilon^{-4})$ robust policy evaluations.
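For intuition only, the following minimal Python sketch shows that outer-bisection / inner-policy-gradient control flow; it is not the paper's implementation. `phi_terms`, `pg_step`, and the scalar toy problem below are hypothetical stand-ins for robust policy evaluation and a robust policy-gradient step.

```python
import numpy as np

def epigraph_bisection(phi_terms, pg_step, pi0, b_lo, b_hi,
                       tol=1e-3, inner_iters=2000):
    """Outer bisection over the epigraph threshold b.

    phi_terms(pi, b) -> array of epigraph terms
        [J_0(pi) - b, J_1(pi) - b_1, ..., J_m(pi) - b_m],
    each J_i a worst-case (robust) value; pg_step(pi, b) performs one
    subgradient step on max(phi_terms), i.e. it follows the gradient of
    the single active term, avoiding summed conflicting gradients.
    """
    pi_best, b_best = pi0, b_hi
    while b_hi - b_lo > tol:
        b = 0.5 * (b_lo + b_hi)
        pi, best_val, best_pi = pi0, np.inf, pi0
        for _ in range(inner_iters):           # inner robust policy gradient
            pi = pg_step(pi, b)
            val = phi_terms(pi, b).max()
            if val < best_val:
                best_val, best_pi = val, pi
        if best_val <= 0.0:                    # level b achievable: lower it
            b_hi, pi_best, b_best = b, best_pi, b
        else:                                  # not achievable: raise it
            b_lo = b
    return pi_best, b_best

# Toy check with a scalar "policy": minimize J_0(pi) = pi^2 subject to
# J_1(pi) = 1 - pi <= 0 (i.e. pi >= 1); the optimum is pi* = 1, b* = 1.
phi = lambda pi, b: np.array([pi ** 2 - b, 1.0 - pi])

def step(pi, b, lr=0.01):
    grads = np.array([2.0 * pi, -1.0])             # per-term gradients
    return pi - lr * grads[np.argmax(phi(pi, b))]  # follow active term only

pi_star, b_star = epigraph_bisection(phi, step, pi0=0.0, b_lo=0.0, b_hi=10.0)
print(pi_star, b_star)  # both approach 1.0 (up to step-size error)
```

In the paper's setting, the toy gradient step would be replaced by a robust policy gradient computed from robust policy evaluations, which is where the $\widetilde{\mathcal{O}}(\varepsilon^{-4})$ evaluation count arises; the sketch captures only the control flow.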