Beyond Slater's Condition in Online CMDPs with Stochastic and Adversarial Constraints

📅 2025-09-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper studies online episodic constrained Markov decision processes (CMDPs) under **mixed stochastic and adversarial constraints**, relaxing the conventional Slater condition (i.e., existence of a strictly feasible policy). We propose an **online algorithm based on mirror descent with dual updates at two time scales**, integrating stochastic gradient estimation and adversarial robustness control. Theoretically, without assuming the Slater condition, our algorithm simultaneously achieves $O(\sqrt{T})$ **cumulative regret** and **constraint violation bound**. For adversarial constraints, we further introduce a *strong violation* metric and establish sublinear $\alpha$-regret and constraint violation guarantees. Synthetic experiments demonstrate that the algorithm converges rapidly and maintains robustness even when initialized in an infeasible region.
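For readers unfamiliar with the condition being relaxed, it can be written out as follows. The notation is an assumption made here for illustration and is not taken from the paper: $g_i(\pi)$ denotes the expected cost of policy $\pi$ under constraint $i$, $b_i$ its threshold, and $m$ the number of constraints.

$$
\text{Slater's condition:}\quad \exists\, \pi^{\circ},\ \rho > 0 \ \text{ such that } \ g_i(\pi^{\circ}) \le b_i - \rho \quad \text{for all } i \in \{1, \dots, m\}.
$$

Guarantees that rely on this condition typically depend on the margin $\rho$ and degrade as $\rho \to 0$. Dropping the condition covers the case where feasible policies exist but every one of them satisfies at least one constraint with equality, so no strictly feasible policy is available.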

📝 Abstract

We study *online episodic Constrained Markov Decision Processes* (CMDPs) under both stochastic and adversarial constraints. We provide a novel algorithm whose guarantees greatly improve those of the state-of-the-art best-of-both-worlds algorithm introduced by Stradi et al. (2025). In the stochastic regime, *i.e.*, when the constraints are sampled from fixed but unknown distributions, our method achieves $\widetilde{\mathcal{O}}(\sqrt{T})$ regret and constraint violation without relying on Slater's condition, thereby handling settings where no strictly feasible solution exists. Moreover, we provide guarantees on the stronger notion of *positive* constraint violation, under which large violations in the early episodes cannot be compensated for by later playing strictly safe policies. In the adversarial regime, *i.e.*, when the constraints may change arbitrarily between episodes, our algorithm ensures sublinear constraint violation without Slater's condition, and achieves sublinear $\alpha$-regret with respect to the *unconstrained* optimum, where $\alpha$ is a suitably defined multiplicative approximation factor. We further validate our results through synthetic experiments, showing the practical effectiveness of our algorithm.
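The difference between ordinary cumulative violation and *positive* violation can be illustrated with a minimal sketch. The function names and the per-episode margins below are hypothetical, chosen only for illustration; each margin stands for (constraint cost minus threshold) in one episode, and the paper's exact definitions may differ in detail.

```python
def cumulative_violation(margins):
    """Sum of signed per-episode margins: safe (negative) episodes offset violations."""
    return sum(margins)


def positive_violation(margins):
    """Sum of positive parts only: safe episodes cannot cancel earlier violations."""
    return sum(max(m, 0.0) for m in margins)


# Hypothetical run: four violating episodes followed by four strictly safe ones.
margins = [0.5, 0.5, 0.5, 0.5, -0.5, -0.5, -0.5, -0.5]

print(cumulative_violation(margins))  # 0.0
print(positive_violation(margins))    # 2.0
```

In this toy sequence the signed sum is zero even though half of the episodes violated the constraint, while the positive-part sum still records every violation. This is why a bound on positive violation is the stronger guarantee.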

Problem

Research questions and friction points this paper is trying to address.

Achieving sublinear regret without assuming Slater's condition

Handling both stochastic and adversarial constraint environments

Providing guarantees for positive constraint violation metrics

Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel algorithm improves best-of-both-worlds CMDP guarantees

Achieves sublinear regret and violation without Slater's condition

Handles both stochastic and adversarial constraint settings

🔎 Similar Papers

Learning Adversarial MDPs with Stochastic Hard Constraints