🤖 AI Summary
This paper studies online episodic constrained Markov decision processes (CMDPs) under **mixed stochastic and adversarial constraints**, relaxing the conventional Slater condition (i.e., existence of a strictly feasible policy). We propose an **online algorithm based on mirror descent with dual updates at two time scales**, integrating stochastic gradient estimation and adversarial robustness control. Theoretically, without assuming the Slater condition, our algorithm simultaneously achieves $O(sqrt{T})$ **cumulative regret** and **constraint violation bound**. For adversarial constraints, we further introduce a *strong violation* metric and establish sublinear $alpha$-regret and constraint violation guarantees. Synthetic experiments demonstrate that the algorithm converges rapidly and maintains robustness even when initialized in an infeasible region.
📝 Abstract
We study emph{online episodic Constrained Markov Decision Processes} (CMDPs) under both stochastic and adversarial constraints. We provide a novel algorithm whose guarantees greatly improve those of the state-of-the-art best-of-both-worlds algorithm introduced by Stradi et al. (2025). In the stochastic regime, emph{i.e.}, when the constraints are sampled from fixed but unknown distributions, our method achieves $widetilde{mathcal{O}}(sqrt{T})$ regret and constraint violation without relying on Slater's condition, thereby handling settings where no strictly feasible solution exists. Moreover, we provide guarantees on the stronger notion of emph{positive} constraint violation, which does not allow to recover from large violation in the early episodes by playing strictly safe policies. In the adversarial regime, emph{i.e.}, when the constraints may change arbitrarily between episodes, our algorithm ensures sublinear constraint violation without Slater's condition, and achieves sublinear $α$-regret with respect to the emph{unconstrained} optimum, where $α$ is a suitably defined multiplicative approximation factor. We further validate our results through synthetic experiments, showing the practical effectiveness of our algorithm.