Learning Adversarial MDPs with Stochastic Hard Constraints

📅 2024-03-06
🏛️ arXiv.org
📈 Citations: 4
Influential: 0
🤖 AI Summary
This paper studies online learning in constrained Markov decision processes (CMDPs) with adversarial loss functions and stochastic hard constraints, under bandit feedback. Addressing non-stationary environments, it is the first work to jointly model adversarial objective losses and stochastic hard constraints. The authors propose three algorithms grounded in an optimistic, game-theoretic framework, leveraging Slater-condition analysis, confidence-region construction, and constraint-correction mechanisms. These methods overcome limitations of prior CMDP work, namely the reliance on soft constraints or stationarity assumptions, and achieve sublinear regret in three distinct settings. Notably, the second algorithm guarantees constraint satisfaction with high probability at every episode, while the third incurs only constant cumulative positive constraint violation. A lower-bound analysis further establishes that a dependence on Slater's parameter is unavoidable.

📝 Abstract
We study online learning in constrained Markov decision processes (CMDPs) with adversarial losses and stochastic hard constraints, under bandit feedback. We consider three scenarios. In the first one, we address general CMDPs, where we design an algorithm attaining sublinear regret and sublinear cumulative positive constraint violation. In the second scenario, under the mild assumption that a policy strictly satisfying the constraints exists and is known to the learner, we design an algorithm that achieves sublinear regret while ensuring that the constraints are satisfied at every episode with high probability. In the last scenario, we only assume the existence of a strictly feasible policy, which is not known to the learner, and we design an algorithm attaining sublinear regret and constant cumulative positive constraint violation. Finally, we show that in the last two scenarios, a dependence on Slater's parameter is unavoidable. To the best of our knowledge, our work is the first to study CMDPs involving both adversarial losses and hard constraints. Thus, our algorithms can deal with general non-stationary environments subject to requirements much stricter than those manageable with existing ones, enabling their adoption in a much wider range of applications.
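The two performance measures named in the abstract have standard definitions in this literature: regret compares the learner's cumulative loss to that of the best fixed feasible policy in hindsight, and cumulative positive constraint violation sums only the per-episode excess over the constraint threshold, so slack in one episode cannot cancel a violation in another. A minimal sketch of these metrics, with hypothetical numbers for illustration (not results from the paper):

```python
def regret(learner_losses, best_policy_losses):
    """Cumulative loss of the learner minus that of the best fixed
    feasible policy in hindsight, over the same T episodes."""
    return sum(learner_losses) - sum(best_policy_losses)

def cumulative_positive_violation(constraint_costs, thresholds):
    """Sum over episodes of how far each episode's constraint cost
    exceeds its threshold; negative slack is clipped to zero, so it
    never offsets a violation elsewhere."""
    return sum(max(c - b, 0.0) for c, b in zip(constraint_costs, thresholds))

# Hypothetical example over T = 4 episodes with threshold 1.0:
v = cumulative_positive_violation([1.2, 0.8, 1.5, 0.9], [1.0, 1.0, 1.0, 1.0])
r = regret([2.0, 3.0, 2.5, 1.5], [1.0, 2.0, 2.0, 1.0])
```

Sublinear regret and violation then mean that both quantities grow as o(T), so the per-episode average of each vanishes as T grows; "constant" violation in the third scenario means the sum stays bounded independently of T.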
Problem

Research questions and friction points this paper is trying to address.

Online learning in constrained MDPs
Adversarial losses with stochastic constraints
Sublinear regret and constraint violation control
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adversarial losses handling
Stochastic hard constraints
Sublinear regret algorithms