Best-of-Both-Worlds Policy Optimization for CMDPs with Bandit Feedback

📅 2024-10-03
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper studies online learning in constrained Markov decision processes (CMDPs) where rewards and constraints may each be stochastic or adversarial, under the more practical bandit-feedback setting. Existing optimal algorithms rely on full feedback and on inefficient convex optimization over occupancy measures. To address this, the authors propose the first best-of-both-worlds policy optimization algorithm for bandit feedback. Their method abandons occupancy-measure modeling and instead adopts an efficient policy-gradient framework coupled with an adaptive constraint-handling mechanism. Under stochastic constraints, it achieves $\tilde{O}(\sqrt{T})$ regret and constraint violation; under adversarial constraints, it attains $\tilde{O}(\sqrt{T})$ constraint violation and near-optimal reward performance. This is the first algorithm for CMDPs that achieves optimal learning in mixed stochastic-adversarial environments using only bandit feedback, significantly improving both computational efficiency and practical applicability.
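The primal-dual flavor of policy optimization with constraints can be illustrated with a minimal sketch. This is not the paper's algorithm: it uses a one-state constrained bandit as a stand-in for a CMDP, and all problem sizes, step sizes, and reward/cost values below are illustrative assumptions.

```python
import numpy as np

# Minimal primal-dual policy-gradient sketch for a constrained bandit
# (a one-state stand-in for a CMDP with bandit feedback). All names and
# constants are illustrative assumptions, not the paper's algorithm.

rng = np.random.default_rng(0)
n_actions, T = 3, 5000
eta, eta_dual = 0.05, 0.05
threshold = 0.5                        # constraint: expected cost <= threshold

true_reward = np.array([0.2, 0.5, 0.8])
true_cost = np.array([0.1, 0.4, 0.9])  # highest-reward action violates the constraint

theta = np.zeros(n_actions)            # softmax policy parameters
lam = 0.0                              # Lagrange multiplier for the constraint

for t in range(T):
    probs = np.exp(theta - theta.max())
    probs /= probs.sum()
    a = rng.choice(n_actions, p=probs)

    # Bandit feedback: reward and cost are observed only for the chosen action.
    r = true_reward[a] + 0.1 * rng.standard_normal()
    c = true_cost[a] + 0.1 * rng.standard_normal()

    # REINFORCE step on the Lagrangian r - lam * c
    # (score-function gradient of the softmax policy).
    adv = r - lam * c
    grad = -probs * adv
    grad[a] += adv
    theta += eta * grad

    # Dual ascent: lam grows while the observed cost exceeds the threshold,
    # adaptively penalizing constraint-violating actions.
    lam = max(0.0, lam + eta_dual * (c - threshold))

probs = np.exp(theta - theta.max())
probs /= probs.sum()
```

The dual step is the crude analogue of the paper's adaptive constraint handling: the multiplier `lam` tracks accumulated violation, steering the primal policy-gradient updates back toward feasible policies without ever modeling occupancy measures.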

📝 Abstract
We study online learning in constrained Markov decision processes (CMDPs) in which rewards and constraints may be either stochastic or adversarial. In such settings, Stradi et al. (2024) proposed the first best-of-both-worlds algorithm able to seamlessly handle stochastic and adversarial constraints, achieving optimal regret and constraint violation bounds in both cases. This algorithm suffers from two major drawbacks. First, it only works under full feedback, which severely limits its applicability in practice. Moreover, it relies on optimizing over the space of occupancy measures, which requires solving convex optimization problems, a highly inefficient task. In this paper, we provide the first best-of-both-worlds algorithm for CMDPs with bandit feedback. Specifically, when the constraints are stochastic, the algorithm achieves $\widetilde{\mathcal{O}}(\sqrt{T})$ regret and constraint violation, while, when they are adversarial, it attains $\widetilde{\mathcal{O}}(\sqrt{T})$ constraint violation and a tight fraction of the optimal reward. Moreover, our algorithm is based on a policy optimization approach, which is much more efficient than occupancy-measure-based methods.
Problem

Research questions and friction points this paper is trying to address.

online learning in CMDPs
best-of-both-worlds algorithm
bandit feedback handling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Best-of-both-worlds algorithm
Bandit feedback optimization
Policy optimization approach