Adapting to Stochastic and Adversarial Losses in Episodic MDPs with Aggregate Bandit Feedback

📅 2025-10-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper studies online learning of finite-horizon Markov decision processes (MDPs) under aggregate bandit feedback, giving a unified treatment of stochastic and adversarial environments. The proposed algorithm, built on the Follow-the-Regularized-Leader (FTRL) framework over occupancy measures, is the first to achieve theoretically optimal regret bounds in both settings: $O(\log T)$ in stochastic environments and $O(\sqrt{T})$ in adversarial ones when transitions are known; tight upper bounds persist even with unknown transitions. Key methodological innovations include self-bounding techniques, a novel unbiased loss estimator for aggregate feedback, and a confidence-based mechanism for handling unknown transitions. The work establishes the first matching upper and lower bounds for aggregate-feedback MDPs and provides the first gap-dependent lower bound for the shortest-path problem under this feedback model. Collectively, it achieves optimal performance across both stochastic and adversarial worlds.
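To make the FTRL-over-occupancy-measures idea concrete, here is a minimal sketch of FTRL with a negative-entropy regularizer in the simplest special case: a single decision over a probability simplex (i.e., a multi-armed bandit rather than a full occupancy-measure polytope), with importance-weighted loss estimates. The function name, learning rate, and setup are illustrative assumptions, not the paper's actual algorithm, which additionally handles episode structure, transitions, and aggregate feedback.

```python
import numpy as np

def ftrl_entropy(n_arms, T, losses, lr=0.1, rng=None):
    """FTRL with a negative-entropy regularizer over the simplex
    (this special case reduces to exponential weights).
    Bandit feedback: only the chosen arm's loss is observed, so an
    importance-weighted estimator keeps cumulative estimates unbiased."""
    rng = rng or np.random.default_rng(0)
    cum_est = np.zeros(n_arms)   # cumulative loss estimates L_hat
    total_loss = 0.0
    for t in range(T):
        # FTRL step: p = argmin_p <p, L_hat> + (1/lr) * neg_entropy(p),
        # whose closed form is the exponential-weights distribution.
        w = np.exp(-lr * (cum_est - cum_est.min()))  # shift for stability
        p = w / w.sum()
        a = rng.choice(n_arms, p=p)
        loss = losses[t][a]
        total_loss += loss
        cum_est[a] += loss / p[a]  # unbiased importance-weighted estimate
    return total_loss
```

On a two-arm instance where one arm always incurs loss 0, the learner's total loss stays far below $T$, reflecting the sublinear regret of this scheme.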

📝 Abstract
We study online learning in finite-horizon episodic Markov decision processes (MDPs) under the challenging aggregate bandit feedback model, where the learner observes only the cumulative loss incurred in each episode, rather than individual losses at each state-action pair. While prior work in this setting has focused exclusively on worst-case analysis, we initiate the study of best-of-both-worlds (BOBW) algorithms that achieve low regret in both stochastic and adversarial environments. We propose the first BOBW algorithms for episodic tabular MDPs with aggregate bandit feedback. In the case of known transitions, our algorithms achieve $O(\log T)$ regret in stochastic settings and $O(\sqrt{T})$ regret in adversarial ones. Importantly, we also establish matching lower bounds, showing the optimality of our algorithms in this setting. We further extend our approach to unknown-transition settings by incorporating confidence-based techniques. Our results rely on a combination of FTRL over occupancy measures, self-bounding techniques, and new loss estimators inspired by recent advances in online shortest path problems. Along the way, we also provide the first individual-gap-dependent lower bounds and demonstrate near-optimal BOBW algorithms for shortest path problems with bandit feedback.
Problem

Research questions and friction points this paper is trying to address.

Online learning in episodic MDPs where only aggregate (per-episode) bandit feedback is available
Designing a single algorithm that performs well in both stochastic and adversarial environments
Establishing optimal regret bounds under both known and unknown transitions
Innovation

Methods, ideas, or system contributions that make the work stand out.

First best-of-both-worlds (BOBW) algorithms for episodic MDPs with aggregate bandit feedback
FTRL over occupancy measures combined with self-bounding techniques
New unbiased loss estimators inspired by online shortest path problems
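The loss-estimation idea behind the last bullet can be illustrated by viewing aggregate feedback as a linear bandit: an episode's cumulative loss equals $\langle \phi, \ell \rangle$, where $\phi$ indicates which state-action pairs the trajectory visits, and a least-squares estimator $\hat{\ell} = \Sigma^{-1} \phi \cdot L$ with $\Sigma = \mathbb{E}[\phi \phi^\top]$ is unbiased for the full loss vector. The tiny instance below (three hypothetical trajectories, uniform sampling) is an illustrative assumption, not the paper's construction; it verifies unbiasedness by computing the estimator's expectation exactly.

```python
import numpy as np

# Three hypothetical trajectories over three state-action pairs,
# encoded as visit-indicator vectors phi.
phis = np.array([[1, 1, 0],
                 [0, 1, 1],
                 [1, 0, 1]], dtype=float)
p = np.full(3, 1 / 3)               # sampling distribution over trajectories
ell = np.array([0.2, 0.5, 0.3])     # true (hidden) per-pair losses

# Covariance of the visit indicators under p; must be invertible.
Sigma = sum(pi * np.outer(phi, phi) for pi, phi in zip(p, phis))
Sigma_inv = np.linalg.inv(Sigma)

# Expectation of ell_hat = Sigma^{-1} phi * (phi . ell), taken exactly
# over the three trajectories: Sigma^{-1} Sigma ell = ell.
est_mean = sum(pi * Sigma_inv @ phi * (phi @ ell) for pi, phi in zip(p, phis))
print(np.allclose(est_mean, ell))  # → True: the estimator is unbiased
```

The point of the construction is that the scalar aggregate loss, reweighted through $\Sigma^{-1}$, still pins down every per-pair loss in expectation, which is what makes FTRL applicable despite the coarse feedback.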