AI Summary
This paper studies online finite-horizon MDPs under adversarial environments with aggregate feedback, where only the total loss over each trajectory is observed. Addressing the challenge that intermediate-state losses are inaccessible under full-bandit feedback, we propose the first policy-optimization-based online learning framework for this setting, integrating gradient estimation, loss reconstruction, and confidence-region updates. Under known transition dynamics, our algorithm achieves the optimal regret bound $\widetilde{\Theta}(H^2\sqrt{SAK})$; under unknown dynamics, it attains $\widetilde{O}(H^3 S \sqrt{AK})$, improving upon the previous best bound by a factor of $H^2 S^5 A^2$. Our key contribution is the first theoretical foundation for policy optimization in aggregate-feedback MDPs, significantly advancing both regret upper bounds and algorithmic practicality.
Abstract
We study online finite-horizon Markov Decision Processes with adversarially changing losses and aggregate bandit feedback (a.k.a. full-bandit feedback). Under this type of feedback, the agent observes only the total loss incurred over the entire trajectory, rather than the individual losses at each intermediate step within the trajectory. We introduce the first Policy Optimization algorithms for this setting. In the known-dynamics case, we achieve the first \textit{optimal} regret bound of $\tilde{\Theta}(H^2\sqrt{SAK})$, where $K$ is the number of episodes, $H$ is the episode horizon, $S$ is the number of states, and $A$ is the number of actions. In the unknown-dynamics case, we establish a regret bound of $\tilde{O}(H^3 S \sqrt{AK})$, significantly improving the best known result by a factor of $H^2 S^5 A^2$.
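For concreteness, the regret quantity these bounds control can be written in the standard adversarial-MDP form (the notation below is a conventional sketch, not taken verbatim from the paper):

```latex
\mathrm{Reg}_K \;=\; \sum_{k=1}^{K} \ell_k(\pi_k) \;-\; \min_{\pi} \sum_{k=1}^{K} \ell_k(\pi),
```

where $\ell_k(\pi)$ denotes the expected total loss of policy $\pi$ under the adversarially chosen episode-$k$ loss function, and $\pi_k$ is the learner's policy in episode $k$. The aggregate-bandit constraint means the learner observes only the realized sum of losses along its trajectory in episode $k$, never the per-step loss values.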