Learning Markov Decision Processes under Fully Bandit Feedback

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses policy optimization in finite-horizon Markov decision processes (MDPs) under the full-bandit feedback setting, where the agent observes only the total reward of each episode, without access to the visited state-action trajectory. The paper proposes the first provably efficient learning algorithm for this setting, built on a regret minimization framework that combines value function estimation with confidence interval techniques to enable effective policy learning despite the absence of fine-grained feedback. The theoretical analysis shows that an exponential dependence on the horizon length $H$ in the regret bound is unavoidable for general MDPs. For structured (specifically, "ordered") MDPs, however, the algorithm achieves a near-optimal regret bound of $\widetilde{O}(\sqrt{T})$. In experiments on $k$-item prophet inequality tasks, its performance is comparable to that of UCB-VI, which assumes full state-action feedback.

📝 Abstract
A standard assumption in Reinforcement Learning is that the agent observes every visited state-action pair in the associated Markov Decision Process (MDP), along with the per-step rewards. Strong theoretical results are known in this setting, achieving nearly-tight $\Theta(\sqrt{T})$-regret bounds. However, such detailed feedback can be unrealistic, and recent research has investigated more restricted settings such as trajectory feedback, where the agent observes all the visited state-action pairs, but only a single \emph{aggregate} reward. In this paper, we consider a far more restrictive ``fully bandit'' feedback model for episodic MDPs, where the agent does not even observe the visited state-action pairs -- it only learns the aggregate reward. We provide the first efficient bandit learning algorithm for episodic MDPs with $\widetilde{O}(\sqrt{T})$ regret. Our regret has an exponential dependence on the horizon length $H$, which we show is necessary. We also obtain improved nearly-tight regret bounds for ``ordered'' MDPs; these can be used to model classical stochastic optimization problems such as $k$-item prophet inequality and sequential posted pricing. Finally, we evaluate the empirical performance of our algorithm for the setting of $k$-item prophet inequalities; despite the highly restricted feedback, our algorithm's performance is comparable to that of a state-of-the-art learning algorithm (UCB-VI) with detailed state-action feedback.
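To make the feedback model concrete, here is a minimal sketch (not the paper's algorithm) of the naive baseline that full-bandit feedback invites: treat every deterministic policy as a single "arm" and run standard UCB1 on the aggregate episode rewards. Since the number of deterministic policies grows like $|A|^{S \cdot H}$, this view also gives intuition for why an exponential dependence on the horizon $H$ is hard to avoid in general MDPs. The names `ucb_over_policies` and `run_episode` are illustrative, not from the paper.

```python
import math
import random


def ucb_over_policies(policies, run_episode, num_episodes):
    """Naive full-bandit baseline: UCB1 over policies-as-arms.

    policies:     list of opaque policy objects (one arm per policy).
    run_episode:  callable mapping a policy to its total episode reward in [0, 1];
                  this is the ONLY feedback available -- no state-action trajectory.
    Returns the policy with the highest empirical mean reward.
    """
    n = len(policies)
    counts = [0] * n
    sums = [0.0] * n
    for t in range(1, num_episodes + 1):
        if t <= n:
            i = t - 1  # play each arm once first
        else:
            # UCB1 index: empirical mean + exploration bonus
            i = max(range(n),
                    key=lambda j: sums[j] / counts[j]
                    + math.sqrt(2.0 * math.log(t) / counts[j]))
        reward = run_episode(policies[i])  # aggregate episode reward only
        counts[i] += 1
        sums[i] += reward
    return policies[max(range(n), key=lambda j: sums[j] / counts[j])]
```

The sketch is sound as a bandit algorithm but pays regret scaling with the number of arms, i.e., exponentially in $H$; the paper's contribution is achieving $\widetilde{O}(\sqrt{T})$ regret without enumerating policies, and nearly-tight bounds in the ordered-MDP case.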
Problem

Research questions and friction points this paper is trying to address.

Markov Decision Processes
Fully Bandit Feedback
Reinforcement Learning
Regret Minimization
Episodic MDPs
Innovation

Methods, ideas, or system contributions that make the work stand out.

fully bandit feedback
episodic MDPs
regret minimization
ordered MDPs
reinforcement learning