Multi-Armed Bandits With Best-Action Queries

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

This study addresses the multi-armed bandit problem augmented with an oracle that, in up to k rounds, reveals the best action for the current round. It asks whether this query capability still reduces regret under standard bandit feedback, where only the reward of the chosen action is observed. By combining information-theoretic lower bounds, stochastic process analysis, and adaptive algorithm design, the work gives a complete characterization of the value of best-action queries and shows that correlated or adversarial environments behave fundamentally differently from i.i.d. ones. The main contributions are a regret lower bound of Ω(√(T−k)) in correlated stochastic and adversarial environments, and an achievable regret of Õ(min{T/k, √(T−k)}) in the i.i.d. case with a matching lower bound up to logarithmic factors. Together these quantify how the number of queries k governs learning performance.
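To make the two rates quoted above concrete, the following minimal sketch evaluates them numerically. It ignores constants and logarithmic factors, so the values indicate orders of magnitude only; the function names `correlated_lower_rate` and `iid_rate` are hypothetical and do not come from the paper.

```python
import math


def correlated_lower_rate(T: int, k: int) -> float:
    """sqrt(T - k): the lower-bound rate for correlated or adversarial rewards."""
    return math.sqrt(T - k)


def iid_rate(T: int, k: int) -> float:
    """min{T/k, sqrt(T - k)}: the rate for i.i.d. rewards (log factors dropped)."""
    if k == 0:
        # With no queries, T/k is undefined and only the sqrt term applies.
        return math.sqrt(T)
    return min(T / k, math.sqrt(T - k))


if __name__ == "__main__":
    T = 10_000
    for k in (0, 10, 100, 1_000, 5_000):
        print(k, round(correlated_lower_rate(T, k), 2), round(iid_rate(T, k), 2))
```

Under these simplifications, the T/k term only becomes the smaller one once k exceeds roughly √T, which is the regime where queries change the i.i.d. rate.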

📝 Abstract

We study \emph{multi-armed bandits} (MABs) augmented with \emph{best-action queries}, in which the learner may additionally query an oracle that reveals the best arm in the current round. This setting was recently characterized by Russo et al. [2024] in the \emph{full-feedback} model, where the learner observes the rewards of all arms after each round. They show that, in both \emph{stochastic} and \emph{adversarial} environments, $k$ best-action queries reduce the optimal $\widetilde{\mathcal{O}}(\sqrt{T})$ regret to $\widetilde{\mathcal{O}}(\min\{T/k,\sqrt{T}\})$. Whether this improvement extends to the more realistic \emph{bandit-feedback} model -- where the learner observes only the reward of the played arm -- was left as an open problem. We fully resolve this question. When rewards are stochastic but correlated among arms, we show that the full-feedback result does not carry over: any algorithm must incur regret at least $\Omega(\sqrt{T-k})$. This lower bound directly extends to adversarial environments. On the positive side, we show that $\widetilde{\mathcal{O}}(\min\{T/k,\sqrt{T-k}\})$ regret is still achievable when rewards are stochastic and i.i.d., and establish a matching lower bound, up to logarithmic factors. Together, these results provide a complete characterization of the benefits of \emph{best-action queries} in the \emph{bandit-feedback} model.
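The interaction protocol described in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions: rewards are independent Bernoulli draws, the learner naively spends its k queries in the first k rounds, and it plays UCB1 afterwards. This strategy is a placeholder chosen for simplicity and is not the algorithm analyzed in the paper; `simulate` and its parameters are hypothetical names.

```python
import math
import random


def simulate(means, T, k, seed=0):
    """Run one episode; return (regret vs. best fixed arm in expectation, queries used)."""
    rng = random.Random(seed)
    n = len(means)
    counts = [0] * n   # number of bandit-feedback pulls per arm
    sums = [0.0] * n   # sum of observed rewards per arm
    total = 0.0
    queries_used = 0
    for t in range(T):
        # Rewards of all arms this round; hidden from the learner.
        rewards = [1.0 if rng.random() < m else 0.0 for m in means]
        if queries_used < k:
            # Best-action query: the oracle reveals the best arm of this round.
            arm = max(range(n), key=lambda a: rewards[a])
            queries_used += 1
            total += rewards[arm]
            # Assumption: skip the statistics update here, since the reward of
            # an oracle-chosen arm is a biased sample of that arm's mean.
            continue
        untried = [a for a in range(n) if counts[a] == 0]
        if untried:
            arm = untried[0]
        else:
            # UCB1 index: empirical mean plus exploration bonus.
            arm = max(
                range(n),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2 * math.log(t + 1) / counts[a]),
            )
        r = rewards[arm]  # bandit feedback: only the played arm's reward is seen
        counts[arm] += 1
        sums[arm] += r
        total += r
    return T * max(means) - total, queries_used


if __name__ == "__main__":
    for k in (0, 50, 500):
        print(k, simulate([0.3, 0.5, 0.7], T=2000, k=k, seed=1))
```

Because a query returns the arm with the highest realized reward in that round, the measured regret against the best fixed arm can be negative in this sketch. A single run is also noisy, so comparing values of k meaningfully would require averaging over many seeds.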

Problem

Research questions and friction points this paper is trying to address.

multi-armed bandits

best-action queries

bandit feedback

regret minimization

stochastic rewards

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-armed bandits

best-action queries

bandit feedback