Finite-Time Regret Analysis of Retry-Aware Bandits

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the previously unclear regret properties of ReMax, a bandit algorithm motivated by retry-aware objectives such as pass@$k$ and max@$k$. Focusing on Gaussian rewards and the first nontrivial case of $M=2$ virtual draws, the study characterizes the optimal sampling distribution via an expected-improvement balance condition. Its finite-time regret analysis separates two effects: the usual saturation of suboptimal arms and a ReMax-specific underestimation of the optimal arm. The paper establishes the first sublinear regret bound for ReMax, explains why ReMax can be more exploitative than Thompson sampling, and highlights the distinctive role of underestimation in the exploration–exploitation trade-off. In experiments, ReMax often outperforms KL-UCB and Thompson sampling under mild underestimation, and posterior-variance scaling empirically mitigates severe underestimation.

📝 Abstract

We study a stochastic bandit algorithm motivated by retry-aware objectives that value the best outcome among multiple attempts, such as pass@$k$ and max@$k$. Given a posterior over arm values, ReMax chooses a sampling distribution that maximizes the posterior expected maximum reward over $M$ virtual draws. Although this objective was introduced in reinforcement learning as an exploration mechanism under uncertainty, its regret properties in bandit problems have remained unclear. For Gaussian rewards and the first nontrivial case $M=2$, we characterize the optimal ReMax distribution through an expected-improvement balance condition and prove the first sublinear regret bound for ReMax. Our analysis separates the usual saturation behavior of suboptimal arms from a ReMax-specific underestimation effect, in which the optimal arm may be sampled too rarely after an unfavorable estimate. This explains why ReMax can be more exploitative than Thompson sampling (TS) and why its regret analysis is technically delicate. Experiments support this picture: ReMax often outperforms KL-UCB and Thompson sampling under mild underestimation, while posterior-variance scaling empirically mitigates severe underestimation.
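To make the $M=2$ objective concrete, here is a minimal Python sketch under assumptions that are not taken from the paper: two arms only, independent Gaussian posteriors $\theta_i \sim N(\mu_i, \sigma_i^2)$ with at least one $\sigma_i > 0$, and two virtual draws that each pick an arm i.i.d. from the sampling distribution, scored by the larger of the drawn arm values. Under that reading the objective is a concave quadratic in $p_0$, and its maximizer assigns each arm a probability proportional to its expected improvement over the other arm. All function names are hypothetical, and the paper's exact formulation may differ.

```python
import math


def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def expected_improvement(mu_a, sigma_a, mu_b, sigma_b):
    """E[max(theta_a - theta_b, 0)] for independent Gaussian posteriors."""
    s = math.hypot(sigma_a, sigma_b)  # std of theta_a - theta_b, assumed > 0
    d = (mu_a - mu_b) / s
    return s * (d * normal_cdf(d) + normal_pdf(d))


def m2_objective(p0, mu, sigma):
    """Assumed objective: posterior expected max over two virtual draws.

    Drawing the same arm twice yields that arm's value, so the diagonal
    terms are the posterior means; the cross term is E[max(theta_0, theta_1)].
    """
    cross = mu[1] + expected_improvement(mu[0], sigma[0], mu[1], sigma[1])
    p1 = 1.0 - p0
    return p0 * p0 * mu[0] + 2.0 * p0 * p1 * cross + p1 * p1 * mu[1]


def remax_m2_two_arms(mu, sigma):
    """Maximizer of m2_objective over p0 in [0, 1] (closed form)."""
    ei0 = expected_improvement(mu[0], sigma[0], mu[1], sigma[1])
    ei1 = expected_improvement(mu[1], sigma[1], mu[0], sigma[0])
    return ei0 / (ei0 + ei1)


def thompson_prob(mu, sigma):
    """P(theta_0 > theta_1): probability that Thompson sampling picks arm 0."""
    s = math.hypot(sigma[0], sigma[1])
    return normal_cdf((mu[0] - mu[1]) / s)


if __name__ == "__main__":
    mu, sigma = [1.0, 0.0], [math.sqrt(0.5), math.sqrt(0.5)]
    print("ReMax-style p0:", remax_m2_two_arms(mu, sigma))
    print("Thompson p0:   ", thompson_prob(mu, sigma))
```

In this toy setting, with posterior means one standard deviation of the difference apart, the sketch puts roughly 0.93 of the probability on the leading arm, compared with about 0.84 for Thompson sampling. That is consistent with the claim that ReMax can be more exploitative, though it illustrates only the assumed two-arm objective, not the paper's analysis.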

Problem

Research questions and friction points this paper is trying to address.

finite-time regret

retry-aware bandits

ReMax

stochastic bandits

multi-attempt optimization

Innovation

Methods, ideas, or system contributions that make the work stand out.

ReMax

finite-time regret

expected-improvement balance