Trading off rewards and errors in multi-armed bandits

📅 2026-05-01

📈 Citations: 0

✨ Influential: 0

career value

235K/year

🤖 AI Summary

This work addresses the inherent tension in multi-armed bandits between maximizing cumulative reward and accurately estimating the mean rewards of individual arms. The study is the first to systematically characterize the fundamental trade-off between “exploration for learning” and “exploration for profit,” proposing a unified algorithmic framework that flexibly interpolates between these two objectives via an adjustable preference parameter. Theoretical analysis establishes matching upper and lower bounds, demonstrating that the proposed algorithm achieves an optimal balance between regret and estimation error. Empirical evaluations confirm that the method effectively reconciles reward accumulation with estimation accuracy across diverse preference settings.

📝 Abstract

In multi-armed bandits, the most-explored arms are the most informative, while reward maximization typically pulls only the best arm. We study the tradeoff between identifying arm means accurately and accumulating reward, and present an algorithm with regret guarantees that interpolates between the two objectives. We provide both upper and lower bounds and validate empirically.

Problem

Research questions and friction points this paper is trying to address.

multi-armed bandits

reward maximization

arm identification

regret

exploration-exploitation tradeoff

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-armed bandits

reward-error tradeoff

regret bounds