Trading off rewards and errors in multi-armed bandits

📅 2026-05-01

🤖 AI Summary
This work addresses the inherent tension in multi-armed bandits between maximizing cumulative reward and accurately estimating the mean rewards of individual arms. The study is the first to systematically characterize the fundamental trade-off between “exploration for learning” and “exploration for profit,” proposing a unified algorithmic framework that flexibly interpolates between these two objectives via an adjustable preference parameter. Theoretical analysis establishes matching upper and lower bounds, demonstrating that the proposed algorithm achieves an optimal balance between regret and estimation error. Empirical evaluations confirm that the method effectively reconciles reward accumulation with estimation accuracy across diverse preference settings.
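One way to make the two competing objectives concrete is sketched below. The notation is an assumption for illustration and is not taken from the paper: $\mu_i$ is the mean of arm $i$, $\mu^*$ the largest mean, $A_t$ the arm pulled at round $t$, $N_i(T)$ the number of pulls of arm $i$ after $T$ rounds, and $\hat\mu_{i,T}$ its empirical mean. The paper's own error measure may differ from the worst-arm error used here.

```latex
% Cumulative (pseudo-)regret: reward lost relative to always pulling the best arm.
R_T = T\mu^* - \mathbb{E}\Big[\sum_{t=1}^{T} \mu_{A_t}\Big]
    = \sum_{i} (\mu^* - \mu_i)\, \mathbb{E}[N_i(T)]

% An assumed estimation-error measure: worst-case absolute error over arms.
e_T = \max_{i} \, \mathbb{E}\big[\,|\hat\mu_{i,T} - \mu_i|\,\big]
```

The tension is visible in these two quantities: keeping $R_T$ small requires $N_i(T)$ to be small for every suboptimal arm, while for bounded rewards the error on arm $i$ typically shrinks only at a rate of order $1/\sqrt{N_i(T)}$, so keeping $e_T$ small requires every arm to be pulled often.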
📝 Abstract
In multi-armed bandits, an arm's mean is estimated most accurately when that arm is pulled often, whereas reward maximization concentrates pulls on the best arm. We study the tradeoff between estimating arm means accurately and accumulating reward, and present an algorithm with regret guarantees that interpolates between the two objectives. We provide both upper and lower bounds and validate the results empirically.
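The interpolation idea can be illustrated with a minimal simulation sketch. This is not the paper's algorithm: the function `run_bandit`, the weight `w`, and the mixing rule (with probability `w` pull the least-sampled arm, otherwise follow the standard UCB1 index) are hypothetical choices made only to show how a single parameter can move behavior between reward-focused and estimation-focused play.

```python
import math
import random


def run_bandit(means, horizon, w, seed=0):
    """Simulate a Bernoulli bandit under a hypothetical preference weight w.

    w = 0.0 always follows UCB1 (reward-focused).
    w = 1.0 always pulls the least-sampled arm (estimation-focused).
    Returns (pseudo_regret, mean_squared_error, pull_counts).
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k      # number of pulls per arm
    sums = [0.0] * k      # total observed reward per arm
    best = max(means)
    regret = 0.0

    for t in range(1, horizon + 1):
        if t <= k:
            # Pull every arm once so that all empirical means are defined.
            arm = t - 1
        elif rng.random() < w:
            # Estimation step: equalize sample counts across arms.
            arm = min(range(k), key=lambda i: counts[i])
        else:
            # Reward step: standard UCB1 index.
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )

        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        # Pseudo-regret uses the true means, which are known in simulation.
        regret += best - means[arm]

    mse = sum((sums[i] / counts[i] - means[i]) ** 2 for i in range(k)) / k
    return regret, mse, counts


if __name__ == "__main__":
    # Sweep the assumed preference weight and print both quantities.
    for w in (0.0, 0.25, 0.5, 1.0):
        regret, mse, counts = run_bandit([0.2, 0.5, 0.8], 3000, w)
        print(f"w={w:.2f}  regret={regret:.1f}  mse={mse:.5f}  pulls={counts}")
```

Sweeping `w` from 0 to 1 shifts pulls from the best arm toward an even allocation, so regret grows while the per-arm estimates are based on more balanced sample sizes. A fixed mixing probability is the simplest possible interpolation; it makes no claim to achieve the bounds described in the abstract.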
Problem

Research questions and friction points this paper is trying to address.

multi-armed bandits
reward maximization
arm identification
regret
exploration-exploitation tradeoff
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-armed bandits
reward-error tradeoff
regret bounds
arm identification
interpolation algorithm