On the Benefits of Free Exploration for Regret Minimization in Multi-Armed Bandits

📅 2026-05-25
🤖 AI Summary
This work studies how an initial budget of free exploration can be used to reduce cumulative regret in stochastic multi-armed bandits, a setting not captured by the classical paradigms of pure regret minimization or pure exploration. The authors formalize regret minimization with free exploration, focus on the regime where the free exploration budget scales logarithmically with the time horizon, and introduce the class of $(\alpha,\beta)$-probably saving policies to quantify the regret saved with high probability. They propose the two-phase UFE-KLUCB-H algorithm, which combines a principled free exploration policy (UFE) with a history-aware regret minimization policy (KLUCB-H). Instance-dependent upper bounds show that UFE-KLUCB-H accumulates strictly less regret than policies without access to a free exploration phase, and instance-dependent lower bounds, derived via multi-instance perturbation arguments, show that it is nearly optimal for two-valued bandits. Together the bounds reveal sharp phase transitions in regret as a function of the free exploration budget. Simulations indicate that forced exploration and adaptivity in the algorithm lead to greater regret savings.
📝 Abstract
We study a stochastic multi-armed bandit problem where an agent is granted a free exploration budget before regret accumulates, a setting not captured by the classic regret minimization or pure exploration paradigms. The goal is to design an adaptive policy that strategically explores the bandit instance in the initial free exploration phase and minimizes the cumulative regret in the subsequent phase. We formalize this regret minimization with free exploration problem and identify an interesting regime where the free exploration budget scales logarithmically with the time horizon. To quantify the amount of regret saved with high probability as a result of the availability of the free exploration phase, we introduce a novel set of policies known as $(α,β)$-probably saving policies. We propose a two-phase, probably saving algorithm, UFE-KLUCB-H, which consists of a principled free exploration policy, UFE, and a history-aware regret minimization policy KLUCB-H. Instance-dependent upper bounds on UFE-KLUCB-H are derived, showing that UFE-KLUCB-H accumulates strictly less regret than policies that do not have access to a free exploration phase. Complementarily, we derive instance-dependent lower bounds based on novel multi-instance perturbation arguments tailored to the free-exploration setting, demonstrating the near-optimality of UFE-KLUCB-H for two-valued bandits. Our upper and lower bounds reveal sharp phase transitions in the accumulated regret depending on the amount of available free exploration. Simulations are conducted to demonstrate that forced exploration and adaptivity in the algorithm lead to greater regret savings.
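The two-phase structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' UFE-KLUCB-H: the free phase here is plain round-robin sampling (a stand-in for UFE, whose details the abstract does not give), and the second phase is a standard Bernoulli KL-UCB index whose counts and means are initialized from the free-phase samples (a stand-in for the history-aware KLUCB-H). The function names, the bisection depth, and the choice of the total pull count as the confidence time index are all assumptions made for illustration.

```python
import math
import random


def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clamped for stability."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def klucb_index(mean, pulls, t, iters=30):
    """Largest q in [mean, 1] with pulls * KL(mean, q) <= log(t), found by bisection."""
    target = math.log(max(t, 2)) / pulls
    lo, hi = mean, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if kl_bernoulli(mean, mid) <= target:
            lo = mid
        else:
            hi = mid
    return lo


def run(means, horizon, free_budget, seed=0):
    """Simulate a free round-robin phase followed by KL-UCB that reuses its samples.

    Returns the pseudo-regret of the second phase and the per-arm pull counts.
    """
    rng = random.Random(seed)
    k = len(means)
    pulls = [0] * k
    sums = [0.0] * k

    def pull(arm):
        reward = 1.0 if rng.random() < means[arm] else 0.0
        pulls[arm] += 1
        sums[arm] += reward

    # Phase 1 (assumed stand-in for UFE): round-robin pulls, no regret is charged.
    for i in range(free_budget):
        pull(i % k)

    # Phase 2 (assumed stand-in for KLUCB-H): KL-UCB that keeps the phase-1 history.
    best = max(means)
    regret = 0.0
    for _ in range(horizon):
        unpulled = [a for a in range(k) if pulls[a] == 0]
        if unpulled:
            arm = unpulled[0]
        else:
            total = sum(pulls)  # assumption: time index counts free pulls too
            arm = max(
                range(k),
                key=lambda a: klucb_index(sums[a] / pulls[a], pulls[a], total),
            )
        pull(arm)
        regret += best - means[arm]
    return regret, pulls


if __name__ == "__main__":
    for budget in (0, 50):
        regret, pulls = run([0.4, 0.6], horizon=2000, free_budget=budget, seed=1)
        print(budget, round(regret, 2), pulls)
```

Running `run` with different `free_budget` values on the same instance gives a rough feel for how the free samples change second-phase behavior; a single seed is noisy, so any comparison should average over many seeds.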
Problem

Research questions and friction points this paper is trying to address.

multi-armed bandits
regret minimization
free exploration
stochastic bandits
exploration-exploitation tradeoff
Innovation

Methods, ideas, or system contributions that make the work stand out.

free exploration
regret minimization
multi-armed bandits
instance-dependent bounds
adaptive policy