🤖 AI Summary
This paper studies the exploration-exploitation trade-off in multi-armed bandits through Information-Directed Sampling (IDS), a class of heuristic policies that balance immediate regret against information gain. It extends IDS to the discounted infinite-horizon setting by introducing a modified information measure and a tunable parameter that modulates the information-regret trade-off. Using two-state Bernoulli bandits as a minimal, tractable model, the authors compare IDS against the Bayesian optimal policy in two problem classes: with symmetric arms, IDS achieves bounded cumulative regret; with one fair coin, its regret grows logarithmically with the horizon, in agreement with classical asymptotic lower bounds. The analysis draws on information-theoretic tools (KL divergence, mutual information), Bayesian reinforcement learning (discounted MDPs, posterior updates), and statistical asymptotics. The work is framed as a pedagogical synthesis that connects reinforcement learning and information theory for an audience of statistical physicists.
📝 Abstract
The Multi-Armed Bandit problem provides a fundamental framework for analyzing the tension between exploration and exploitation in sequential learning. This paper explores Information Directed Sampling (IDS) policies, a class of heuristics that balance immediate regret against information gain. We focus on the tractable environment of two-state Bernoulli bandits as a minimal model to rigorously compare heuristic strategies against the optimal policy. We extend the IDS framework to the discounted infinite-horizon setting by introducing a modified information measure and a tuning parameter to modulate the decision-making behavior. We examine two specific problem classes: symmetric bandits and the scenario involving one fair coin. In the symmetric case we show that IDS achieves bounded cumulative regret, whereas in the one-fair-coin scenario the IDS policy yields a regret that scales logarithmically with the horizon, in agreement with classical asymptotic lower bounds. This work serves as a pedagogical synthesis, aiming to bridge concepts from reinforcement learning and information theory for an audience of statistical physicists.
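To make the regret-versus-information balance concrete, the following is a minimal Python sketch of one IDS decision step in a two-state Bernoulli bandit. It is an illustration under stated assumptions, not the paper's method: it uses the classical one-step information ratio (squared expected regret divided by the mutual information between the hidden state and the next reward), not the modified information measure or tuning parameter that the abstract mentions for the discounted setting. The function names, the `means` layout (rows are hidden states, columns are arms), and the example probabilities 0.7 and 0.3 are all hypothetical choices made for illustration.

```python
import numpy as np


def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli(p) variable, elementwise."""
    p = np.clip(p, 1e-12, 1 - 1e-12)  # avoid log(0)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))


def posterior_update(belief, means, arm, reward):
    """Bayes update of P(state) after one Bernoulli reward from `arm`."""
    likelihood = means[:, arm] if reward == 1 else 1.0 - means[:, arm]
    unnormalized = belief * likelihood
    return unnormalized / unnormalized.sum()


def expected_regret(belief, means):
    """Expected one-step regret of each arm under the current belief."""
    gaps = means.max(axis=1, keepdims=True) - means  # shape (states, arms)
    return belief @ gaps


def information_gain(belief, means):
    """Mutual information between the hidden state and each arm's reward."""
    marginal = belief @ means  # P(reward = 1) for each arm
    return binary_entropy(marginal) - belief @ binary_entropy(means)


def ids_arm_probabilities(belief, means, grid_size=1001):
    """Randomized two-arm policy minimizing regret^2 / gain on a grid.

    Assumption: classical one-step information ratio, searched by
    brute force over the probability q of playing arm 1.
    """
    delta = expected_regret(belief, means)
    gain = information_gain(belief, means)
    q = np.linspace(0.0, 1.0, grid_size)
    mixed_regret = (1 - q) * delta[0] + q * delta[1]
    mixed_gain = (1 - q) * gain[0] + q * gain[1]
    ratio = mixed_regret**2 / np.maximum(mixed_gain, 1e-12)
    best_q = q[np.argmin(ratio)]
    return np.array([1 - best_q, best_q])


# Hypothetical "one fair coin" instance: arm 0 is fair in both states,
# arm 1 pays with probability 0.7 in state 0 and 0.3 in state 1.
one_fair_coin = np.array([[0.5, 0.7],
                          [0.5, 0.3]])
uniform_belief = np.array([0.5, 0.5])
policy = ids_arm_probabilities(uniform_belief, one_fair_coin)
```

In this hypothetical instance, at a uniform belief both arms have the same expected regret, but the fair arm reveals nothing about the hidden state, so the sketch places all probability on the informative arm. A full simulation would alternate `ids_arm_probabilities`, sampling an arm and a reward, and `posterior_update`.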