🤖 AI Summary
This paper addresses the challenge of automatically tuning critical hyperparameters online—such as the exploration-exploitation trade-off coefficient—in stochastic multi-armed bandits. The authors propose a meta-learning framework that leverages offline historical data: it transfers knowledge from cross-task offline datasets—collected under an unknown task distribution—to rapidly identify near-optimal hyperparameters for new tasks. The paper establishes, for the first time, unified sample complexity bounds applicable to both cross-task and single-task settings, covering UCB, LinUCB, GP-UCB, and related algorithms. The approach models distributional shift, employs empirical risk minimization, and conducts rigorous generalization error analysis to ensure a reliable mapping from offline statistics to online decision-making. Theoretical analysis guarantees convergence to near-optimal hyperparameters. Experiments on both synthetic and real-world datasets demonstrate significantly reduced cumulative regret as well as improved robustness and adaptation speed.
📝 Abstract
Classic algorithms for stochastic bandits typically use hyperparameters that govern their critical properties, such as the trade-off between exploration and exploitation. Tuning these hyperparameters is a problem of great practical significance. However, it is a challenging problem and in certain cases is information-theoretically impossible. To address this challenge, we consider a practically relevant transfer learning setting where one has access to offline data collected from several bandit problems (tasks) drawn from an unknown distribution over tasks. Our aim is to use this offline data to set the hyperparameters for a new task drawn from the same unknown distribution. We provide bounds on the inter-task (number of tasks) and intra-task (number of arm pulls per task) sample complexity for learning near-optimal hyperparameters on unseen tasks drawn from the distribution. Our results apply to several classic algorithms, including tuning the exploration parameters in UCB and LinUCB and the noise parameter in GP-UCB. Our experiments demonstrate the effectiveness of transferring hyperparameters from offline problems to online learning with stochastic bandit feedback.
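The core idea—selecting a hyperparameter by empirical risk minimization over offline tasks and deploying it on a fresh task from the same distribution—can be sketched as follows. This is a minimal illustration, not the paper's actual algorithm: the task distribution (Beta-distributed Bernoulli means), the candidate grid, and all function names here are hypothetical, and the offline data is simulated rather than taken from logged interactions.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_ucb(means, c, horizon, rng):
    """Run UCB with exploration coefficient c; return cumulative pseudo-regret."""
    k = len(means)
    counts = np.zeros(k)
    sums = np.zeros(k)
    regret = 0.0
    best = means.max()
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # initialization: pull each arm once
        else:
            ucb = sums / counts + c * np.sqrt(np.log(t) / counts)
            arm = int(np.argmax(ucb))
        reward = rng.binomial(1, means[arm])
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]
    return regret

def sample_task(rng, k=5):
    # hypothetical task distribution: Bernoulli arms with Beta(2, 2) means
    return rng.beta(2.0, 2.0, size=k)

# Offline phase: empirical risk minimization over a grid of candidate
# exploration coefficients, averaging regret across sampled offline tasks.
grid = [0.1, 0.5, 1.0, 2.0]
n_tasks, horizon = 30, 500
offline_tasks = [sample_task(rng) for _ in range(n_tasks)]
avg_regret = {c: np.mean([run_ucb(m, c, horizon, rng) for m in offline_tasks])
              for c in grid}
c_hat = min(avg_regret, key=avg_regret.get)

# Online phase: deploy the learned coefficient on a new task from the
# same (unknown) distribution.
new_task = sample_task(rng)
print("selected c:", c_hat, "regret on new task:", run_ucb(new_task, c_hat, horizon, rng))
```

The inter-task sample complexity in the paper corresponds to `n_tasks` and the intra-task complexity to `horizon`: with enough of both, the empirically best coefficient on the offline tasks is near-optimal in expectation on new tasks.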