🤖 AI Summary
This work addresses sequential multi-task multi-armed bandits, exploiting adjacency similarity between tasks, that is, a bound on the difference in mean rewards of the arms between any two consecutive tasks, to transfer reward samples across tasks and reduce cumulative regret. We propose the first UCB-based cross-task sample transfer mechanism, with adaptive algorithms for both the known and the unknown similarity-parameter settings, and prove upper bounds on total regret showing that transfer mitigates regret growth. Theoretical analysis establishes regret bounds that strictly improve upon standard UCB under the similarity assumption. Empirical evaluations show consistent improvement over standard UCB and a naive transfer baseline across diverse task sequences, validating the benefit of similarity-driven sample transfer in multi-task bandit learning.
📝 Abstract
We consider a sequential multi-task problem, where each task is modeled as a stochastic multi-armed bandit with K arms. We assume the bandit tasks are adjacently similar in the sense that the difference between the mean rewards of the arms for any two consecutive tasks is bounded by a parameter. We propose two UCB-based algorithms (one assuming the parameter is known, the other not) that transfer reward samples from preceding tasks to improve the overall regret across all tasks. Our analysis shows that transferring samples reduces the regret compared to the case of no transfer. Empirical results show that our algorithms outperform both the standard UCB algorithm without transfer and a naive transfer algorithm.
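To make the transfer idea concrete, here is a minimal sketch of one task of UCB with samples carried over from the preceding task. This is not the paper's exact algorithm: the index form, the bias-inflation term `eps * (transferred fraction)`, and the Bernoulli reward model are all illustrative assumptions. `eps` plays the role of the known similarity parameter bounding the per-arm mean shift between consecutive tasks.

```python
import math
import random

def ucb_with_transfer(arm_means, horizon, eps, prev_stats=None, seed=0):
    """Run one bandit task with UCB, optionally seeded with reward
    samples transferred from the preceding task.

    Illustrative sketch only: the index below adds the usual UCB
    exploration bonus plus an eps-weighted bias term accounting for
    the fact that transferred samples may be off by up to eps.
    """
    rng = random.Random(seed)
    K = len(arm_means)
    counts = [0] * K          # pulls of each arm in the current task
    sums = [0.0] * K          # reward sums in the current task
    # Transferred statistics (counts, reward sums) from the previous task.
    p_counts, p_sums = prev_stats if prev_stats else ([0] * K, [0.0] * K)
    best = max(arm_means)
    regret = 0.0

    for t in range(1, horizon + 1):
        idx = []
        for k in range(K):
            n = counts[k] + p_counts[k]
            if n == 0:
                idx.append(float("inf"))  # force one pull of unseen arms
                continue
            mean = (sums[k] + p_sums[k]) / n
            bonus = math.sqrt(2.0 * math.log(t) / n)
            # Transferred samples can be biased by up to eps, so inflate
            # the index by eps weighted by the transferred fraction.
            bias = eps * p_counts[k] / n
            idx.append(mean + bonus + bias)
        k = max(range(K), key=lambda a: idx[a])
        # Bernoulli rewards (an assumed reward model for this sketch).
        r = 1.0 if rng.random() < arm_means[k] else 0.0
        counts[k] += 1
        sums[k] += r
        regret += best - arm_means[k]
    return regret, (counts, sums)
```

A sequence of adjacently similar tasks would then chain the returned statistics into the next call, e.g. `_, stats = ucb_with_transfer(means1, T, eps)` followed by `ucb_with_transfer(means2, T, eps, prev_stats=stats)`; the unknown-parameter variant would additionally have to estimate `eps` online.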