🤖 AI Summary
This paper studies the multi-armed bandit problem under graph-structured feedback across learning contexts, addressing the long-standing open question of whether the dependence on the number of contexts can be removed from regret bounds. The authors propose a UCB-based framework that combines cross-context information aggregation, an exploration strategy adapted to the feedback graph's independence number α, and modeling of neighborhood feedback propagation. For the first time, they establish a tight minimax regret bound of Õ(√(αT)) under stochastic contexts, eliminating any dependence on the context cardinality. Notably, the bound is achieved even for adversarial losses, solving the strictly stronger adversarial-bandit version of the problem. The result matches the information-theoretic lower bound, yielding optimal guarantees for applications such as real-time ad auctions and dynamic pricing.
📝 Abstract
The cross-learning contextual bandit problem with graphical feedback has recently attracted significant attention. In this setting, there is a contextual bandit with a feedback graph over the arms, and pulling an arm reveals the loss of all neighboring arms in the feedback graph, across all contexts. Initially proposed by Han et al. (2024), this problem has broad applications in areas such as bidding in first-price auctions, and it explores a novel frontier in the feedback structure of bandit problems. A key theoretical question is whether an algorithm with $\widetilde{O}(\sqrt{\alpha T})$ regret exists, where $\alpha$ is the independence number of the feedback graph. This question is particularly interesting because it asks whether an algorithm can achieve a regret bound entirely independent of the number of contexts, matching the minimax regret of vanilla graphical bandits. Previous work has shown that such an algorithm is impossible for adversarial contexts, but the question has remained open for stochastic contexts. In this work, we answer this open question affirmatively by presenting an algorithm that achieves the minimax $\widetilde{O}(\sqrt{\alpha T})$ regret for cross-learning contextual bandits with graphical feedback and stochastic contexts. Notably, although the question was open even for stochastic losses, we directly solve the strictly stronger version of the problem with adversarial losses.
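The independence number $\alpha$ that governs the regret bound is the size of the largest set of arms with no feedback edges between them. As a minimal illustration (not part of the paper), the sketch below brute-forces $\alpha$ for a small feedback graph; the example 5-cycle graph and arm labels are hypothetical:

```python
from itertools import combinations

def independence_number(n, edges):
    """Brute-force the independence number alpha of an undirected graph
    on vertices 0..n-1: the size of the largest vertex set containing
    no edge. Exponential in n, so only suitable for tiny examples."""
    edge_set = {frozenset(e) for e in edges}
    for size in range(n, 0, -1):
        for subset in combinations(range(n), size):
            # An independent set has no edge between any pair of its vertices.
            if all(frozenset(p) not in edge_set
                   for p in combinations(subset, 2)):
                return size
    return 0

# Hypothetical feedback graph: a 5-cycle over arms, where pulling an arm
# also reveals the losses of its two neighbors (in every context).
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
print(independence_number(5, edges))  # prints 2, e.g. arms {0, 2}
```

The regret guarantee $\widetilde{O}(\sqrt{\alpha T})$ then scales with this quantity rather than with the number of arms or contexts; a denser feedback graph has a smaller $\alpha$ and hence lower regret.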