Nearly Tight Bounds for Cross-Learning Contextual Bandits with Graphical Feedback

📅 2025-02-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper studies the multi-armed bandit problem with graph-structured feedback shared across learning contexts, addressing the long-standing open question of removing the dependence on the number of contexts from regret bounds. The authors propose a UCB-based framework that combines cross-context information aggregation, an adaptive exploration strategy parameterized by the graph's independence number α, and neighborhood feedback propagation. They establish, for the first time, a tight minimax regret bound of Õ(√(αT)) under stochastic contexts, eliminating the dependence on context cardinality entirely. Notably, the bound holds even for adversarial losses (prior work showed it is unattainable under adversarial contexts), matches the information-theoretic lower bound, and yields optimal guarantees for applications such as real-time ad auctions and dynamic pricing.
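As a minimal sketch of the kind of confidence index a UCB-style framework like this might maintain: the snippet below uses a generic Hoeffding-style confidence width and assumed names, and is an illustration only, not the paper's actual algorithm.

```python
import math

def lcb_index(mean_loss, n_obs, t, delta=0.05):
    """Lower confidence bound on an arm's loss; the learner pulls the
    arm minimizing this index. Hoeffding-style width (a generic,
    assumed choice, not the paper's exploration schedule)."""
    if n_obs == 0:
        return float("-inf")  # force an initial observation of each arm
    width = math.sqrt(2.0 * math.log(t / delta) / n_obs)
    return mean_loss - width

# Cross-context aggregation: because one pull reveals the losses of a
# whole graph neighborhood in every context, n_obs grows even for arms
# never pulled in the current context, shrinking the confidence width
# without extra exploration -- the mechanism behind regret bounds that
# do not depend on the number of contexts.
assert lcb_index(0.4, 100, t=100) > lcb_index(0.4, 25, t=100)
```

More observations yield a tighter (larger) lower confidence bound, so well-observed arms are judged on their empirical means while under-observed arms stay attractive to explore.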

📝 Abstract
The cross-learning contextual bandit problem with graphical feedback has recently attracted significant attention. In this setting, a contextual bandit is equipped with a feedback graph over the arms, and pulling an arm reveals the losses of all neighboring arms in the feedback graph across all contexts. Initially proposed by Han et al. (2024), this problem has broad applications in areas such as bidding in first-price auctions and explores a novel frontier in the feedback structure of bandit problems. A key theoretical question is whether an algorithm with $\widetilde{O}(\sqrt{\alpha T})$ regret exists, where $\alpha$ denotes the independence number of the feedback graph. This question is particularly interesting because it asks whether an algorithm can achieve a regret bound entirely independent of the number of contexts, matching the minimax regret of vanilla graphical bandits. Previous work has shown that such an algorithm is impossible for adversarial contexts, but the question remained open for stochastic contexts. In this work, we answer this open question affirmatively by presenting an algorithm that achieves the minimax $\widetilde{O}(\sqrt{\alpha T})$ regret for cross-learning contextual bandits with graphical feedback and stochastic contexts. Notably, although the question is open even for stochastic losses, we directly solve the strictly stronger adversarial-loss version of the problem.
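The feedback structure described in the abstract, where one pull reveals the losses of the pulled arm's whole graph neighborhood in every context, can be sketched with a toy environment. The graph, loss table, and function names below are illustrative assumptions, not the paper's construction.

```python
# Toy environment for cross-learning bandits with graphical feedback.
# Pulling arm `a` in the current context reveals the losses of `a` and
# its graph neighbors in EVERY context, which is the structure that
# makes regret independent of the number of contexts possible.

def neighbors(graph, a):
    """Arms whose loss is observed when `a` is pulled (including `a`)."""
    return {a} | graph.get(a, set())

def play_round(graph, losses, arm):
    """Return all (context, arm, loss) triples revealed by one pull."""
    revealed = []
    for c in losses:                      # cross-learning: all contexts
        for b in neighbors(graph, arm):   # graphical feedback: neighborhood
            revealed.append((c, b, losses[c][b]))
    return revealed

# Tiny example: 3 arms on a path graph 0-1-2, and 2 contexts.
graph = {0: {1}, 1: {0, 2}, 2: {1}}
losses = {"ctx_a": [0.2, 0.5, 0.9], "ctx_b": [0.7, 0.1, 0.4]}

obs = play_round(graph, losses, arm=1)
# Pulling arm 1 reveals arms {0, 1, 2} in both contexts: 6 observations.
print(len(obs))  # → 6
```

Note that the observations include contexts other than the one the learner is currently in, so estimates for every context improve on every round.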
Problem

Research questions and friction points this paper is trying to address.

Achieving minimax regret for cross-learning contextual bandits
Exploiting graphical feedback under stochastic contexts
Removing the dependence on the number of contexts from regret bounds
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-learning contextual bandits
Graphical feedback utilization
Minimax regret algorithm