🤖 AI Summary
This work addresses adversarial learning in zero-sum games with bandit feedback, where a learner faces an unknown opponent and observes only limited feedback. Since the classical Ω(√T) external regret lower bound is unavoidable in this setting, the paper instead targets pure-strategy maximin regret. Building on Follow-the-Regularized-Leader (FTRL) with Tsallis entropy regularization and on Upper Confidence Bound (UCB) techniques with linear reward modeling, it analyzes Tsallis-INF under the uninformed feedback model and introduces Maximin-UCB for the informed model. These algorithms achieve instance-dependent logarithmic regret bounds of O(c log T) and O(c' log T), respectively, where the game-dependent parameter c' can be much smaller than c. The paper further establishes an information-theoretic lower bound showing that the dependence on c is unavoidable, and extends both results to bilinear games over large action sets.
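To make the performance metric concrete, here is a minimal sketch of the pure-strategy maximin value and the regret notion named above. The paper's formal definition is not reproduced in this summary, so the functions below (`pure_maximin`, `maximin_regret`) are our own illustrative reading: the learner's cumulative payoff is compared against T times the value guaranteed by the best single pure strategy.

```python
import numpy as np

def pure_maximin(A):
    """Pure-strategy maximin value of a zero-sum game.

    A[i, j] is the row player's payoff when the learner plays row i
    and the opponent plays column j. The maximin pure strategy is the
    row that maximizes the worst-case (over columns) payoff.
    """
    worst_case = A.min(axis=1)         # guaranteed payoff of each row
    i_star = int(worst_case.argmax())  # maximin pure strategy
    return i_star, float(worst_case[i_star])

def maximin_regret(A, rows_played, cols_played):
    """Deficit of the collected payoff against the maximin benchmark
    (an illustrative reading of pure-strategy maximin regret)."""
    _, v = pure_maximin(A)
    earned = sum(A[i, j] for i, j in zip(rows_played, cols_played))
    return len(rows_played) * v - earned
```

For instance, in the game `[[1, 0], [0.5, 0.5]]` the maximin pure strategy is row 1 (guaranteed payoff 0.5), and a learner who always plays it incurs zero regret regardless of the opponent's columns.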
📝 Abstract
Learning to play zero-sum games is a fundamental problem in game theory and machine learning. While significant progress has been made in minimizing external regret in self-play settings or with full-information feedback, real-world applications often force learners to play against unknown, arbitrary opponents and restrict learners to bandit feedback, where only the payoff of the realized action is observable. In such challenging settings, it is well known that $\Omega(\sqrt{T})$ external regret is unavoidable (where $T$ is the number of rounds). To overcome this barrier, we investigate adversarial learning in zero-sum games under bandit feedback, aiming to minimize the deficit against the maximin pure strategy -- a metric we term Pure-Strategy Maximin Regret. We analyze this problem under two bandit feedback models: uninformed (only the realized reward is revealed) and informed (both the reward and the opponent's action are revealed). For uninformed bandit learning of normal-form games, we show that the Tsallis-INF algorithm achieves $O(c \log T)$ instance-dependent regret with a game-dependent parameter $c$. Crucially, we prove an information-theoretic lower bound showing that the dependence on $c$ is necessary. To overcome this hardness, we turn to the informed setting and introduce Maximin-UCB, which obtains a regret bound of the form $O(c'\log T)$ for a different game-dependent parameter $c'$ that could potentially be much smaller than $c$. Finally, we generalize both results to bilinear games over an arbitrary, large action set, proposing Tsallis-FTRL-SPM and Maximin-LinUCB for the uninformed and informed settings, respectively, and establishing similar game-dependent logarithmic regret bounds.
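Since the abstract does not spell out the Tsallis-INF update, the following is a minimal sketch of the standard Tsallis-INF algorithm for adversarial bandits (FTRL with a Tsallis entropy regularizer at α = 1/2 and importance-weighted loss estimates), not the paper's exact variant or analysis. The learning-rate schedule `eta = 2/√t` and the binary-search solver for the FTRL normalization constant are conventional choices assumed here.

```python
import numpy as np

def tsallis_inf(losses, rng=0):
    """Tsallis-INF (alpha = 1/2) for K-armed adversarial bandits.

    losses: (T, K) array of per-round losses in [0, 1]; bandit
    feedback is simulated by revealing only the pulled arm's loss.
    Returns the sequence of pulled arms.
    """
    rng = rng if isinstance(rng, np.random.Generator) else np.random.default_rng(rng)
    T, K = losses.shape
    L = np.zeros(K)                  # cumulative importance-weighted loss estimates
    pulls = np.empty(T, dtype=int)
    for t in range(1, T + 1):
        eta = 2.0 / np.sqrt(t)       # standard schedule for alpha = 1/2
        # FTRL step in closed form up to a normalization constant x:
        #   w_i = 4 / (eta * (L_i - x))^2,  with x chosen so sum(w) = 1.
        # x lies in [min(L) - sqrt(K)*2/eta, min(L)); binary search on x.
        lo, hi = L.min() - (2.0 / eta) * np.sqrt(K), L.min() - 1e-12
        for _ in range(100):
            x = (lo + hi) / 2.0
            w = 4.0 / (eta * (L - x)) ** 2
            if w.sum() > 1.0:
                hi = x               # weights too large: push x down
            else:
                lo = x
        w = w / w.sum()              # clean up residual normalization error
        arm = int(rng.choice(K, p=w))
        pulls[t - 1] = arm
        L[arm] += losses[t - 1, arm] / w[arm]   # importance-weighted estimate
    return pulls
```

On an easy instance (one arm always better), the sampling distribution concentrates quickly on the good arm, which is the behavior behind the instance-dependent logarithmic bounds discussed above.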