Best of Both Worlds: Regret Minimization versus Minimax Play

📅 2025-02-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses a long-standing open problem in online learning: simultaneously achieving constant regret against a given benchmark policy and $O(\sqrt{T})$ regret against the best policy in hindsight. Focusing on symmetric zero-sum games, both normal-form and extensive-form, it unifies no-regret learning with low exploitability. The authors propose the first bandit-feedback algorithm that provably attains both objectives: $O(1)$ regret relative to a prescribed baseline policy and $O(\sqrt{T})$ regret relative to the best-in-hindsight policy. The analysis combines game-theoretic arguments, online convex optimization, and bandit techniques. The resulting algorithm achieves an optimal trade-off between robustness, suffering at most $O(1)$ loss against adversarial opponents, and adaptivity, gaining $\Omega(T)$ reward against exploitable opponents, thereby overcoming the limitations of running either a conventional no-regret algorithm or a fixed minimax strategy alone.
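The trade-off the summary describes can be illustrated with a much simpler classical device than the paper's algorithm: exponential weights (Hedge) run with a nonuniform prior that puts most of its mass on the baseline policy. This is only a minimal sketch of the tension between tracking a baseline and competing with the best expert in hindsight, not the authors' method; `hedge_with_prior`, the loss model, and all parameter values below are illustrative assumptions. With prior weight $p_i$ on expert $i$ and losses in $[0,1]$, Hedge's regret to expert $i$ is at most $\ln(1/p_i)/\eta + \eta T$, so a large prior on the baseline shrinks regret against it while the $O(\sqrt{T})$ guarantee against every other expert is retained (the paper goes further, obtaining $O(1)$ regret to the baseline under bandit feedback).

```python
import math
import random

def hedge_with_prior(losses, prior, eta):
    """Exponentially weighted forecaster with a nonuniform prior.

    losses: list of T rounds, each a list of K losses in [0, 1].
    prior:  initial weights over the K experts (sums to 1).
    eta:    learning rate.
    Returns the algorithm's total expected loss and the cumulative
    loss of each expert.
    """
    K = len(prior)
    cum = [0.0] * K          # cumulative loss of each expert
    total_loss = 0.0
    for round_losses in losses:
        # Posterior weights: prior reweighted by exponentiated losses.
        w = [p * math.exp(-eta * c) for p, c in zip(prior, cum)]
        z = sum(w)
        probs = [x / z for x in w]
        # Expected loss of playing an expert drawn from `probs`.
        total_loss += sum(p, for_ := 0) if False else sum(
            p * l for p, l in zip(probs, round_losses))
        cum = [c + l for c, l in zip(cum, round_losses)]
    return total_loss, cum

random.seed(0)
T, K = 2000, 5
# Expert 0 plays the trusted baseline: give it most of the prior mass,
# so ln(1/prior[0]) -- and hence regret to the baseline -- stays small.
prior = [0.9] + [0.1 / (K - 1)] * (K - 1)
eta = math.sqrt(math.log(K) / T)
losses = [[random.random() for _ in range(K)] for _ in range(T)]
total, cum = hedge_with_prior(losses, prior, eta)
print("regret to baseline:", total - cum[0])
print("regret to best expert:", total - min(cum))
```

Note that this sketch only makes the baseline regret *small*, of order $\ln(1/p_0)/\eta + \eta T$; obtaining a truly $O(1)$ baseline guarantee simultaneously with $O(\sqrt{T})$ hindsight regret, and doing so under bandit rather than full-information feedback, is exactly the harder question the paper resolves.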

📝 Abstract
In this paper, we investigate the existence of online learning algorithms with bandit feedback that simultaneously guarantee $O(1)$ regret compared to a given comparator strategy, and $O(\sqrt{T})$ regret compared to the best strategy in hindsight, where $T$ is the number of rounds. We provide the first affirmative answer to this question. In the context of symmetric zero-sum games, both in normal and extensive form, we show that our results allow us to guarantee to risk at most $O(1)$ loss while being able to gain $\Omega(T)$ from exploitable opponents, thereby combining the benefits of both no-regret algorithms and minimax play.
Problem

Research questions and friction points this paper is trying to address.

Online learning with bandit feedback
Simultaneous regret guarantees
Optimization in symmetric zero-sum games
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bandit feedback online learning
O(1) and O(√T) regret guarantees
Combines no-regret and minimax strategies