Best of Both Worlds: Regret Minimization versus Minimax Play

📅 2025-02-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses a long-standing open problem in online learning: simultaneously achieving constant regret against a given benchmark policy and $O(\sqrt{T})$ regret against the best policy in hindsight. Focusing on symmetric zero-sum games, both normal-form and extensive-form, it unifies no-regret learning with low exploitability. The authors propose the first bandit-feedback algorithm that provably attains both objectives: $O(1)$ regret relative to a prescribed baseline policy and $O(\sqrt{T})$ regret relative to the best-in-hindsight policy. The analysis combines game-theoretic arguments, online convex optimization, and bandit techniques. The resulting algorithm achieves an optimal trade-off between robustness, suffering at most $O(1)$ loss against adversarial opponents, and adaptivity, gaining $\Omega(T)$ reward against exploitable opponents, thereby overcoming the limitations of running either a conventional no-regret algorithm or a fixed minimax strategy alone.
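The trade-off the summary describes can be illustrated with a much simpler classical device than the paper's algorithm: exponential weights (Hedge) run with a nonuniform prior that puts most of its mass on the baseline policy. This is only a minimal sketch of the tension between tracking a baseline and competing with the best expert in hindsight, not the authors' method; `hedge_with_prior`, the loss model, and all parameter values below are illustrative assumptions. With prior weight $p_i$ on expert $i$ and losses in $[0,1]$, Hedge's regret to expert $i$ is at most $\ln(1/p_i)/\eta + \eta T$, so a large prior on the baseline shrinks regret against it while the $O(\sqrt{T})$ guarantee against every other expert is retained (the paper goes further, obtaining $O(1)$ regret to the baseline under bandit feedback).

```python
import math
import random

def hedge_with_prior(losses, prior, eta):
    """Exponentially weighted forecaster with a nonuniform prior.

    losses: list of T rounds, each a list of K losses in [0, 1].
    prior:  initial weights over the K experts (sums to 1).
    eta:    learning rate.
    Returns the algorithm's total expected loss and the cumulative
    loss of each expert.
    """
    K = len(prior)
    cum = [0.0] * K          # cumulative loss of each expert
    total_loss = 0.0
    for round_losses in losses:
        # Posterior weights: prior reweighted by exponentiated losses.
        w = [p * math.exp(-eta * c) for p, c in zip(prior, cum)]
        z = sum(w)
        probs = [x / z for x in w]
        # Expected loss of playing an expert drawn from `probs`.
        total_loss += sum(p, for_ := 0) if False else sum(
            p * l for p, l in zip(probs, round_losses))
        cum = [c + l for c, l in zip(cum, round_losses)]
    return total_loss, cum

random.seed(0)
T, K = 2000, 5
# Expert 0 plays the trusted baseline: give it most of the prior mass,
# so ln(1/prior[0]) -- and hence regret to the baseline -- stays small.
prior = [0.9] + [0.1 / (K - 1)] * (K - 1)
eta = math.sqrt(math.log(K) / T)
losses = [[random.random() for _ in range(K)] for _ in range(T)]
total, cum = hedge_with_prior(losses, prior, eta)
print("regret to baseline:", total - cum[0])
print("regret to best expert:", total - min(cum))
```

Note that this sketch only makes the baseline regret *small*, of order $\ln(1/p_0)/\eta + \eta T$; obtaining a truly $O(1)$ baseline guarantee simultaneously with $O(\sqrt{T})$ hindsight regret, and doing so under bandit rather than full-information feedback, is exactly the harder question the paper resolves.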

📝 Abstract
In this paper, we investigate the existence of online learning algorithms with bandit feedback that simultaneously guarantee $O(1)$ regret compared to a given comparator strategy, and $O(\sqrt{T})$ regret compared to the best strategy in hindsight, where $T$ is the number of rounds. We provide the first affirmative answer to this question. In the context of symmetric zero-sum games, both in normal and extensive form, we show that our results allow us to guarantee to risk at most $O(1)$ loss while being able to gain $\Omega(T)$ from exploitable opponents, thereby combining the benefits of both no-regret algorithms and minimax play.
Problem

Research questions and friction points this paper is trying to address.

Online learning with bandit feedback
Simultaneous regret guarantees
Optimization in symmetric zero-sum games
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bandit feedback online learning
O(1) and O(√T) regret guarantees
Combines no-regret and minimax strategies