Optimal Regret for Policy Optimization in Contextual Bandits

πŸ“… 2026-02-14

πŸ€– AI Summary
This work addresses the lack of high-probability optimal regret bounds for policy optimization methods in stochastic contextual multi-armed bandits (CMAB). To bridge the gap between theoretical guarantees and practical applicability, the paper proposes an efficient algorithm that integrates policy optimization with general offline function approximation. The method establishes, for the first time, a high-probability optimal regret bound of $\widetilde{O}(\sqrt{K|\mathcal{A}| \log|\mathcal{F}|})$ for policy optimization in CMAB, where $K$ denotes the number of rounds, $|\mathcal{A}|$ the number of actions, and $|\mathcal{F}|$ the cardinality of the function class used to approximate the losses. Both theoretical analysis and empirical experiments corroborate the algorithm’s effectiveness and optimality, thereby providing a solid foundation for deploying policy optimization in real-world contextual bandit settings.

πŸ“ Abstract
We present the first high-probability optimal regret bound for a policy optimization technique applied to the problem of stochastic contextual multi-armed bandit (CMAB) with general offline function approximation. Our algorithm is both efficient and achieves an optimal regret bound of $\widetilde{O}(\sqrt{ K|\mathcal{A}|\log|\mathcal{F}|})$, where $K$ is the number of rounds, $\mathcal{A}$ is the set of arms, and $\mathcal{F}$ is the function class used to approximate the losses. Our results bridge the gap between theory and practice, demonstrating that the widely used policy optimization methods for the contextual bandit problem can achieve a rigorously proven optimal regret bound. We support our theoretical results with an empirical evaluation of our algorithm.
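The setting in the abstract can be illustrated with a toy simulation. The sketch below is not the paper's algorithm; it runs a classic EXP4-style exponential-weights policy optimizer over a small, hypothetical finite policy class (all deterministic context-to-arm maps), whose regret against the best fixed policy scales at the $\sqrt{K|\mathcal{A}|\log|\mathcal{F}|}$ rate discussed above. All names, parameters, and the problem instance are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

K, n_arms = 5000, 2                      # rounds and |A|
contexts = rng.integers(0, 2, size=K)    # binary contexts, drawn i.i.d.
# toy mean losses: arm 0 is good in context 0, arm 1 in context 1
mean_loss = np.array([[0.2, 0.8],
                      [0.8, 0.2]])

# hypothetical policy class F: all 4 deterministic maps context -> arm
policies = [lambda x, a=a, b=b: a if x == 0 else b
            for a in range(n_arms) for b in range(n_arms)]

# learning rate tuned to the sqrt(K * |A| * log|F|) rate
eta = np.sqrt(np.log(len(policies)) / (K * n_arms))
w = np.zeros(len(policies))              # log-weights over policies

total_loss = 0.0
for t in range(K):
    x = contexts[t]
    advice = np.array([pi(x) for pi in policies])
    p_pi = np.exp(w - w.max())
    p_pi /= p_pi.sum()
    # mix the policies' recommendations into an arm distribution
    p_arm = np.bincount(advice, weights=p_pi, minlength=n_arms)
    arm = rng.choice(n_arms, p=p_arm)
    loss = float(rng.random() < mean_loss[x, arm])  # Bernoulli loss
    total_loss += loss
    # importance-weighted loss estimate, charged to agreeing policies
    est = loss / p_arm[arm]
    w -= eta * est * (advice == arm)

# regret against the best fixed policy in F
best = min(sum(mean_loss[c, pi(c)] for c in contexts) for pi in policies)
regret = total_loss - best
print(regret)
```

On this instance the realized regret stays well below the worst-case $\sqrt{K|\mathcal{A}|\log|\mathcal{F}|} \approx 118$ scale up to constants, consistent with the optimal rate the paper proves (there, with high probability and with general offline function approximation rather than an explicit enumeration of policies).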
Problem

Research questions and friction points this paper is trying to address.

contextual bandits
policy optimization
regret bound
function approximation
stochastic multi-armed bandit

Innovation

Methods, ideas, or system contributions that make the work stand out.

policy optimization
contextual bandits
optimal regret
function approximation
stochastic multi-armed bandit
Orin Levy
School of Computer Science and AI, Tel-Aviv University, Tel-Aviv, Israel
Yishay Mansour
Tel Aviv University
machine learning, reinforcement learning, algorithmic game theory