AI Summary
This work addresses the lack of high-probability optimal regret bounds for policy optimization methods in stochastic contextual multi-armed bandits (CMAB). To bridge the gap between theoretical guarantees and practical applicability, the paper proposes an efficient algorithm that integrates policy optimization with general offline function approximation. The method establishes, for the first time, a high-probability optimal regret bound of $\widetilde{O}(\sqrt{K|\mathcal{A}| \log|\mathcal{F}|})$ for policy optimization in CMAB, where $K$ denotes the number of rounds, $|\mathcal{A}|$ the number of arms, and $|\mathcal{F}|$ the size of the function class used to approximate the losses. Both theoretical analysis and empirical experiments corroborate the algorithm's effectiveness and optimality, thereby providing a solid foundation for deploying policy optimization in real-world contextual bandit settings.
Abstract
We present the first high-probability optimal regret bound for a policy optimization method applied to the stochastic contextual multi-armed bandit (CMAB) problem with general offline function approximation. Our algorithm is efficient and achieves an optimal regret bound of $\widetilde{O}(\sqrt{K|\mathcal{A}|\log|\mathcal{F}|})$, where $K$ is the number of rounds, $\mathcal{A}$ is the set of arms, and $\mathcal{F}$ is the function class used to approximate the losses. Our results bridge the gap between theory and practice, demonstrating that the widely used policy optimization methods for the contextual bandit problem can achieve a rigorously proven optimal regret bound. We support our theoretical results with an empirical evaluation of our algorithm.
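To give a concrete feel for how the stated bound scales, the following minimal sketch evaluates the leading term $\sqrt{K|\mathcal{A}|\log|\mathcal{F}|}$ for hypothetical parameter values (the specific numbers are illustrative assumptions, not from the paper, and the logarithmic factors hidden by the $\widetilde{O}$ notation are ignored):

```python
import math

def regret_bound(K: int, num_arms: int, func_class_size: int) -> float:
    """Leading term sqrt(K * |A| * log|F|) of the regret bound,
    ignoring log factors hidden by the tilde-O notation."""
    return math.sqrt(K * num_arms * math.log(func_class_size))

# Hypothetical values: 10,000 rounds, 10 arms, |F| = 10^6.
b1 = regret_bound(K=10_000, num_arms=10, func_class_size=1_000_000)
# Doubling the horizon K multiplies the bound by sqrt(2),
# i.e. regret grows sublinearly in the number of rounds.
b2 = regret_bound(K=20_000, num_arms=10, func_class_size=1_000_000)
print(round(b2 / b1, 3))
```

The sublinear dependence on $K$ is what makes the bound "no regret": the per-round average $\widetilde{O}(\sqrt{|\mathcal{A}|\log|\mathcal{F}|/K})$ vanishes as the horizon grows.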