AI Summary
This work addresses the lack of high-probability optimal regret bounds for policy optimization methods in stochastic contextual multi-armed bandits (CMAB). To bridge the gap between theoretical guarantees and practical applicability, the paper proposes an efficient algorithm that integrates policy optimization with general offline function approximation. The method establishes, for the first time, a high-probability optimal regret bound of $\widetilde{O}(\sqrt{K|\mathcal{A}| \log|\mathcal{F}|})$ for policy optimization in CMAB, where $K$ denotes the number of rounds, $|\mathcal{A}|$ the number of arms, and $|\mathcal{F}|$ the size of the function class used to approximate the losses. Both theoretical analysis and empirical experiments corroborate the algorithm's effectiveness and optimality, thereby providing a solid foundation for deploying policy optimization in real-world contextual bandit settings.
Abstract
We present the first high-probability optimal regret bound for a policy optimization method applied to the stochastic contextual multi-armed bandit (CMAB) problem with general offline function approximation. Our algorithm is efficient and achieves an optimal regret bound of $\widetilde{O}(\sqrt{K|\mathcal{A}|\log|\mathcal{F}|})$, where $K$ is the number of rounds, $\mathcal{A}$ is the set of arms, and $\mathcal{F}$ is the function class used to approximate the losses. Our results bridge the gap between theory and practice, demonstrating that the widely used policy optimization methods for the contextual bandit problem can achieve a rigorously proven optimal regret bound. We support our theoretical results with an empirical evaluation of our algorithm.
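To give a concrete feel for how the stated bound scales, the following minimal sketch evaluates the leading term $\sqrt{K|\mathcal{A}|\log|\mathcal{F}|}$ for hypothetical parameter values (the specific numbers are illustrative assumptions, not from the paper, and the logarithmic factors hidden by the $\widetilde{O}$ notation are ignored):

```python
import math

def regret_bound(K: int, num_arms: int, func_class_size: int) -> float:
    """Leading term sqrt(K * |A| * log|F|) of the regret bound,
    ignoring log factors hidden by the tilde-O notation."""
    return math.sqrt(K * num_arms * math.log(func_class_size))

# Hypothetical values: 10,000 rounds, 10 arms, |F| = 10^6.
b1 = regret_bound(K=10_000, num_arms=10, func_class_size=1_000_000)
# Doubling the horizon K multiplies the bound by sqrt(2),
# i.e. regret grows sublinearly in the number of rounds.
b2 = regret_bound(K=20_000, num_arms=10, func_class_size=1_000_000)
print(round(b2 / b1, 3))
```

The sublinear dependence on $K$ is what makes the bound "no regret": the per-round average $\widetilde{O}(\sqrt{|\mathcal{A}|\log|\mathcal{F}|/K})$ vanishes as the horizon grows.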