An Improved Algorithm for Adversarial Linear Contextual Bandits via Reduction

📅 2025-08-16
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This paper studies the linear contextual bandit problem under adversarial losses and stochastic action sets, focusing on efficient solutions without prior knowledge of the context distribution or access to a context simulator. We propose a reduction framework that transforms the original problem into a misspecified adversarial linear bandit problem over a fixed action set, and design polynomial-time algorithms. Our work resolves an open problem posed by Liu et al. by achieving $\tilde{O}(\mathrm{poly}(d)\sqrt{T})$ regret for action sets representable by a polynomial number of linear constraints. Without a context simulator, our regret bound is $\tilde{O}(\min\{d^2\sqrt{T}, \sqrt{d^3 T \log K}\})$; with a simulator, it improves to $\tilde{O}(d\sqrt{L^\star})$. All algorithms run in time polynomial in the dimension $d$, the number of linear constraints $C$, and the horizon $T$.

šŸ“ Abstract
We present an efficient algorithm for linear contextual bandits with adversarial losses and stochastic action sets. Our approach reduces this setting to misspecification-robust adversarial linear bandits with fixed action sets. Without knowledge of the context distribution or access to a context simulator, the algorithm achieves $\tilde{O}(\min\{d^2\sqrt{T}, \sqrt{d^3 T \log K}\})$ regret and runs in $\text{poly}(d,C,T)$ time, where $d$ is the feature dimension, $C$ is an upper bound on the number of linear constraints defining the action set in each round, $K$ is an upper bound on the number of actions in each round, and $T$ is the number of rounds. This resolves the open question of Liu et al. (2023) on whether one can obtain $\text{poly}(d)\sqrt{T}$ regret in polynomial time independent of the number of actions. For the important class of combinatorial bandits with adversarial losses and stochastic action sets where the action sets can be described by a polynomial number of linear constraints, our algorithm is the first to achieve $\text{poly}(d)\sqrt{T}$ regret in polynomial time; to our knowledge, no prior algorithm achieves even $o(T)$ regret in polynomial time. When a simulator is available, the regret bound improves to $\tilde{O}(d\sqrt{L^\star})$, where $L^\star$ is the cumulative loss of the best policy.
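To make the problem setting concrete, the following is a minimal sketch of the interaction protocol the abstract describes: in each round the learner receives a stochastic action set of feature vectors, picks one, and suffers a linear loss chosen adversarially. The sampling distributions, the uniform-random learner, and the parameter values here are illustrative placeholders, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, T = 5, 8, 1000  # feature dimension, actions per round, horizon (illustrative)

total_loss = 0.0
for t in range(T):
    # Stochastic action set: K unit feature vectors in R^d drawn fresh each round
    # (a stand-in for context-dependent action sets).
    actions = rng.normal(size=(K, d))
    actions /= np.linalg.norm(actions, axis=1, keepdims=True)

    # Adversarial losses: the adversary may choose the loss vector theta_t with
    # knowledge of the history; a fixed random sequence stands in for it here.
    theta_t = rng.normal(size=d)
    theta_t /= np.linalg.norm(theta_t)

    a = rng.integers(K)                 # placeholder learner: uniform exploration
    total_loss += actions[a] @ theta_t  # linear loss <x_a, theta_t>

print(total_loss)
```

Regret would then compare `total_loss` against the cumulative loss of the best policy mapping contexts to actions; the paper's contribution is achieving this comparison with $\tilde{O}(\mathrm{poly}(d)\sqrt{T})$ regret in polynomial time.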
Problem

Research questions and friction points this paper is trying to address.

How to design an efficient algorithm for linear contextual bandits with adversarial losses and stochastic action sets
Whether the problem can be reduced to misspecification-robust adversarial linear bandits over a fixed action set
Whether poly(d)√T regret is achievable in polynomial time, independent of the number of actions (open question of Liu et al., 2023)
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reduces adversarial linear contextual bandits to misspecification-robust bandits
Achieves poly(d)√T regret in polynomial time
Improves regret bound with simulator to Õ(d√L⋆)