🤖 AI Summary
This work addresses the long-standing open problem of minimizing dynamic regret against arbitrary comparator sequences in unconstrained adversarial linear bandits, where the learner receives only point-evaluation feedback on each round. The paper proposes an adaptive framework that requires no prior knowledge of the number of comparator switches $S_T$ and is the first in linear bandits to achieve the optimal dynamic regret bound for any value of $S_T$. Building on adaptive ensembling techniques from multi-armed bandits, together with a parameter-free design and a refined analysis of dynamic comparators, the method attains a dynamic regret upper bound of $\mathcal{O}(\sqrt{d(1+S_T)T})$, up to poly-logarithmic factors. This settles how to adapt to an unknown number of switches in this setting.
📝 Abstract
We study dynamic regret minimization in unconstrained adversarial linear bandit problems. In this setting, a learner must minimize the cumulative loss relative to an arbitrary sequence of comparators $\boldsymbol{u}_1,\ldots,\boldsymbol{u}_T$ in $\mathbb{R}^d$, but receives only point-evaluation feedback on each round. We provide a simple approach to combining the guarantees of several bandit algorithms, allowing us to optimally adapt to the number of switches $S_T = \sum_t\mathbb{I}\{\boldsymbol{u}_t \neq \boldsymbol{u}_{t-1}\}$ of an arbitrary comparator sequence. In particular, we provide the first algorithm for linear bandits achieving the optimal regret guarantee of order $\mathcal{O}\big(\sqrt{d(1+S_T) T}\big)$ up to poly-logarithmic terms without prior knowledge of $S_T$, thus resolving a long-standing open problem.
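To make the quantities in the abstract concrete, the following is a minimal sketch that computes the switch count $S_T$, the dynamic regret of a sequence of plays under linear losses, and the leading term $\sqrt{d(1+S_T)T}$ of the stated bound. It is not the paper's algorithm. The function names, the toy data, and the conventions that the sum defining $S_T$ starts at $t=2$ and that dynamic regret is $\sum_t \langle \boldsymbol{g}_t, \boldsymbol{w}_t - \boldsymbol{u}_t\rangle$ are assumptions made for illustration.

```python
import numpy as np

def count_switches(comparators):
    """S_T: number of rounds t >= 2 with u_t != u_{t-1} (assumed convention)."""
    u = np.asarray(comparators, dtype=float)
    # A round counts as a switch if any coordinate differs from the previous round.
    return int(np.any(u[1:] != u[:-1], axis=1).sum())

def dynamic_regret(loss_vectors, plays, comparators):
    """Sum over t of <g_t, w_t - u_t> for linear losses (assumed definition)."""
    g = np.asarray(loss_vectors, dtype=float)
    w = np.asarray(plays, dtype=float)
    u = np.asarray(comparators, dtype=float)
    return float(np.sum(g * (w - u)))

def bound_leading_term(d, switches, horizon):
    """sqrt(d * (1 + S_T) * T), ignoring constants and poly-logarithmic factors."""
    return float(np.sqrt(d * (1 + switches) * horizon))

# Hypothetical toy instance with T = 6 rounds in d = 2 dimensions.
U = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0, 1], [-1, 0]])      # comparators u_t
G = np.array([[-1, 0], [-1, 0], [0, -1], [0, -1], [0, -1], [1, 0]])  # loss vectors g_t
W = np.zeros((6, 2))                                                 # learner always plays 0

S_T = count_switches(U)
regret = dynamic_regret(G, W, U)
rate = bound_leading_term(d=2, switches=S_T, horizon=6)
print(S_T, regret, rate)
```

In this toy instance the comparator changes twice, so the leading term scales with $1+S_T = 3$ rather than with the full horizon. The bandit aspect is not modeled here: the sketch assumes full access to each $\boldsymbol{g}_t$ purely to evaluate regret after the fact, whereas the learner in the paper's setting only observes the scalar loss of its own play.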