Non-stationary Bandit Convex Optimization: A Comprehensive Study

📅 2025-06-03
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This paper studies bandit convex optimization in non-stationary environments, aiming to minimize dynamic regret with respect to the number of switches $S$ in the comparator sequence, the total variation $\Delta$ of the loss functions, and the path length $P$ of the comparators. Methodologically, it introduces TEWA-SE, a two-scale algorithm combining tilted exponential-weighted averaging (TEWA) with a sleeping-experts framework, and cExO, which runs exponential weights with clipped exploration over a discretized action space; TEWA-SE is further equipped with Bandit-over-Bandit meta-learning to handle unknown non-stationarity measures. The contributions are threefold: (i) it establishes the first matching upper and lower bounds on dynamic regret in terms of $S$ and $\Delta$ for non-stationary bandit convex optimization; (ii) TEWA-SE achieves minimax-optimal regret $O(S^{1/2}T^{1/2} + \Delta^{1/2}T^{1/2})$ under strong convexity; and (iii) cExO attains the same regret bound in terms of $S$ and $\Delta$ for general convex losses while significantly improving the dependence on $P$.
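For reference, a compact restatement of the quantities the summary refers to, in notation assumed from the abstract (the paper's exact definitions may differ in constants and choice of norm):

```latex
% Dynamic regret against a comparator sequence u_1, ..., u_T (notation assumed from the abstract)
\begin{align*}
  \mathrm{Reg}_T(u_{1:T}) &= \sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(u_t), \\
  % number of switches in the comparator sequence
  S &= 1 + \sum_{t=2}^{T} \mathbf{1}\{u_t \neq u_{t-1}\}, \\
  % total variation of the loss functions
  \Delta &= \sum_{t=2}^{T} \sup_{x} \, \lvert f_t(x) - f_{t-1}(x) \rvert, \\
  % path length of the comparator sequence
  P &= \sum_{t=2}^{T} \lVert u_t - u_{t-1} \rVert .
\end{align*}
```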

๐Ÿ“ Abstract
Bandit Convex Optimization is a fundamental class of sequential decision-making problems, where the learner selects actions from a continuous domain and observes a loss (but not its gradient) at only one point per round. We study this problem in non-stationary environments, and aim to minimize the regret under three standard measures of non-stationarity: the number of switches $S$ in the comparator sequence, the total variation $Delta$ of the loss functions, and the path-length $P$ of the comparator sequence. We propose a polynomial-time algorithm, Tilted Exponentially Weighted Average with Sleeping Experts (TEWA-SE), which adapts the sleeping experts framework from online convex optimization to the bandit setting. For strongly convex losses, we prove that TEWA-SE is minimax-optimal with respect to known $S$ and $Delta$ by establishing matching upper and lower bounds. By equipping TEWA-SE with the Bandit-over-Bandit framework, we extend our analysis to environments with unknown non-stationarity measures. For general convex losses, we introduce a second algorithm, clipped Exploration by Optimization (cExO), based on exponential weights over a discretized action space. While not polynomial-time computable, this method achieves minimax-optimal regret with respect to known $S$ and $Delta$, and improves on the best existing bounds with respect to $P$.
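To make the feedback model concrete, here is a minimal, self-contained sketch of one-point bandit feedback with an exponential-weights learner over a discretized grid. It only illustrates the interaction protocol described in the abstract, not TEWA-SE or cExO; the domain [0, 1], the drifting quadratic loss, and the step sizes are assumptions chosen for the example.

```python
import numpy as np

# Minimal illustration of the one-point bandit feedback protocol: the learner plays a
# single action per round and observes only the loss value there (no gradient). This is
# a plain exponential-weights learner over a discretized grid, NOT the paper's TEWA-SE
# or cExO; the domain, loss, and step sizes are assumed values.

rng = np.random.default_rng(0)

T = 1_000                                 # horizon
grid = np.linspace(0.0, 1.0, 51)          # discretized action space (assumed domain [0, 1])
eta, gamma = 0.05, 0.10                   # learning rate and exploration rate (illustrative)
log_w = np.zeros(len(grid))               # log-weights over the grid points

def loss(t: int, x: float) -> float:
    """Placeholder non-stationary convex loss whose minimizer drifts over time."""
    target = 0.5 + 0.3 * np.sin(2 * np.pi * t / 300)
    return (x - target) ** 2

total_loss = 0.0
for t in range(T):
    p = np.exp(log_w - log_w.max())
    p /= p.sum()
    p = (1.0 - gamma) * p + gamma / len(grid)   # mix in uniform exploration
    i = rng.choice(len(grid), p=p)              # play one grid point ...
    ell = loss(t, grid[i])                      # ... and observe only its loss value
    total_loss += ell
    estimate = np.zeros(len(grid))
    estimate[i] = ell / p[i]                    # importance-weighted loss estimate
    log_w -= eta * estimate                     # exponential-weights update

print(f"average loss over {T} rounds: {total_loss / T:.4f}")
```

Because this learner never forgets, a drifting minimizer eventually makes its weights stale; the sleeping-experts and Bandit-over-Bandit constructions in the paper are mechanisms for re-weighting or restarting so the learner can track such drift.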
Problem

Research questions and friction points this paper is trying to address.

Minimizing regret in non-stationary bandit convex optimization
Adapting algorithms for unknown non-stationarity measures
Achieving minimax-optimal bounds for convex losses
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapts sleeping experts to bandit setting
Uses Bandit-over-Bandit for unknown non-stationarity (see the sketch after this list)
Employs clipped Exploration by Optimization method
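A hedged sketch of the Bandit-over-Bandit idea referenced above: an EXP3-style meta learner picks a candidate parameter for each block of rounds, a base learner runs with that parameter inside the block, and the block's realized average loss is fed back to the meta learner. The block length, candidate grid, base update, and losses below are illustrative placeholders, not the paper's TEWA-SE instantiation.

```python
import numpy as np

# Bandit-over-Bandit (BoB) sketch: a meta bandit tunes a base learner's parameter
# block by block, using only the realized block loss as feedback.

rng = np.random.default_rng(1)

T, H = 10_000, 100                        # horizon and block length (illustrative)
n_blocks = T // H
candidates = [0.01, 0.05, 0.1, 0.5]       # candidate step sizes for the base learner
gamma_meta = 0.1                          # exploration rate of the meta learner
eta_meta = np.sqrt(np.log(len(candidates)) / n_blocks)
log_w = np.zeros(len(candidates))

def run_base_block(step_size: float, block_idx: int) -> float:
    """Placeholder base learner: returns the average loss incurred over one block."""
    x, losses = 0.0, []
    for s in range(H):
        t = block_idx * H + s
        target = 0.5 + 0.3 * np.sin(2 * np.pi * t / 3000)        # drifting minimizer
        noisy_grad = 2.0 * (x - target) + rng.normal(0.0, 0.1)   # stand-in gradient estimate
        x -= step_size * noisy_grad
        losses.append((x - target) ** 2)
    return float(np.mean(losses))

for b in range(n_blocks):
    p = np.exp(log_w - log_w.max())
    p /= p.sum()
    p = (1.0 - gamma_meta) * p + gamma_meta / len(candidates)
    j = rng.choice(len(candidates), p=p)            # meta learner picks a parameter
    block_loss = run_base_block(candidates[j], b)   # base learner runs for one block
    log_w[j] -= eta_meta * block_loss / p[j]        # importance-weighted meta update

p_final = np.exp(log_w - log_w.max())
print("meta preferences over step sizes:", np.round(p_final / p_final.sum(), 3))
```

Roughly speaking, the meta layer tunes parameters that would otherwise require knowing $S$ or $\Delta$ in advance, which is how the abstract extends TEWA-SE to environments with unknown non-stationarity measures.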
Xiaoqi Liu
University of Oxford
Dorian Baudry
University of Oxford
Julian Zimmert
Google Research
Patrick Rebeschini
University of Oxford
Arya Akhavan
University of Oxford

Bandit theory, Reinforcement Learning, Machine Learning