Bandit Convex Optimization with Gradient Prediction Adaptivity

📅 2026-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work studies bandit convex optimization with imperfect gradient predictions. Under single-point feedback, an Ω(√T) regret lower bound persists even when the predictions are accurate (S_T = o(T)), because the variance of the gradient estimate masks the benefit of good predictions. For the two-point feedback setting, the paper proposes the TP-VR-OPT algorithm, which achieves a regret upper bound of O(√(d𝔼[S_T])), where 𝔼[S_T] is the expected cumulative prediction error and d is the decision dimension. The paper also establishes an information-theoretic lower bound of Ω(√𝔼[S_T]), so the upper bound is tight up to a √d factor. The method combines variance-reduced gradient estimation, optimistic online learning, and adaptive step sizes, and its adaptive variants require no prior knowledge of 𝔼[S_T] or the time horizon T. The framework also extends to non-stationary environments, with dynamic regret guarantees that adapt to both the comparator path length and the prediction error.
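To make the "optimistic online learning with adaptive step sizes" ingredient concrete, here is a minimal sketch of a generic optimistic online gradient descent loop. It is not TP-VR-OPT: it assumes full gradient feedback, linear losses f_t(x) = <g_t, x>, and an L2-ball decision set, and the names `optimistic_ogd`, `project_to_ball`, and `eps` are hypothetical. The step size shrinks with the prediction error accumulated so far, so it needs neither S_T nor T in advance.

```python
import math


def project_to_ball(x, radius):
    """Euclidean projection of x onto the L2 ball of the given radius."""
    norm = math.sqrt(sum(c * c for c in x))
    if norm <= radius:
        return list(x)
    return [radius * c / norm for c in x]


def optimistic_ogd(gradients, predictions, radius, eps=1.0):
    """Generic optimistic OGD on linear losses f_t(x) = <g_t, x>.

    Illustrative assumption: full gradients g_t are revealed each round
    (the paper's setting only reveals function values).
    Returns (regret against the best fixed point, cumulative error S_T).
    """
    d = len(gradients[0])
    y = [0.0] * d          # secondary iterate, updated with true gradients
    err = 0.0              # running sum of ||g_s - m_s||^2
    total_loss = 0.0
    for g, m in zip(gradients, predictions):
        # Step size adapts to the prediction error observed so far.
        eta = 2.0 * radius / math.sqrt(eps + err)
        # Optimistic step: play the point suggested by the prediction m_t.
        x = project_to_ball([yi - eta * mi for yi, mi in zip(y, m)], radius)
        total_loss += sum(gi * xi for gi, xi in zip(g, x))
        # Correction step: update the secondary iterate with the true gradient.
        y = project_to_ball([yi - eta * gi for yi, gi in zip(y, g)], radius)
        err += sum((gi - mi) ** 2 for gi, mi in zip(g, m))
    # Best fixed comparator on the ball for linear losses: -R * G / ||G||.
    total_grad = [sum(g[i] for g in gradients) for i in range(d)]
    best_loss = -radius * math.sqrt(sum(c * c for c in total_grad))
    return total_loss - best_loss, err
```

With perfect predictions (m_t = g_t) the accumulated error stays at zero and the step size never shrinks. With uninformative predictions (m_t = 0) the loop behaves like ordinary adaptive gradient descent.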

📝 Abstract

Bandit convex optimization (BCO) is a fundamental online learning framework with partial feedback, where the learner observes only the loss incurred at the chosen decision point in each round. In this work, we investigate whether optimistic gradient predictions can improve worst-case regret guarantees in a prediction-adaptive manner. Specifically, given gradient predictions $m_t$, we seek regret bounds that scale with the cumulative prediction error $S_T=\sum_{t=1}^T \|\nabla f_t(x_t)-m_t\|^2$. We first establish a negative result: under the single-point feedback protocol, an unavoidable $\Omega(\sqrt{T})$ regret lower bound persists even when $S_T=o(T)$, showing that the variance of gradient estimation fundamentally obscures the benefit of accurate predictions. To overcome this barrier, we propose Two-Point Variance-Reduced Optimistic Gradient Descent (TP-VR-OPT) for the two-point feedback setting. The key idea is a novel variance-reduced gradient estimator whose variance scales with the prediction error rather than the gradient norm. This yields a regret bound of $O\big(\sqrt{d\,\mathbb{E}[S_T]}\big)$, where $d$ is the decision dimension. Complementing this result, we establish an information-theoretic lower bound that scales as $\Omega(\sqrt{\mathbb{E}[S_T]})$, providing a fundamental characterization of the best achievable prediction-adaptive regret and showing that TP-VR-OPT is optimal up to a factor of $\sqrt{d}$. We further develop adaptive variants that eliminate the need for prior knowledge of $\mathbb{E}[S_T]$ or the horizon $T$, and extend our framework to non-stationary environments, establishing dynamic regret guarantees that adapt simultaneously to the cumulative prediction error and the comparator path length.
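The abstract does not spell out the estimator, so the following is only a sketch of one way a two-point estimator can have variance that scales with the prediction error: use the prediction m as a control variate and apply the random-direction estimate to the residual alone. The construction, the function names `unit_direction` and `two_point_estimate`, and the smoothing radius `delta` are assumptions for illustration and may differ from the paper's actual estimator.

```python
import math
import random


def unit_direction(d, rng):
    """Sample a direction uniformly from the unit sphere in R^d."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm > 0.0:
            return [c / norm for c in v]


def two_point_estimate(f, x, m, delta, rng):
    """Two-point gradient estimate with the prediction m as a control variate.

    Illustrative assumption, not the paper's stated construction:
        g = m + d * (slope - <m, u>) * u,
    where slope = (f(x + delta*u) - f(x - delta*u)) / (2*delta).
    Since E[d * <m, u> * u] = m for u uniform on the sphere, the correction
    does not bias the usual two-point estimate; it only removes the part of
    the randomness that the prediction already explains.
    """
    d = len(x)
    u = unit_direction(d, rng)
    f_plus = f([xi + delta * ui for xi, ui in zip(x, u)])
    f_minus = f([xi - delta * ui for xi, ui in zip(x, u)])
    slope = (f_plus - f_minus) / (2.0 * delta)        # approximates <grad f(x), u>
    predicted = sum(mi * ui for mi, ui in zip(m, u))  # <m, u>
    residual = slope - predicted                      # approximates <grad f(x) - m, u>
    return [mi + d * residual * ui for mi, ui in zip(m, u)]
```

For a quadratic loss the central difference is exact, so the estimate reduces to m + d <∇f(x) − m, u> u. When the prediction equals the true gradient, the random term vanishes and the estimate has no variance. With m = 0 it falls back to the standard two-point estimator, whose variance scales with the gradient norm.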

Problem

Research questions and friction points this paper is trying to address.

Bandit Convex Optimization

Gradient Prediction

Adaptive Regret

Partial Feedback

Prediction Error

Innovation

Methods, ideas, or system contributions that make the work stand out.

Bandit Convex Optimization

Gradient Prediction Adaptivity

Variance Reduction