Optimal High-Probability Regret for Online Convex Optimization with Two-Point Bandit Feedback

📅 2026-03-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses online convex optimization in adversarial environments under two-point bandit feedback, where the player observes only the loss values at two queried points in each round. Focusing on μ-strongly convex loss functions, the paper builds on two-point gradient estimation and develops a high-probability analysis that handles the heavy-tailed nature of bandit gradient estimators, for which standard concentration arguments are difficult to apply. The authors establish the first high-probability regret bound of O(d(log T + log(1/δ))/μ), which is minimax optimal in both the time horizon T and the dimension d. This resolves the open problem highlighted by Agarwal et al. (2010).

📝 Abstract

We consider the problem of Online Convex Optimization (OCO) with two-point bandit feedback in an adversarial environment. In this setting, a player attempts to minimize a sequence of adversarially generated convex loss functions while observing the value of each function at only two points. Although it is well known that two-point feedback allows for gradient estimation, achieving tight high-probability regret bounds for strongly convex functions has remained open, as highlighted by Agarwal et al. (2010). The primary challenge lies in the heavy-tailed nature of bandit gradient estimators, which makes standard concentration analysis difficult. In this paper, we resolve this open challenge by providing the first high-probability regret bound of $O(d(\log T + \log(1/\delta))/\mu)$ for $\mu$-strongly convex losses. Our result is minimax optimal with respect to both the time horizon $T$ and the dimension $d$.
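To make the idea of gradient estimation from two function values concrete, here is a minimal sketch. It is not the paper's algorithm. It assumes the classical two-point estimator, which queries the loss at x + δu and x − δu for a random unit direction u, and combines it with projected online gradient descent using the standard 1/(μt) step size for strongly convex losses. The function names, the ball-shaped feasible set, and the default parameters are all hypothetical choices made for illustration. Note that `delta` in the code is the query radius, not the failure probability δ in the regret bound.

```python
import numpy as np


def two_point_gradient(f, x, delta, rng):
    """Classical two-point estimate: (d / (2*delta)) * (f(x + delta*u) - f(x - delta*u)) * u.

    Assumption: u is drawn uniformly from the unit sphere.
    """
    d = x.shape[0]
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)  # normalizing a Gaussian gives a uniform direction
    return (d / (2.0 * delta)) * (f(x + delta * u) - f(x - delta * u)) * u


def project_ball(x, radius):
    """Euclidean projection onto the ball of the given radius centered at 0."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)


def run_two_point_ogd(losses, d, mu, radius=1.0, delta=1e-3, seed=0):
    """Projected online gradient descent driven only by two loss values per round.

    Assumptions (illustrative, not taken from the paper): the feasible set is a
    Euclidean ball, and the step size is 1 / (mu * t).
    """
    rng = np.random.default_rng(seed)
    x = np.zeros(d)
    iterates = []
    for t, f in enumerate(losses, start=1):
        iterates.append(x.copy())
        g = two_point_gradient(f, x, delta, rng)
        # Project onto a slightly shrunken ball so both query points stay feasible.
        x = project_ball(x - g / (mu * t), radius - delta)
    return np.array(iterates)
```

As a usage example, calling `run_two_point_ogd([f] * 2000, d=3, mu=1.0)` with a fixed quadratic `f(z) = 0.5 * ||z - c||^2` drives the iterates toward `c`. For quadratic losses this estimator is unbiased for the true gradient, but its magnitude scales with the dimension d, which gives some intuition for why concentration arguments for such estimators need care.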

Problem

Research questions and friction points this paper is trying to address.

Online Convex Optimization

Two-Point Bandit Feedback

High-Probability Regret

Strongly Convex Functions

Adversarial Environment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Online Convex Optimization

Two-Point Bandit Feedback

High-Probability Regret