Regret Bounds for Adversarial Contextual Bandits with General Function Approximation and Delayed Feedback

📅 2025-10-10
🤖 AI Summary
This paper studies the contextual multi-armed bandit (CMAB) problem over $K$ actions with delayed feedback, where the delays are chosen by an adversary, and focuses on regret minimization with general function approximation. For a finite policy class $\Pi$ that the learner can access directly, it establishes an expected regret bound of $O(\sqrt{KT \log |\Pi|} + \sqrt{D \log |\Pi|})$, where $D$ is the sum of delays. For a (possibly infinite) loss function class $\mathcal{F}$ accessed through an online least-squares regression oracle $\mathcal{O}$, it establishes an expected regret bound of $O(\sqrt{KT \mathcal{R}_T(\mathcal{O})} + \sqrt{d_{\max} D \beta})$ under the FIFO delay assumption, where $\mathcal{R}_T(\mathcal{O})$ is an upper bound on the oracle's regret, $d_{\max}$ is the maximal delay, and $\beta$ is a stability parameter of the oracle. The paper also gives a stability analysis of a Hedge-based version of Vovk's aggregating forecaster, showing that $\beta$ is bounded by $\log |\mathcal{F}|$ for a finite class $\mathcal{F}$, and a lower bound that the resulting regret bound matches up to a $\sqrt{d_{\max}}$ factor.

📝 Abstract
We present regret minimization algorithms for the contextual multi-armed bandit (CMAB) problem over $K$ actions in the presence of delayed feedback, a scenario where loss observations arrive with delays chosen by an adversary. As a preliminary result, assuming direct access to a finite policy class $\Pi$, we establish an optimal expected regret bound of $O(\sqrt{KT \log |\Pi|} + \sqrt{D \log |\Pi|})$, where $D$ is the sum of delays. For our main contribution, we study the general function approximation setting over a (possibly infinite) contextual loss function class $\mathcal{F}$ with access to an online least-squares regression oracle $\mathcal{O}$ over $\mathcal{F}$. In this setting, we achieve an expected regret bound of $O(\sqrt{KT \mathcal{R}_T(\mathcal{O})} + \sqrt{d_{\max} D \beta})$ assuming FIFO order, where $d_{\max}$ is the maximal delay, $\mathcal{R}_T(\mathcal{O})$ is an upper bound on the oracle's regret, and $\beta$ is a stability parameter associated with the oracle. We complement this general result by presenting a novel stability analysis of a Hedge-based version of Vovk's aggregating forecaster as an oracle implementation for least-squares regression over a finite function class $\mathcal{F}$, and show that its stability parameter $\beta$ is bounded by $\log |\mathcal{F}|$, resulting in an expected regret bound of $O(\sqrt{KT \log |\mathcal{F}|} + \sqrt{d_{\max} D \log |\mathcal{F}|})$, which is a $\sqrt{d_{\max}}$ factor away from the lower bound of $\Omega(\sqrt{KT \log |\mathcal{F}|} + \sqrt{D \log |\mathcal{F}|})$ that we also present.
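To make the idea of an online least-squares regression oracle over a finite function class concrete, here is a minimal sketch of an exponentially weighted (Hedge-style) average forecaster for squared loss. It is not the paper's algorithm: the class name `ExpWeightsRegressor`, the learning rate `eta = 0.5`, the toy function class, and the synthetic data are all assumptions made for illustration, and the sketch ignores delays entirely. It relies on the standard fact that squared loss with predictions and outcomes in $[0, 1]$ is exp-concave for `eta <= 0.5`, which gives the forecaster a regret of at most $\log |\mathcal{F}| / \eta$ against the best function in the class.

```python
import math
import random


class ExpWeightsRegressor:
    """Exponentially weighted average forecaster for squared loss on [0, 1].

    Illustrative only: a generic Hedge-style aggregator, not the paper's oracle.
    """

    def __init__(self, functions, eta=0.5):
        self.functions = list(functions)  # finite class of f(context, action) -> [0, 1]
        self.eta = eta                    # eta <= 0.5 keeps squared loss exp-concave
        self.log_weights = [0.0] * len(self.functions)

    def weights(self):
        # Normalize in log space for numerical stability.
        m = max(self.log_weights)
        unnorm = [math.exp(lw - m) for lw in self.log_weights]
        total = sum(unnorm)
        return [u / total for u in unnorm]

    def predict(self, context, action):
        # Predict the weighted mean of the functions' predictions.
        w = self.weights()
        return sum(wi * f(context, action) for wi, f in zip(w, self.functions))

    def update(self, context, action, loss):
        # Hedge update: penalize each function by its squared error.
        for i, f in enumerate(self.functions):
            self.log_weights[i] -= self.eta * (f(context, action) - loss) ** 2


def run_demo(T=500, seed=0):
    """Run the forecaster on synthetic data; return (regret, log|F| / eta)."""
    rng = random.Random(seed)
    # Hypothetical function class: clipped linear functions of the context.
    functions = [
        (lambda x, a, c=c: min(1.0, max(0.0, c * x)))
        for c in (0.0, 0.25, 0.5, 0.75, 1.0)
    ]
    oracle = ExpWeightsRegressor(functions)
    oracle_loss = 0.0
    per_function_loss = [0.0] * len(functions)
    for _ in range(T):
        x = rng.random()
        a = rng.randrange(3)
        y = min(1.0, max(0.0, 0.5 * x + rng.uniform(-0.1, 0.1)))
        p = oracle.predict(x, a)
        oracle_loss += (p - y) ** 2
        for i, f in enumerate(functions):
            per_function_loss[i] += (f(x, a) - y) ** 2
        oracle.update(x, a, y)
    regret = oracle_loss - min(per_function_loss)
    return regret, math.log(len(functions)) / oracle.eta
```

Calling `run_demo()` returns the forecaster's cumulative squared-loss regret together with the $\log |\mathcal{F}| / \eta$ bound it should not exceed. In the delayed setting of the abstract, `update` would be called only once a loss observation arrives, which is where a stability parameter such as $\beta$ enters the analysis.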
Problem

Research questions and friction points this paper is trying to address.

Minimizing regret in adversarial contextual bandits with delayed feedback
Addressing general function approximation with online regression oracles
Analyzing stability and regret bounds for finite policy classes and (possibly infinite) function classes
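As a rough illustration of how the finite-class upper and lower bounds from the abstract compare, the following sketch evaluates their terms numerically. The function names are hypothetical, and all constants hidden by the $O(\cdot)$ and $\Omega(\cdot)$ notation are set to 1, so the outputs indicate scaling only, not actual regret values.

```python
import math


def finite_class_upper_bound(K, T, D, d_max, n_functions):
    """sqrt(K T log|F|) + sqrt(d_max D log|F|), with constants set to 1."""
    log_f = math.log(n_functions)
    return math.sqrt(K * T * log_f) + math.sqrt(d_max * D * log_f)


def finite_class_lower_bound(K, T, D, n_functions):
    """sqrt(K T log|F|) + sqrt(D log|F|), with constants set to 1."""
    log_f = math.log(n_functions)
    return math.sqrt(K * T * log_f) + math.sqrt(D * log_f)


def delay_term_gap(D, d_max, n_functions):
    """Ratio of the delay terms in the upper and lower bounds: sqrt(d_max)."""
    log_f = math.log(n_functions)
    return math.sqrt(d_max * D * log_f) / math.sqrt(D * log_f)
```

For example, with a fixed delay of 16 rounds on each of $T = 10000$ rounds (so $d_{\max} = 16$ and $D$ roughly $160000$), `delay_term_gap(160000, 16, 8)` evaluates to $\sqrt{16} = 4$ up to floating-point error, which is the $\sqrt{d_{\max}}$ gap the abstract mentions.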
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses an online least-squares regression oracle to handle general function approximation
Handles feedback delays chosen by an adversary
Bounds the delay-dependent regret term through a stability analysis of the oracle
Authors
Orin Levy
Blavatnik School of Computer Science, Tel Aviv University
Liad Erez
Blavatnik School of Computer Science, Tel Aviv University
Alon Cohen
Tel-Aviv University and Google
Machine Learning
Yishay Mansour
Tel Aviv University
Machine learning, reinforcement learning, algorithmic game theory