Bandit and Delayed Feedback in Online Structured Prediction

📅 2025-02-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper addresses online structured prediction under incomplete feedback, specifically bandit feedback (only the chosen prediction is evaluated) and delayed feedback. Large output spaces make full feedback impractical, so the authors propose a pseudo-inverse matrix-based gradient estimator, combined with inverse propensity scoring and delayed online optimization. On the theory side, they establish the first *K*-independent surrogate regret bound of $O(T^{2/3})$ for the bandit setting; for the delayed full-information and delayed bandit settings, they derive surrogate regret upper bounds of $O(D^{2/3} T^{1/3})$ and $O((DT)^{2/3})$, respectively, removing the usual dependence on the output space size *K*. Numerical experiments evaluate the proposed algorithms on online classification with bandit feedback.

📝 Abstract

Online structured prediction is the task of sequentially predicting outputs with complex structures based on inputs and past observations, encompassing online classification. Recent studies showed that in the full information setup, we can achieve finite bounds on the surrogate regret, i.e., the extra target loss relative to the best possible surrogate loss. In practice, however, full information feedback is often unrealistic as it requires immediate access to the whole structure of complex outputs. Motivated by this, we propose algorithms that work with less demanding forms of feedback: bandit and delayed feedback. For the bandit setting, using a standard inverse-weighted gradient estimator, we achieve a surrogate regret bound of $O(\sqrt{KT})$ for the time horizon $T$ and the size of the output set $K$. However, $K$ can be extremely large when outputs are highly complex, making this result less desirable. To address this, we propose an algorithm that achieves a surrogate regret bound of $O(T^{2/3})$, which is independent of $K$. This is enabled by a carefully designed pseudo-inverse matrix estimator. Furthermore, for the delayed full information feedback setup, we obtain a surrogate regret bound of $O(D^{2/3} T^{1/3})$ for the delay time $D$. We also provide algorithms for the delayed bandit feedback setup. Finally, we numerically evaluate the performance of the proposed algorithms in online classification with bandit feedback.
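The "standard inverse-weighted gradient estimator" mentioned in the abstract can be illustrated with a minimal sketch for plain multiclass classification. This is not the paper's pseudo-inverse estimator and is not taken from the paper's code. It assumes a multiclass logistic surrogate loss, a sampling distribution that mixes the greedy prediction with uniform exploration, and feedback that reveals only whether the sampled label was correct. All function names and the exploration rate `q` are hypothetical.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D score vector."""
    z = scores - scores.max()
    e = np.exp(z)
    return e / e.sum()

def logistic_grad(scores, y):
    """Gradient of the multiclass logistic loss w.r.t. the scores, for true label y."""
    g = softmax(scores)
    g[y] -= 1.0
    return g

def sampling_distribution(scores, q):
    """Assumed exploration scheme: greedy label with prob. 1 - q, uniform with prob. q."""
    k = len(scores)
    p = np.full(k, q / k)
    p[np.argmax(scores)] += 1.0 - q
    return p

def ipw_grad_estimate(scores, p, y_hat, correct):
    """Inverse-weighted gradient estimate from bandit feedback.

    Only `correct` = 1[y_hat == y] is observed. When the sampled label is
    correct, the true label is known, so the full gradient can be formed and
    reweighted by 1 / p[y_hat]; otherwise the estimate is zero.
    """
    if not correct:
        return np.zeros_like(scores)
    return logistic_grad(scores, y_hat) / p[y_hat]

def expected_estimate(scores, p, y):
    """Exact expectation of the estimator over the sampled label (for checking)."""
    return sum(p[k] * ipw_grad_estimate(scores, p, k, k == y) for k in range(len(p)))
```

Because every label has probability at least `q / K`, the estimator is unbiased for the full-information gradient, but its magnitude can scale with `K / q`. This is the usual source of the dependence on $K$ in bounds of the $O(\sqrt{KT})$ type, which the abstract says the pseudo-inverse estimator avoids.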

Problem

Research questions and friction points this paper is trying to address.

Online structured prediction with limited feedback

Bandit and delayed feedback algorithms

Surrogate regret bounds optimization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Bandit feedback optimization

Pseudo-inverse matrix estimator

Delayed feedback algorithms

🔎 Similar Papers

Neural Dueling Bandits