Linear and Neural Dueling Bandits with Delayed Feedback

📅 2026-05-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the estimation bias that stochastic delayed feedback introduces in contextual dueling bandits. The authors propose two algorithms, one based on a linear model and one on neural network function approximation, that integrate inverse probability weighting (IPW) directly into the dueling loss function to correct for delayed or missing feedback without bias. Because dueling bandit estimators lack closed-form solutions, naive adaptations of standard weighting techniques are biased; weighting the loss itself avoids this. The method attains an $O(d\sqrt{T})$ regret bound in the linear setting and sublinear regret in the neural setting. Experiments on synthetic and real-world datasets support its effectiveness.
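A short derivation may make the unbiasedness claim concrete. It is a sketch under assumptions not stated in the summary: round $s$ has a per-round dueling loss $\ell_s(\theta)$, its delay $\tau_s$ is independent of the preference outcome, and the probability $p_{s,t} = \Pr(\tau_s \le t - s) > 0$ that its feedback has arrived by round $t$ is known. The symbols are illustrative and need not match the paper's notation.

```latex
\hat{L}_t(\theta) = \sum_{s=1}^{t} \frac{\mathbb{1}\{\tau_s \le t - s\}}{p_{s,t}}\, \ell_s(\theta),
\qquad
\mathbb{E}_{\tau}\bigl[\hat{L}_t(\theta)\bigr]
= \sum_{s=1}^{t} \frac{\Pr(\tau_s \le t - s)}{p_{s,t}}\, \ell_s(\theta)
= \sum_{s=1}^{t} \ell_s(\theta).
```

Taking the expectation over delays only, each indicator averages to $p_{s,t}$, which cancels the weight, so the weighted loss matches the full-feedback loss in expectation.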

📝 Abstract

Contextual dueling bandits form a cornerstone of preference-based decision-making, with critical applications in recommender systems and large language model alignment. However, standard algorithms rely on the idealized assumption of immediate feedback, a condition frequently violated in real-world scenarios such as prompt optimization. This setting introduces a unique theoretical challenge: unlike linear bandits, dueling bandit estimators lack closed-form solutions, rendering naive adaptations of standard weighting techniques biased. To address this, we formalize the problem of Contextual Dueling Bandits with Stochastic Delayed Feedback and propose two novel algorithms: Linear (LDB-DF) and Neural (NDB-DF) Dueling Bandits with Delayed Feedback. Central to our approach is a novel estimator that integrates an Inverse Probability Weighting (IPW) mechanism directly into the loss function, ensuring unbiased correction for delayed or missing feedback. We provide comprehensive theoretical analysis, establishing an $O(d\sqrt{T})$ regret bound for the linear setting and sub-linear guarantees for the neural setting. Extensive experiments on both simulated and real-world datasets demonstrate the effectiveness of our proposed algorithms.
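To illustrate what weighting the loss can look like in the linear case, here is a minimal Python sketch. It is not the paper's LDB-DF estimator: it assumes a logistic (Bradley-Terry) preference model, an L2 regularizer, and known arrival probabilities. The function name `ipw_dueling_loss` and all parameter names are hypothetical.

```python
import numpy as np


def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))


def ipw_dueling_loss(theta, z, y, observed, p_obs, lam=1.0):
    """IPW-weighted logistic dueling loss and its gradient (illustrative).

    theta    : (d,) parameter vector
    z        : (n, d) feature differences phi(x, a1) - phi(x, a2)
    y        : (n,) 1.0 if the first arm was preferred, else 0.0;
               entries for unobserved rounds are ignored
    observed : (n,) bool, True if the feedback has arrived
    p_obs    : (n,) assumed-known probability that feedback has arrived
    lam      : L2 regularization strength (an assumption of this sketch)
    """
    # Inverse probability weights: 1/p for observed rounds, 0 otherwise.
    w = observed.astype(float) / p_obs
    # Mask labels of unobserved rounds so placeholders (e.g. NaN) are harmless.
    y_safe = np.where(observed, y, 0.0)
    logits = z @ theta
    # Negative log-likelihood of the logistic model: log(1 + e^u) - y * u.
    nll = np.logaddexp(0.0, logits) - y_safe * logits
    loss = np.sum(w * nll) + 0.5 * lam * (theta @ theta)
    grad = z.T @ (w * (sigmoid(logits) - y_safe)) + lam * theta
    return loss, grad
```

Since this objective has no closed-form minimizer, the returned gradient would be passed to an iterative solver such as gradient descent. In practice the arrival probabilities would come from a known or estimated delay distribution, which this sketch takes as given.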

Problem

Research questions and friction points this paper is trying to address.

Contextual Dueling Bandits

Delayed Feedback

Preference-based Learning

Stochastic Delay

Innovation

Methods, ideas, or system contributions that make the work stand out.

Delayed Feedback

Dueling Bandits

Inverse Probability Weighting

Contextual Bandits