🤖 AI Summary
This work addresses the estimation bias that stochastic delayed feedback causes in contextual dueling bandits. The authors propose two algorithms, one based on a linear model (LDB-DF) and the other on neural network function approximation (NDB-DF), that integrate inverse probability weighting (IPW) directly into the dueling loss function to obtain an unbiased correction for delayed or missing feedback. Because dueling bandit estimators lack closed-form solutions, naive adaptations of standard weighting techniques are biased; placing the weights inside the loss avoids this bias. The method attains an $O(d\sqrt{T})$ regret bound in the linear setting and sublinear regret guarantees in the neural setting. Experiments on both synthetic and real-world datasets demonstrate its effectiveness.
📝 Abstract
Contextual dueling bandits form a cornerstone of preference-based decision-making, with critical applications in
recommender systems and large language model alignment. However, standard algorithms rely on the idealized assumption
of immediate feedback, a condition frequently violated in real-world scenarios such as prompt optimization. Delayed
feedback introduces a unique theoretical challenge: unlike linear bandits, dueling bandit estimators lack closed-form
solutions, rendering naive adaptations of standard weighting techniques biased. To address this, we formalize the
problem of Contextual Dueling Bandits with Stochastic Delayed Feedback and propose two novel algorithms: Linear
(LDB-DF) and Neural (NDB-DF) Dueling Bandits with Delayed Feedback. Central to our approach is a novel estimator that
integrates an Inverse Probability Weighting (IPW) mechanism directly into the loss function, ensuring unbiased
correction for delayed or missing feedback. We provide a comprehensive theoretical analysis, establishing an
$O(d\sqrt{T})$ regret bound for the linear setting and sublinear guarantees for the neural setting. Extensive
experiments on both simulated and real-world datasets demonstrate the effectiveness of our proposed algorithms.
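
To illustrate the idea of placing IPW weights inside the loss, the following is a minimal sketch and not the paper's exact estimator. It assumes a logistic (Bradley-Terry) preference model over feature differences z_i, an indicator o_i of whether the feedback for duel i has arrived, and a known observation probability p_i. Under those assumptions each duel's log-loss is weighted by o_i / p_i, so the weighted loss equals the full-feedback loss in expectation. All names (`ipw_dueling_loss`, `fit`, `p_obs`, `lam`) are hypothetical.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def ipw_dueling_loss(theta, Z, y, observed, p_obs, lam=1.0):
    """IPW-weighted logistic dueling loss and its gradient.

    Z        : (n, d) feature differences phi(x, a1) - phi(x, a2)
    y        : (n,) preference labels in {0, 1}; entries for unobserved
               duels may hold any finite placeholder (their weight is 0)
    observed : (n,) 1.0 if the feedback has arrived, else 0.0
    p_obs    : (n,) assumed-known probability that the feedback is observed
    """
    w = observed / p_obs                      # inverse probability weights
    u = Z @ theta
    # Numerically stable logistic negative log-likelihood per duel.
    nll = y * np.logaddexp(0.0, -u) + (1.0 - y) * np.logaddexp(0.0, u)
    loss = np.sum(w * nll) + 0.5 * lam * (theta @ theta)
    grad = Z.T @ (w * (sigmoid(u) - y)) + lam * theta
    return loss, grad

def fit(Z, y, observed, p_obs, lam=1.0, lr=0.5, steps=500):
    """Plain gradient descent; there is no closed-form minimizer."""
    theta = np.zeros(Z.shape[1])
    n = len(y)
    for _ in range(steps):
        _, g = ipw_dueling_loss(theta, Z, y, observed, p_obs, lam)
        theta -= lr * g / n
    return theta

# Synthetic example (illustrative values only).
rng = np.random.default_rng(0)
d, n = 5, 2000
theta_star = rng.normal(size=d)
Z = rng.normal(size=(n, d))
y = (rng.random(n) < sigmoid(Z @ theta_star)).astype(float)
p_obs = np.full(n, 0.6)                       # assumed observation probability
observed = (rng.random(n) < p_obs).astype(float)

theta_hat = fit(Z, y, observed, p_obs)
```

The estimate is obtained iteratively because the weighted logistic loss has no closed-form minimizer, which is the property the abstract identifies as the obstacle to reusing linear-bandit weighting schemes directly.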