Biased Dueling Bandits with Stochastic Delayed Feedback

2024-08-26
arXiv.org
Citations: 1
Influential: 0

AI Summary
This work studies the preference-biased dueling bandits problem with stochastic delayed feedback, where delayed reward feedback prevents timely policy updates in online recommendation and advertising. It is presented as the first to model the coupled effect of delay and preference bias, and it proposes two adaptive algorithms: one for settings where the delay distribution is known, and one for more realistic settings where only the expected delay is available. Methodologically, the work combines a delay-aware UCB framework, bias-corrected pairwise comparison estimation, and a regret analysis. Theoretically, both algorithms are reported to achieve an $O(\sqrt{T})$ regret bound, matching that of the non-delayed setting. Empirical evaluation on synthetic and real-world datasets is reported to support the theoretical bounds and to show performance gains over baselines.

Abstract
The dueling bandit problem, an important variant of the classical multi-armed bandit problem, has attracted growing attention because of its broad applications in online advertising, recommendation systems, information retrieval, and more. In many real-world applications, however, feedback for actions is subject to unavoidable delays and is not immediately available to the agent. This partial observability poses a significant challenge for the existing dueling bandit literature, because it limits how quickly and accurately the agent can update its policy on the fly. In this paper, we introduce and examine the biased dueling bandit problem with stochastic delayed feedback, a more realistic setting that additionally involves a preference bias between the selections. We present two algorithms designed to handle delay. The first, which requires complete information about the delay distribution, achieves the optimal regret bound of the dueling bandit problem without delay. The second is tailored to situations where the delay distribution is unknown and only the expected delay is available. We provide a comprehensive regret analysis for both algorithms and evaluate their empirical performance on both synthetic and real datasets.
Problem

Research questions and friction points this paper is trying to address.

Addressing biased dueling bandits with delayed feedback
Handling stochastic delays in action feedback
Developing algorithms for unknown delay distributions
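
To make the setting concrete, here is a minimal toy simulator of duels whose outcomes arrive after a random delay. It is a sketch under stated assumptions, not the paper's model: the class name `DelayedDuelEnv`, the additive `bias` toward the first-listed arm, and the rounded-exponential delay are all hypothetical choices made only for illustration.

```python
import heapq
import random


class DelayedDuelEnv:
    """Toy environment: each duel outcome is revealed after a random delay."""

    def __init__(self, pref, bias=0.0, mean_delay=5.0, seed=0):
        self.pref = pref              # pref[i][j]: assumed P(i beats j) without bias
        self.bias = bias              # hypothetical additive bias toward the first-listed arm
        self.mean_delay = mean_delay  # scale of the assumed delay distribution
        self.rng = random.Random(seed)
        self.pending = []             # min-heap of (arrival_round, i, j, i_won)
        self.t = 0

    def step(self, i, j):
        """Play the duel (i, j) and return the outcomes that arrive this round."""
        self.t += 1
        p = min(1.0, max(0.0, self.pref[i][j] + self.bias))
        i_won = self.rng.random() < p
        # Assumption: delay is a truncated exponential draw, so its mean is
        # only roughly mean_delay.
        if self.mean_delay > 0:
            delay = int(self.rng.expovariate(1.0 / self.mean_delay))
        else:
            delay = 0
        heapq.heappush(self.pending, (self.t + delay, i, j, i_won))

        arrived = []
        while self.pending and self.pending[0][0] <= self.t:
            _, a, b, won = heapq.heappop(self.pending)
            arrived.append((a, b, won))
        return arrived
```

A learner interacting with such an environment sees only the `arrived` list each round, so at any time some of its past duels are still unobserved. That gap between duels played and outcomes seen is the partial observability the abstract refers to.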
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces biased dueling bandits with delayed feedback
Presents two algorithms for known and unknown delay distributions
Achieves the optimal no-delay regret bound when the delay distribution is known
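
The general idea of learning from delayed pairwise feedback can be sketched with a generic optimism-based (UCB-style) heuristic that updates its estimates only from outcomes that have actually arrived. This is not either of the paper's two algorithms: it performs no bias correction and uses no delay information, and the class name `PairwiseUCB`, the confidence radius, and the arm-selection rule are all assumptions chosen for illustration.

```python
import math


class PairwiseUCB:
    """Generic UCB-style pair selection over observed duel outcomes (n_arms >= 2)."""

    def __init__(self, n_arms):
        self.n = n_arms
        # wins[i][j]: observed wins of i over j; plays[i][j]: observed duels of i vs j
        self.wins = [[0] * n_arms for _ in range(n_arms)]
        self.plays = [[0] * n_arms for _ in range(n_arms)]

    def update(self, arrived):
        """Fold in only the outcomes that have arrived, as (i, j, i_won) tuples."""
        for i, j, i_won in arrived:
            self.plays[i][j] += 1
            self.plays[j][i] += 1
            if i_won:
                self.wins[i][j] += 1
            else:
                self.wins[j][i] += 1

    def ucb(self, i, j, t):
        """Optimistic estimate of P(i beats j) at round t."""
        if i == j:
            return 0.5
        n = self.plays[i][j]
        if n == 0:
            return 1.0  # optimistic default for pairs with no observed feedback
        radius = math.sqrt(2.0 * math.log(max(t, 2)) / n)
        return min(1.0, self.wins[i][j] / n + radius)

    def select(self, t):
        """Pick a pair of distinct arms to duel at round t."""
        def score(i):
            others = [j for j in range(self.n) if j != i]
            return sum(self.ucb(i, j, t) for j in others) / len(others)

        # First arm: highest average optimistic win rate against the others.
        first = max(range(self.n), key=score)
        # Second arm: the opponent with the best optimistic chance of beating it.
        second = max((j for j in range(self.n) if j != first),
                     key=lambda j: self.ucb(j, first, t))
        return first, second
```

Because `update` consumes only arrived outcomes, the confidence radius shrinks with the number of observed duels rather than the number played, which is one simple way delay slows learning in a scheme like this.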