Best-of-Both-Worlds Multi-Dueling Bandits: Unified Algorithms for Stochastic and Adversarial Preferences under Condorcet and Borda Objectives

πŸ“… 2026-03-19
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the challenge of handling both stochastic and adversarial preferences in multi-dueling bandits when the environment type is unknown. It proposes a unified framework that requires no prior knowledge of the regime: a MetaDueling black-box reduction transforms multi-way winner feedback into unbiased pairwise signals, which, instantiated with the Versatile-DB dueling-bandit algorithm, handles the Condorcet objective, while a newly designed AlgBorda algorithm handles the Borda objective. Notably, the approach achieves the first "best-of-both-worlds" guarantees in multi-dueling settings: under the Condorcet objective, it simultaneously attains an adversarial pseudo-regret of $O(\sqrt{KT})$ and an instance-optimal stochastic pseudo-regret of $O(\sum_{i \neq a^\star} \log T / \Delta_i)$; under the Borda objective, it also reaches near-optimal regret bounds. The Condorcet results match existing lower bounds, and the Borda bounds are near-optimal, within a factor of $K$ of the lower bounds.

πŸ“ Abstract
Multi-dueling bandits, where a learner selects $m \geq 2$ arms per round and observes only the winner, arise naturally in many applications including ranking and recommendation systems, yet a fundamental question has remained open: can a single algorithm perform optimally in both stochastic and adversarial environments, without knowing which regime it faces? We answer this affirmatively, providing the first best-of-both-worlds algorithms for multi-dueling bandits under both Condorcet and Borda objectives. For the Condorcet setting, we propose \texttt{MetaDueling}, a black-box reduction that converts any dueling bandit algorithm into a multi-dueling bandit algorithm by transforming multi-way winner feedback into an unbiased pairwise signal. Instantiating our reduction with \texttt{Versatile-DB} yields the first best-of-both-worlds algorithm for multi-dueling bandits: it achieves $O(\sqrt{KT})$ pseudo-regret against adversarial preferences and the instance-optimal $O\!\left(\sum_{i \neq a^\star} \frac{\log T}{\Delta_i}\right)$ pseudo-regret under stochastic preferences, both simultaneously and without prior knowledge of the regime. For the Borda setting, we propose \texttt{AlgBorda}, a stochastic-and-adversarial algorithm that achieves $O\left(K^2 \log KT + K \log^2 T + \sum_{i: \Delta_i^{\mathrm{B}} > 0} \frac{K\log KT}{(\Delta_i^{\mathrm{B}})^2}\right)$ regret in stochastic environments and $O\left(K \sqrt{T \log KT} + K^{1/3} T^{2/3} (\log K)^{1/3}\right)$ regret against adversaries, again without prior knowledge of the regime. We complement our upper bounds with matching lower bounds for the Condorcet setting. For the Borda setting, our upper bounds are near-optimal with respect to the lower bounds (within a factor of $K$) and match the best-known results in the literature.
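The abstract's central idea, extracting a pairwise signal from multi-way winner feedback, can be illustrated with a small sketch. This is a hypothetical illustration, not the paper's actual \texttt{MetaDueling} construction: it assumes a Luce-choice winner model (winner sampled in proportion to hypothetical arm utilities), under which $P(\text{winner}=i \mid \text{winner}\in\{i,j\}) = u_i/(u_i+u_j)$, so restricting attention to rounds where the winner lands in a tracked pair yields a genuine pairwise comparison between those two arms.

```python
import random

# Illustrative sketch only (assumed Luce-choice model, hypothetical utilities);
# the paper's MetaDueling reduction may construct its pairwise signal differently.

def luce_winner(arms, utils, rng):
    """Sample the multi-duel winner: each arm wins proportionally to its utility."""
    return rng.choices(arms, weights=[utils[a] for a in arms], k=1)[0]

def pairwise_estimate(i, j, arms, utils, rounds, seed=0):
    """Estimate P(i beats j) using only rounds whose winner is i or j."""
    rng = random.Random(seed)
    wins_i = relevant = 0
    for _ in range(rounds):
        w = luce_winner(arms, utils, rng)
        if w in (i, j):          # keep only rounds informative about the pair
            relevant += 1
            wins_i += (w == i)
    return wins_i / relevant

utils = {0: 3.0, 1: 1.0, 2: 2.0, 3: 1.0}   # hypothetical arm utilities
est = pairwise_estimate(0, 1, arms=[0, 1, 2, 3], utils=utils, rounds=20000)
true_p = utils[0] / (utils[0] + utils[1])   # = 0.75 under the Luce model
print(est, true_p)
```

With enough rounds the conditional estimate concentrates around the pairwise probability $u_i/(u_i+u_j)$, which is the sense in which a multi-way winner can serve as an unbiased pairwise signal for a dueling-bandit subroutine.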
Problem

Research questions and friction points this paper is trying to address.

multi-dueling bandits
best-of-both-worlds
stochastic preferences
adversarial preferences
Condorcet and Borda objectives
Innovation

Methods, ideas, or system contributions that make the work stand out.

best-of-both-worlds
multi-dueling bandits
Condorcet objective
Borda objective
black-box reduction
πŸ”Ž Similar Papers
2024-05-25arXiv.orgCitations: 1
S. Akash
Indian Institute of Technology Patna
Pratik Gajane
Unknown affiliation
Research interests: sequential decision making, reinforcement learning, fairness in machine learning, privacy in machine learning
Jawar Singh
Indian Institute of Technology Patna