Neural Dueling Bandits

📅 2024-07-24
🏛️ arXiv.org
📈 Citations: 4
Influential: 1
🤖 AI Summary
This work addresses a key limitation of contextual dueling bandits: existing algorithms assume a linear reward function, an assumption that breaks down when rewards are non-linear. The authors propose the first neural-network-based preference-learning framework for dueling bandits, introducing upper confidence bound- and Thompson sampling-based algorithms, both designed for binary preference feedback, and establish sub-linear regret bounds for both under standard neural tangent kernel (NTK) assumptions. The contributions are threefold: (i) the linear reward constraint is lifted, establishing a new paradigm for modeling non-linear contextual preferences; (ii) the algorithms are computationally efficient and carry rigorous theoretical guarantees; and (iii) empirical evaluations on synthetic benchmarks show significant improvements over linear baselines, particularly in capturing high-dimensional, non-linear user preference structures, supporting applicability to real-world preference-based decision-making tasks such as recommender systems and search ranking.

📝 Abstract
The contextual dueling bandit models problems in which a learner's goal is to find the best arm for a given context using noisy human preference feedback observed over the arms selected in past contexts. However, existing algorithms assume a linear reward function, whereas in many real-life applications, such as online recommendation or ranking web search results, the reward function can be complex and non-linear. To overcome this challenge, we use a neural network to estimate the reward function from the preference feedback for previously selected arms. We propose upper confidence bound- and Thompson sampling-based algorithms with sub-linear regret guarantees that efficiently select arms in each round. We also extend our theoretical results to contextual bandit problems with binary feedback, which is itself a non-trivial contribution. Experimental results on problem instances derived from synthetic datasets corroborate our theoretical results.
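The core modeling idea in the abstract, fitting a neural reward estimator from binary preference feedback, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the quadratic true reward, the Bradley-Terry preference model, the one-hidden-layer network, and all sizes and learning rates are assumptions chosen for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical non-linear ground truth: reward r(x) = 0.5 * (w.x)^2,
# which no linear model can rank correctly.
d, hidden = 4, 32
w_true = rng.normal(size=d)

def true_reward(x):
    return 0.5 * (x @ w_true) ** 2

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One-hidden-layer network f(x) = v.tanh(Wx + b) as the reward estimator.
W = rng.normal(scale=0.5, size=(hidden, d))
b = rng.normal(scale=0.5, size=hidden)
v = rng.normal(scale=0.5, size=hidden)

def f(X):
    return np.tanh(X @ W.T + b) @ v

# Binary preference data under a Bradley-Terry model:
# P(x1 preferred over x2) = sigmoid(r(x1) - r(x2)).
n = 2000
X1 = rng.normal(size=(n, d))
X2 = rng.normal(size=(n, d))
y = (rng.random(n) < sigmoid(true_reward(X1) - true_reward(X2))).astype(float)

# Fit by full-batch gradient descent on the logistic log-loss of the duel logit.
lr = 0.2
for _ in range(3000):
    h1 = np.tanh(X1 @ W.T + b)
    h2 = np.tanh(X2 @ W.T + b)
    g = sigmoid(h1 @ v - h2 @ v) - y          # dLoss/dlogit per sample
    grad_v = (g[:, None] * (h1 - h2)).mean(axis=0)
    grad_b = (g[:, None] * v * ((1 - h1**2) - (1 - h2**2))).mean(axis=0)
    grad_W = ((g[:, None] * (1 - h1**2) * v).T @ X1
              - (g[:, None] * (1 - h2**2) * v).T @ X2) / n
    v -= lr * grad_v
    b -= lr * grad_b
    W -= lr * grad_W

# The learned f should rank fresh arm pairs consistently with the true reward.
Xt1 = rng.normal(size=(500, d))
Xt2 = rng.normal(size=(500, d))
acc = np.mean((f(Xt1) > f(Xt2)) == (true_reward(Xt1) > true_reward(Xt2)))
print(f"pairwise ranking accuracy: {acc:.2f}")
```

Note that the estimator only ever sees binary win/lose labels, never reward values, yet it recovers the ranking induced by the non-linear reward.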
Problem

Research questions and friction points this paper is trying to address.

Modeling preference-based optimization with human feedback
Overcoming linear reward function limitations in real applications
Extending theoretical results to contextual bandits with binary feedback
Innovation

Methods, ideas, or system contributions that make the work stand out.

Neural network estimates reward from preferences
UCB and Thompson sampling with sub-linear regret
Extends theory to contextual bandits with binary feedback
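As a rough illustration of how a UCB rule can drive arm selection with a neural estimator, the sketch below scores each candidate arm by its estimated reward plus a gradient-based exploration bonus sqrt(g' Z^-1 g), where g is the network's parameter gradient at that arm and Z accumulates past gradients. This is a sketch of one common dueling-bandit selection pattern (greedy first arm, optimistic challenger), not necessarily the paper's exact rule; the network, arm set, and constants (lam, beta) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative network v.tanh(Wx) with p parameters in total.
d, hidden, lam, beta = 4, 8, 1.0, 1.0
W = rng.normal(scale=0.5, size=(hidden, d))
v = rng.normal(scale=0.5, size=hidden)
p = hidden * d + hidden

def f_and_grad(x):
    """Network output and its flattened parameter gradient at arm x."""
    h = np.tanh(W @ x)
    gW = np.outer(v * (1 - h**2), x)   # d out / dW
    return v @ h, np.concatenate([gW.ravel(), h])  # h = d out / dv

Z = lam * np.eye(p)                # regularized gram matrix of past gradients
arms = rng.normal(size=(10, d))    # candidate arms for the current context

vals, bonuses = [], []
for x in arms:
    out, g = f_and_grad(x)
    vals.append(out)
    bonuses.append(beta * np.sqrt(g @ np.linalg.solve(Z, g)))
vals, bonuses = np.array(vals), np.array(bonuses)

# First arm: greedy on the estimated reward.
first = int(np.argmax(vals))
# Second arm: the most optimistic challenger among the remaining arms.
ucb = vals + bonuses
ucb[first] = -np.inf
second = int(np.argmax(ucb))

# After observing the duel outcome, the gram matrix absorbs the chosen
# arms' gradients, shrinking their bonuses in later rounds.
for x in (arms[first], arms[second]):
    _, g = f_and_grad(x)
    Z += np.outer(g, g)

print(f"duel: arm {first} vs arm {second}")
```

The bonus plays the role of the confidence width in linear UCB, with the network's parameter gradient standing in for the arm's feature vector, which is the standard NTK-inspired construction behind neural bandit algorithms.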