Preference-based Reinforcement Learning beyond Pairwise Comparisons: Benefits of Multiple Options

📅 2025-10-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the low sample efficiency of preference-based reinforcement learning (PbRL), which typically relies solely on pairwise comparisons. To improve online policy learning, we propose leveraging multi-way ranking feedback. Unlike prior approaches, whose performance guarantees degrade as the size of the comparison subset increases, we establish, for the first time, a theoretical guarantee that sample complexity improves significantly with larger subset sizes. We design M-AUPO, an algorithm that models ranking feedback via the Plackett–Luce model and selects action subsets by maximizing average uncertainty. Our theoretical analysis yields a suboptimality bound of $\tilde{\mathcal{O}}\big(\frac{d}{T}\sqrt{\sum_{t=1}^T |S_t|^{-1}}\big)$ and a near-matching lower bound of $\Omega\big(\frac{d}{K\sqrt{T}}\big)$, eliminating the exponential dependence on the unknown parameter's norm. This enhances both the scalability and practical applicability of PbRL in high-dimensional settings.

📝 Abstract
We study online preference-based reinforcement learning (PbRL) with the goal of improving sample efficiency. While a growing body of theoretical work has emerged, motivated by PbRL's recent empirical success, particularly in aligning large language models (LLMs), most existing studies focus only on pairwise comparisons. A few recent works (Zhu et al., 2023; Mukherjee et al., 2024; Thekumparampil et al., 2024) have explored using multiple comparisons and ranking feedback, but their performance guarantees fail to improve, and can even deteriorate, as the feedback length increases, despite the richer information available. To address this gap, we adopt the Plackett-Luce (PL) model for ranking feedback over action subsets and propose M-AUPO, an algorithm that selects multiple actions by maximizing the average uncertainty within the offered subset. We prove that M-AUPO achieves a suboptimality gap of $\tilde{\mathcal{O}}\left( \frac{d}{T} \sqrt{ \sum_{t=1}^T \frac{1}{|S_t|}} \right)$, where $T$ is the total number of rounds, $d$ is the feature dimension, and $|S_t|$ is the size of the subset at round $t$. This result shows that larger subsets directly lead to improved performance and, notably, the bound avoids the exponential dependence on the unknown parameter's norm, which was a fundamental limitation in most previous works. Moreover, we establish a near-matching lower bound of $\Omega\left( \frac{d}{K \sqrt{T}} \right)$, where $K$ is the maximum subset size. To the best of our knowledge, this is the first theoretical result in PbRL with ranking feedback that explicitly shows improved sample efficiency as a function of the subset size.
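To make the ranking-feedback assumption concrete: under the Plackett-Luce model, a ranking over a subset is generated position by position, with each next item drawn with probability proportional to the exponential of its utility among the items not yet placed. The sketch below is illustrative only (it is not code from the paper), assuming scalar utility scores per action:

```python
import math
import random

def sample_pl_ranking(scores, rng=random):
    """Sample a full ranking from the Plackett-Luce model.

    Items are placed one position at a time: the next item is drawn
    with probability proportional to exp(score) among the remaining items.
    """
    remaining = list(range(len(scores)))
    ranking = []
    while remaining:
        weights = [math.exp(scores[i]) for i in remaining]
        total = sum(weights)
        r = rng.random() * total
        acc = 0.0
        for idx, w in zip(remaining, weights):
            acc += w
            if r <= acc:
                ranking.append(idx)
                remaining.remove(idx)
                break
    return ranking

# Example: an action with a much larger score is ranked first almost surely.
ranking = sample_pl_ranking([100.0, 0.0, -5.0], random.Random(1))
```

In the paper's setting the scores would come from a linear utility model over action features; here they are given directly for simplicity.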
Problem

Research questions and friction points this paper is trying to address.

Improving sample efficiency in preference-based reinforcement learning
Addressing limitations of pairwise comparisons with ranking feedback
Developing algorithms using multiple options for better performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adopts Plackett-Luce model for ranking feedback
Proposes M-AUPO algorithm maximizing average uncertainty
Achieves improved sample efficiency with larger subsets
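The selection principle above can be sketched as follows. This is a simplified illustration, not M-AUPO itself: assuming a linear utility model, the uncertainty of an action with feature $x$ is the elliptical norm $\sqrt{x^\top V^{-1} x}$ under a regularized design matrix $V$, and for a fixed subset size $k$, maximizing the average uncertainty over the subset reduces to taking the $k$ most uncertain actions:

```python
import numpy as np

def select_subset(features, V, k):
    """Illustrative uncertainty-based subset selection (fixed size k).

    Scores each action by the elliptical norm sqrt(x^T V^{-1} x) and
    returns the indices of the k actions with the largest uncertainty,
    which maximizes the subset's average uncertainty at that size.
    """
    V_inv = np.linalg.inv(V)
    unc = np.array([np.sqrt(x @ V_inv @ x) for x in features])
    return list(np.argsort(-unc)[:k])

# Example with V = I, so uncertainty is just the Euclidean norm:
feats = [np.array([2.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 3.0])]
chosen = select_subset(feats, np.eye(2), 2)  # picks the two longest features
```

The actual algorithm's subset-size handling and confidence construction follow the paper's analysis; this sketch only conveys the average-uncertainty intuition.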
Joongkyu Lee
Seoul National University
Seouh-won Yi
Seoul National University
Min-hwan Oh
Seoul National University
Reinforcement Learning · Bandit Algorithms · Machine Learning