🤖 AI Summary
This paper addresses the low sample efficiency of preference-based reinforcement learning (PbRL), which typically relies solely on pairwise comparisons. To improve online policy learning, we propose leveraging multi-way ranking feedback. Unlike prior approaches, whose performance degrades as the size of the comparison subset increases, we establish, for the first time, a theoretical guarantee that sample complexity improves significantly with larger subset sizes. We design M-AUPO, an algorithm that models ranking feedback via the Plackett–Luce model and selects action subsets by maximizing average uncertainty. Our theoretical analysis yields a suboptimality bound of $\tilde{\mathcal{O}}\big(d T^{-1} \sqrt{\sum_{t=1}^T |S_t|^{-1}}\big)$ and a near-matching lower bound of $\Omega(d K^{-1} T^{-1/2})$, eliminating the exponential dependence on parameter norms. This enhances both the scalability and the practical applicability of PbRL in high-dimensional settings.
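The subset-selection principle described above can be illustrated with a short sketch in the linear-utility setting. This is an illustrative simplification, not the paper's exact rule: the feature matrix, the design matrix `V`, and the per-action elliptical norm $\|\phi(a)\|_{V^{-1}}$ as the uncertainty measure are all assumptions made for the sketch.

```python
import numpy as np

def select_subset_by_avg_uncertainty(features, V, subset_size):
    """Illustrative sketch of average-uncertainty subset selection.

    Scores each action a by its elliptical uncertainty ||phi(a)||_{V^-1},
    where V is a (regularized) design matrix accumulated from past
    feedback, and offers the subset with the largest average score.

    features:    (N, d) array of action feature vectors phi(a)
    V:           (d, d) positive-definite design matrix
    subset_size: number of actions to offer
    """
    V_inv = np.linalg.inv(V)
    # per-action uncertainty: sqrt(phi(a)^T V^{-1} phi(a))
    unc = np.sqrt(np.einsum("nd,de,ne->n", features, V_inv, features))
    # For fixed per-action scores, the subset maximizing the average
    # is just the subset_size most uncertain actions.
    return np.argsort(unc)[::-1][:subset_size].tolist()
```

With per-action scores held fixed, maximizing the average reduces to a top-$k$ choice; the algorithm in the paper may couple actions in more subtle ways, which this sketch does not capture.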
📝 Abstract
We study online preference-based reinforcement learning (PbRL) with the goal of improving sample efficiency. While a growing body of theoretical work has emerged, motivated by PbRL's recent empirical success, particularly in aligning large language models (LLMs), most existing studies focus only on pairwise comparisons. A few recent works (Zhu et al., 2023; Mukherjee et al., 2024; Thekumparampil et al., 2024) have explored using multiple comparisons and ranking feedback, but their performance guarantees fail to improve, and can even deteriorate, as the feedback length increases, despite the richer information available. To address this gap, we adopt the Plackett–Luce (PL) model for ranking feedback over action subsets and propose M-AUPO, an algorithm that selects multiple actions by maximizing the average uncertainty within the offered subset. We prove that M-AUPO achieves a suboptimality gap of $\tilde{\mathcal{O}}\left( \frac{d}{T} \sqrt{ \sum_{t=1}^T \frac{1}{|S_t|} } \right)$, where $T$ is the total number of rounds, $d$ is the feature dimension, and $|S_t|$ is the size of the subset at round $t$. This result shows that larger subsets directly lead to improved performance and, notably, the bound avoids the exponential dependence on the unknown parameter's norm, which was a fundamental limitation in most previous works. Moreover, we establish a near-matching lower bound of $\Omega\left( \frac{d}{K \sqrt{T}} \right)$, where $K$ is the maximum subset size. To the best of our knowledge, this is the first theoretical result in PbRL with ranking feedback that explicitly shows improved sample efficiency as a function of the subset size.
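For readers unfamiliar with the Plackett–Luce model adopted above, here is a minimal sketch of its ranking likelihood. The function name and the raw-utility parameterization are illustrative assumptions; in the linear setting of the paper, an action's utility would be an inner product between a feature vector and the unknown parameter.

```python
import numpy as np

def plackett_luce_prob(scores, ranking):
    """Probability of a full ranking under the Plackett-Luce model.

    scores:  utility score for each action in the offered subset
    ranking: action indices ordered from most to least preferred

    The PL model selects the top item with probability proportional to
    exp(score), removes it, and repeats on the remaining actions.
    """
    s = np.asarray(scores, dtype=float)
    prob = 1.0
    remaining = list(ranking)
    for i in ranking:
        # softmax over the actions still in contention (max-shifted
        # for numerical stability)
        w = np.exp(s[remaining] - s[remaining].max())
        prob *= w[remaining.index(i)] / w.sum()
        remaining.remove(i)
    return prob
```

With two equally scored actions, either ordering has probability 1/2, recovering the pairwise (Bradley–Terry) special case; the probabilities over all rankings of a subset sum to one.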