LiPO: Listwise Preference Optimization through Learning-to-Rank

📅 2024-02-02
🏛️ arXiv.org
📈 Citations: 38
Influential: 3
📄 PDF
🤖 AI Summary
This work addresses the information loss inherent in the binary pairwise preference feedback used by methods such as DPO and SLiC for large language model (LLM) preference alignment, and is the first to formulate alignment as a listwise learning-to-rank (LTR) problem. The proposed method, LiPO-λ, leverages multi-response ranking feedback and a LambdaRank-inspired gradient-weighting scheme to optimize the preference loss at the list level, without reinforcement learning or policy-gradient approximations. LiPO-λ enables end-to-end, direct optimization of ranking-aware objectives; by moving beyond pairwise constraints, it improves the efficiency and generalization of preference learning. Empirical evaluation across diverse alignment benchmarks, including those with ground-truth rankwise annotations, shows that LiPO-λ consistently outperforms DPO variants and SLiC, particularly in ranking fidelity and robustness to noisy or sparse preferences.
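To make the summary concrete, here is a minimal sketch of the core idea: each preference pair in the ranked list contributes a DPO-style logistic term, scaled by a LambdaRank-style weight. This is not the authors' implementation; the function name, the gain/discount choices, and the use of plain Python lists are illustrative assumptions.

```python
import math
from itertools import combinations

def lipo_lambda_loss(policy_logps, ref_logps, labels, beta=0.1):
    """Sketch of a LiPO-lambda-style listwise loss (illustrative, not the paper's code).

    policy_logps, ref_logps: log-probabilities of each response in the list
    under the policy and the frozen reference model.
    labels: graded relevance of each response (higher = more preferred).
    Each pair (i, j) with labels[i] > labels[j] contributes a pairwise
    logistic term on the implicit-reward gap, scaled by a LambdaRank-style
    |delta DCG| weight, so misordering highly ranked responses costs more.
    """
    # Implicit rewards, as in DPO: beta * (log pi - log pi_ref)
    rewards = [beta * (p - r) for p, r in zip(policy_logps, ref_logps)]
    # Current rank positions by reward (0 = best), used in the discount
    order = sorted(range(len(rewards)), key=lambda i: -rewards[i])
    rank = {i: pos for pos, i in enumerate(order)}
    gain = lambda label: 2 ** label - 1          # standard DCG gain
    disc = lambda pos: 1.0 / math.log2(pos + 2)  # standard DCG discount

    loss = 0.0
    for i, j in combinations(range(len(rewards)), 2):
        if labels[i] == labels[j]:
            continue  # ties carry no preference signal
        hi, lo = (i, j) if labels[i] > labels[j] else (j, i)
        # LambdaRank weight: |change in DCG if the two items swapped positions|
        w = abs(gain(labels[hi]) - gain(labels[lo])) * \
            abs(disc(rank[hi]) - disc(rank[lo]))
        # Pairwise logistic loss on the reward gap, as in DPO/RankNet
        loss += w * math.log(1 + math.exp(-(rewards[hi] - rewards[lo])))
    return loss
```

With list size two and the weight fixed to 1, the inner term reduces to the DPO objective, which matches the paper's observation that DPO is a special case of this listwise view.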

📝 Abstract
Aligning language models (LMs) with curated human feedback is critical to control their behaviors in real-world applications. Several recent policy optimization methods, such as DPO and SLiC, serve as promising alternatives to the traditional Reinforcement Learning from Human Feedback (RLHF) approach. In practice, human feedback often comes in the format of a ranked list over multiple responses to amortize the cost of reading the prompt. Multiple responses can also be ranked by reward models or AI feedback. However, there has been no thorough study of directly fitting a policy to a list of responses. In this work, we formulate LM alignment as a listwise ranking problem and describe the LiPO framework, where the policy can potentially learn more effectively from a ranked list of plausible responses given the prompt. This view draws an explicit connection to Learning-to-Rank (LTR), where most existing preference optimization work can be mapped to existing ranking objectives. Following this connection, we provide an examination of ranking objectives that are not well studied for LM alignment, with DPO and SLiC as special cases when the list size is two. In particular, we highlight a specific method, LiPO-λ, which leverages a state-of-the-art listwise ranking objective and weights each preference pair in a more advanced manner. We show that LiPO-λ can outperform DPO variants and SLiC by a clear margin on several preference alignment tasks with both curated and real rankwise preference data.
Problem

Research questions and friction points this paper is trying to address.

Language Model
Ranking Optimization
Human Preference
Innovation

Methods, ideas, or system contributions that make the work stand out.

LiPO Framework
Listwise Ranking
Advanced Listwise Ranking Objective