K-order Ranking Preference Optimization for Large Language Models

📅 2025-05-31
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In LLM-based ranking tasks, conventional full-order list optimization is misaligned with practical requirements, and feedback on tail positions is unreliable. To address this, we propose K-order Ranking Preference Optimization (KPO), a method that focuses on ranking consistency among the top-K items. The key contributions are: (1) the first extension of preference learning to top-K ranking, modeling the relative order of the top-K items via an extended Plackett–Luce model; and (2) a dynamic K-estimation mechanism coupled with a curriculum learning strategy, enabling query-adaptive ranking and efficient training. Evaluated on multi-task ranking benchmarks, KPO significantly outperforms list-wise baselines such as list-wise DPO, with superior sample efficiency and robustness to label noise. The implementation is publicly available.

📝 Abstract
To adapt large language models (LLMs) to ranking tasks, existing list-wise methods, represented by list-wise Direct Preference Optimization (DPO), focus on optimizing partial-order or full-order list ranking consistency for LLMs to enhance their ranking abilities. However, we argue that optimizing top-K ranking consistency could be more appropriate for real-world applications. There are two main reasons: (1) users are typically concerned with only the top-K results, making top-K ranking more important, and (2) tail items often lack precise feedback, making top-K ranking more reliable. Based on this, we propose K-order Ranking Preference Optimization (KPO) by extending DPO's Plackett-Luce model to accommodate top-K rankings. Additionally, recognizing that the number of important items can vary across queries, we extend KPO to dynamically determine the appropriate K for different samples and introduce a curriculum learning strategy to boost training efficiency. Extensive experiments demonstrate the effectiveness of KPO, highlighting its high sample efficiency and robustness to noise. The code is available at https://github.com/Lanyu0303/KPO.
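
The abstract stops short of the objective itself. As a rough illustration only, the truncated Plackett-Luce likelihood it describes can be sketched as below, assuming DPO-style implicit rewards (the β-scaled policy/reference log-ratio) and candidates already sorted best-to-worst by the label ranking; the function name `kpo_loss` and its arguments are illustrative, not the authors' implementation.

```python
import torch

def kpo_loss(policy_logps: torch.Tensor, ref_logps: torch.Tensor,
             k: int, beta: float = 0.1) -> torch.Tensor:
    """Top-K Plackett-Luce negative log-likelihood over DPO-style rewards.

    policy_logps, ref_logps: (n,) log-probs of the n candidate outputs under
    the trained policy and the frozen reference model, ordered best-to-worst
    by the label ranking. Only the first k factors of the Plackett-Luce
    product are kept, so tail positions never contribute gradient.
    """
    rewards = beta * (policy_logps - ref_logps)  # implicit DPO rewards, shape (n,)
    loss = rewards.new_zeros(())
    for i in range(k):
        # P(item i is ranked ahead of all remaining items i..n-1)
        loss = loss - (rewards[i] - torch.logsumexp(rewards[i:], dim=0))
    return loss
```

In this reading, k = 1 degenerates to a single softmax of the best item against all others, while k = n recovers the full-order list-wise objective, which is consistent with the abstract's framing of top-K ranking as a middle ground between the two.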
Problem

Research questions and friction points this paper is trying to address.

Optimizing top-K ranking consistency for LLMs
Adapting the ranking objective to a query-dependent K
Enhancing sample efficiency and noise robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends DPO's Plackett-Luce model for top-K rankings
Dynamically determines K per query for flexibility
Uses curriculum learning to enhance training efficiency (a hedged sketch of both mechanisms follows this list)
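
Neither the summary nor the abstract spells out how K is estimated per query or how the curriculum is scheduled. Purely as a hypothetical sketch under assumed mechanics, one could grow K while adjacent score gaps remain unambiguous and cap it with a warm-up schedule; `estimate_k`, `curriculum_k`, `min_gap`, and `warmup_epochs` are invented names and knobs, not taken from the paper.

```python
import torch

def estimate_k(scores: torch.Tensor, min_gap: float = 0.5) -> int:
    """Hypothetical per-query K: extend the top-K prefix while consecutive
    ranked items stay clearly separated; stop at the first ambiguous gap.
    `scores` is assumed sorted best-to-worst."""
    k = 1
    for i in range(scores.numel() - 1):
        if scores[i] - scores[i + 1] < min_gap:
            break
        k += 1
    return k

def curriculum_k(epoch: int, k_target: int, warmup_epochs: int = 3) -> int:
    """Hypothetical curriculum: start from pairwise supervision (K=2) and
    linearly raise the cap toward the per-sample K over the warm-up."""
    frac = min(epoch, warmup_epochs) / max(warmup_epochs, 1)
    return max(2, min(k_target, 2 + round((k_target - 2) * frac)))
```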
👥 Authors

Shihao Cai
University of Science and Technology of China
Large Language Models, Recommendation
Chongming Gao
University of Science and Technology of China
Yang Zhang
National University of Singapore
Wentao Shi
University of Science and Technology of China
Jizhi Zhang
University of Science and Technology of China
Recommendation, Trustworthy AI, Large Personalized Model
Keqin Bao
University of Science and Technology of China
Large Language Models, Recommender Systems
Qifan Wang
Meta AI
Fuli Feng
University of Science and Technology of China