In-context Ranking Preference Optimization

📅 2025-04-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address a practical challenge in information retrieval—users typically provide only sparse, context-dependent pairwise feedback rather than exhaustive pairwise comparisons—this paper proposes a ranking preference optimization method for large language models (LLMs) that operates directly on ranked lists constructed during inference. Methodologically, it introduces a differentiable formulation of discrete ranking metrics (e.g., NDCG) as position-weighted aggregations of pairwise preferences; theoretically, its gradient is shown to be equivalent to a reduced-variance importance sampling estimator that inherently emphasizes high-disagreement items. Building on this, the method extends the Direct Preference Optimization (DPO) framework with position-sensitive preference modeling and implicitly importance-weighted gradient computation. Empirically, it delivers substantial improvements over standard DPO on dialogue response ranking and summary quality ranking tasks, including a +12.3% gain in NDCG@5.
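To make the "differentiable formulation of discrete ranking metrics" concrete: one common way to smooth NDCG is to replace each item's discrete rank with a soft rank built from sigmoids of pairwise score differences. The sketch below illustrates that general technique only — the function name, temperature parameter, and exact smoothing are assumptions, not the paper's formulation.

```python
import numpy as np

def soft_ndcg(scores, relevance, tau=1.0):
    """Illustrative differentiable NDCG surrogate via sigmoid-smoothed ranks.

    scores:    model scores for the listed items, shape (n,)
    relevance: graded relevance labels, shape (n,)
    tau:       temperature; smaller values approach the discrete metric
    """
    scores = np.asarray(scores, dtype=float)
    relevance = np.asarray(relevance, dtype=float)
    n = len(scores)
    # Soft rank of item i: 1 + sum_{j != i} sigmoid((s_j - s_i) / tau).
    # The j == i term contributes sigmoid(0) = 0.5, which we subtract.
    diff = (scores[None, :] - scores[:, None]) / tau
    soft_ranks = 1.0 + 1.0 / (1.0 + np.exp(-diff)).sum(axis=1) * 0 \
        if False else 0.5 + (1.0 / (1.0 + np.exp(-diff))).sum(axis=1)
    gains = 2.0 ** relevance - 1.0
    dcg = np.sum(gains / np.log2(1.0 + soft_ranks))
    # Ideal DCG uses discrete positions of the relevance-sorted list.
    ideal = np.sort(relevance)[::-1]
    idcg = np.sum((2.0 ** ideal - 1.0) / np.log2(2.0 + np.arange(n)))
    return dcg / idcg
```

With well-separated scores that agree with the reference order, the surrogate approaches 1; swapping items lowers it smoothly, so gradients can flow through the score differences.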

📝 Abstract
Recent developments in Direct Preference Optimization (DPO) allow large language models (LLMs) to function as implicit ranking models by maximizing the margin between preferred and non-preferred responses. In practice, user feedback on such lists typically involves identifying a few relevant items in context rather than providing detailed pairwise comparisons for every possible item pair. Moreover, many complex information retrieval tasks, such as conversational agents and summarization systems, critically depend on ranking the highest-quality outputs at the top, emphasizing the need to support natural and flexible forms of user feedback. To address the challenge of limited and sparse pairwise feedback in the in-context setting, we propose an In-context Ranking Preference Optimization (IRPO) framework that directly optimizes LLMs based on ranking lists constructed during inference. To further capture flexible forms of feedback, IRPO extends the DPO objective by incorporating both the relevance of items and their positions in the list. Modeling these aspects jointly is non-trivial, as ranking metrics are inherently discrete and non-differentiable, making direct optimization difficult. To overcome this, IRPO introduces a differentiable objective based on positional aggregation of pairwise item preferences, enabling effective gradient-based optimization of discrete ranking metrics. We further provide theoretical insights showing that IRPO (i) automatically emphasizes items with greater disagreement between the model and the reference ranking, and (ii) links its gradient to an importance sampling estimator, yielding an unbiased estimator with reduced variance. Empirical results show IRPO outperforms standard DPO approaches in ranking performance, highlighting its effectiveness in aligning LLMs with direct in-context ranking preferences.
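The abstract describes extending the DPO objective with both item relevance and list position. As a hedged illustration of that idea (the function name, the log-discount weighting, and the per-item log-probability inputs are all assumptions, not the paper's exact objective), one can aggregate DPO-style pairwise log-sigmoid margins over every pair the reference ranking orders, discounting pairs by the preferred item's position:

```python
import math

def position_weighted_pairwise_loss(policy_logps, ref_logps, ranking, beta=0.1):
    """Illustrative position-weighted pairwise preference loss.

    policy_logps / ref_logps: per-item log-probabilities under the policy
    and the frozen reference model.
    ranking: item indices ordered best-first (the reference ranking).
    """
    loss, total_w = 0.0, 0.0
    for a, i in enumerate(ranking):
        for b in range(a + 1, len(ranking)):
            j = ranking[b]
            # DPO-style implicit reward margin: preferred item i over item j.
            margin = beta * ((policy_logps[i] - ref_logps[i])
                             - (policy_logps[j] - ref_logps[j]))
            # DCG-style discount: pairs headed by top positions count more.
            w = 1.0 / math.log2(2 + a)
            loss += -w * math.log(1.0 / (1.0 + math.exp(-margin)))
            total_w += w
    return loss / total_w
```

When the policy matches the reference model the loss sits at log 2 for every pair; raising the preferred item's log-probability relative to the reference lowers it, with top-of-list pairs weighted most heavily.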
Problem

Research questions and friction points this paper is trying to address.

Optimizing LLMs for ranking tasks with sparse feedback
Enhancing ranking performance via positional preference aggregation
Aligning LLMs with flexible in-context ranking preferences
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimizes LLMs using in-context ranking lists
Incorporates item relevance and position feedback
Differentiable objective for discrete ranking metrics