🤖 AI Summary
Supervised fine-tuning (SFT) often degrades the general reasoning capabilities of large language models (LLMs), making it difficult to improve text re-ranking performance and preserve generalization at the same time. To address this, we propose a "Chain-of-Thought (CoT) prompting + two-stage optimization" paradigm: CoT is first integrated into re-ranking fine-tuning, injecting interpretable reasoning paths during SFT, and the model is then aligned with ranking preferences via Direct Preference Optimization (DPO). Evaluated on the TREC Deep Learning 2019 and 2020 benchmarks, our method significantly outperforms RankZephyr; critically, it retains strong general reasoning ability, maintaining high accuracy on MMLU and demonstrating that better re-ranking need not come at the cost of foundational language competence. Our core contribution is a re-ranking fine-tuning framework that jointly optimizes task-specific effectiveness and the preservation of general model capabilities.
📝 Abstract
Large language models (LLMs) have demonstrated remarkable effectiveness in text reranking in works such as RankGPT, leveraging their human-like reasoning about relevance. However, supervised fine-tuning for ranking often diminishes these models' general-purpose capabilities, including the very reasoning abilities that make them valuable for ranking. We introduce a novel approach integrating Chain-of-Thought prompting with an SFT-DPO (Supervised Fine-Tuning followed by Direct Preference Optimization) pipeline to preserve these capabilities while improving ranking performance. Our experiments on the TREC 2019 and 2020 Deep Learning datasets show that our approach outperforms the state-of-the-art RankZephyr while maintaining strong performance on the Massive Multitask Language Understanding (MMLU) benchmark, demonstrating effective preservation of general-purpose capabilities through thoughtful fine-tuning strategies. Our code and data will be publicly released upon acceptance of the paper.
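To make the second stage of the pipeline concrete, here is a minimal sketch of the standard DPO objective (the per-pair loss from Rafailov et al.) applied to a chosen vs. rejected ranking output. This is an illustration of the general technique, not the paper's exact implementation; the function name, log-probability values, and `beta` setting are all assumptions for the example.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair.

    logp_w, logp_l         : summed log-probability of the chosen (w) and
                             rejected (l) ranking output under the policy
                             being trained
    ref_logp_w, ref_logp_l : same quantities under the frozen SFT
                             reference model
    beta                   : temperature controlling how far the policy
                             may drift from the reference
    """
    # Implicit reward margin: how much more strongly the policy prefers
    # the chosen output over the rejected one, relative to the reference.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), written via log1p for numerical stability
    return math.log1p(math.exp(-margin))

# A policy that already favors the chosen ranking more than the
# reference does incurs a lower loss than an indifferent one:
confident = dpo_loss(logp_w=-1.0, logp_l=-5.0,
                     ref_logp_w=-3.0, ref_logp_l=-3.0)
indifferent = dpo_loss(logp_w=-3.0, logp_l=-3.0,
                       ref_logp_w=-3.0, ref_logp_l=-3.0)  # = log(2)
```

In the paper's setting, the SFT model from stage one would serve as the frozen reference, and the preference pairs would contrast better and worse CoT-augmented rankings; averaging this loss over pairs and minimizing it by gradient descent is the DPO alignment step.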