🤖 AI Summary
Existing LLM-based re-ranking models underperform in complex ranking scenarios due to insufficient reasoning-intensive training data.
Method: We propose ReasonRank—a framework integrating automated reasoning data synthesis, self-consistency filtering, and reinforcement learning with a multi-view ranking reward. Specifically, DeepSeek-R1 is leveraged to generate high-quality reasoning-aware ranking labels; the model is then trained in two stages—supervised fine-tuning followed by reinforcement learning with a multi-view ranking reward that jointly optimizes ranking quality and logical coherence.
Contribution/Results: ReasonRank achieves a new state-of-the-art score of 40.6 on the BRIGHT leaderboard, significantly outperforming prior methods while maintaining lower inference latency than the pointwise reranker Rank1. Its core contribution is a reasoning-enhanced training paradigm designed for passage re-ranking that systematically improves both logical reasoning capability and ranking accuracy.
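The self-consistency filtering step can be illustrated with a minimal sketch. The paper's exact agreement measure and threshold are not specified in this summary, so the Kendall-tau-based criterion, the `threshold=0.7` value, and the function names below are illustrative assumptions: the idea is to sample several independent rankings for the same query from the labeling model and keep the example only if they largely agree.

```python
from itertools import combinations

def kendall_tau(a, b):
    """Kendall tau correlation between two rankings of the same items."""
    pos_b = {item: i for i, item in enumerate(b)}
    concordant = discordant = 0
    for x, y in combinations(a, 2):
        # x precedes y in ranking a; check whether b agrees
        if pos_b[x] < pos_b[y]:
            concordant += 1
        else:
            discordant += 1
    pairs = len(a) * (len(a) - 1) / 2
    return (concordant - discordant) / pairs

def self_consistent(rankings, threshold=0.7):
    """Keep a training example only if independently sampled rankings
    agree with each other on average (hypothetical criterion)."""
    taus = [kendall_tau(r1, r2) for r1, r2 in combinations(rankings, 2)]
    return sum(taus) / len(taus) >= threshold
```

Examples whose sampled rankings disagree (low average tau) would be discarded, filtering out queries where the labeler's reasoning is unstable.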
📝 Abstract
Large Language Model (LLM) based listwise ranking has shown superior performance in many passage ranking tasks. With the development of Large Reasoning Models, many studies have demonstrated that step-by-step reasoning at test time helps improve listwise ranking performance. However, due to the scarcity of reasoning-intensive training data, existing rerankers perform poorly in many complex ranking scenarios, and the ranking ability of reasoning-intensive rerankers remains largely underdeveloped. In this paper, we first propose an automated reasoning-intensive training data synthesis framework, which sources training queries and passages from diverse domains and applies DeepSeek-R1 to generate high-quality training labels. A self-consistency data filtering mechanism is designed to ensure data quality. To empower the listwise reranker with strong reasoning ability, we further propose a two-stage post-training approach, which includes a cold-start supervised fine-tuning (SFT) stage for reasoning pattern learning and a reinforcement learning (RL) stage for further ranking ability enhancement. During the RL stage, based on the nature of listwise ranking, we design a multi-view ranking reward, which is more effective than a ranking metric-based reward. Extensive experiments demonstrate that our trained reasoning-intensive reranker **ReasonRank** outperforms existing baselines significantly and also achieves much lower latency than the pointwise reranker Rank1. **Through further experiments, our ReasonRank has achieved state-of-the-art (SOTA) performance of 40.6 on the BRIGHT leaderboard (https://brightbenchmark.github.io/).** Our codes are available at https://github.com/8421BCD/ReasonRank.
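The abstract contrasts the multi-view ranking reward with a single ranking metric-based reward but does not spell out its formula. A minimal sketch of the general idea, assuming (hypothetically) that one view is a listwise metric such as NDCG and another is pairwise ordering accuracy, blended by an illustrative weight `alpha`:

```python
import math

def ndcg_at_k(ranking, relevance, k=10):
    """NDCG@k for a predicted ranking given graded relevance labels."""
    dcg = sum(relevance[doc] / math.log2(i + 2)
              for i, doc in enumerate(ranking[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def pairwise_accuracy(ranking, relevance):
    """Fraction of document pairs ordered consistently with the labels."""
    correct = total = 0
    for i in range(len(ranking)):
        for j in range(i + 1, len(ranking)):
            if relevance[ranking[i]] != relevance[ranking[j]]:
                total += 1
                if relevance[ranking[i]] > relevance[ranking[j]]:
                    correct += 1
    return correct / total if total else 1.0

def multi_view_reward(ranking, relevance, alpha=0.5):
    """Blend a listwise metric view with a pairwise ordering view
    (illustrative combination, not the paper's exact reward)."""
    return (alpha * ndcg_at_k(ranking, relevance)
            + (1 - alpha) * pairwise_accuracy(ranking, relevance))
```

The intuition behind combining views is that a pure metric reward can be sparse and insensitive to partial improvements, while an additional pairwise view gives denser credit during RL.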