🤖 AI Summary
This paper addresses the challenge of scarce explicit supervision labels in ranking tasks by proposing IRanker-3B, the first unified ranking foundation model spanning recommendation, LLM routing, and passage re-ranking. Methodologically, it introduces an iterative elimination-based ranking paradigm that compresses full permutations into linear decision sequences, and combines PPO-based reinforcement learning, dynamic candidate-pool pruning, and multi-scenario joint fine-tuning to enable zero-shot cross-domain generalization. Experiments demonstrate state-of-the-art performance among models of comparable scale across nine benchmarks, with at least a 5% improvement in in-domain ranking accuracy. On out-of-domain general-purpose tasks (GSM8K, IFEval, MathQA), zero-shot accuracy improves by at least 9% over the base LLM. Moreover, the interpretable chain-of-thought reasoning IRanker-3B generates is transferable, enhancing downstream LLMs' reasoning capabilities.
📝 Abstract
Ranking tasks are ubiquitous, encompassing applications such as recommendation systems, LLM routing, and item re-ranking. We propose to unify these tasks using a single ranking foundation model (FM), as this eliminates the need to design a different model for each specific ranking task. However, unlike typical supervised tasks for LLMs, ranking tasks do not have clear labels for supervision, which poses great challenges to developing a ranking FM. To overcome these challenges, we propose IRanker, a ranking FM framework with reinforcement learning (RL) and iterative decoding. Our insight is to decompose the complex ranking task into an iterative decoding process that eliminates the worst candidate from the candidate pool step by step, which significantly reduces the output combinatorial space and better utilizes the limited context length during RL training. We meticulously train and comprehensively evaluate an IRanker-3B model on nine datasets across three scenarios: recommendation, routing, and passage ranking. The results show that a single IRanker-3B achieves state-of-the-art results on several datasets compared to models of similar size, and even surpasses the performance of larger models on certain datasets. We further demonstrate the effectiveness of our RL design and the robustness of the iterative mechanism across different LLM sizes. Moreover, we conducted both in-domain and out-of-domain zero-shot generalization experiments, which showed that IRanker-3B achieved at least a 5% improvement over the base LLM on in-domain ranking tasks. Surprisingly, on out-of-domain generic LLM tasks, IRanker-3B outperformed the base model by at least 9% on GSM8K, IFEval, and MathQA. In addition, the thoughts generated by IRanker-3B during training could further enhance zero-shot LLM performance.
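The iterative elimination idea described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `score_worst` is a hypothetical stand-in for the LLM call that would judge, at each step, how bad a candidate is relative to the remaining pool. The key point is that ranking n candidates takes only n − 1 sequential decisions instead of searching over n! permutations.

```python
def iterative_elimination_rank(query, candidates, score_worst):
    """Rank candidates by repeatedly eliminating the worst one.

    `score_worst(query, candidate, pool)` returns a "badness" score for a
    candidate given the current pool; in IRanker this decision would be
    made by the LLM, here it is an arbitrary callable for illustration.
    """
    pool = list(candidates)
    eliminated = []  # filled worst-first
    while len(pool) > 1:
        # One decoding step: pick the single worst remaining candidate.
        worst = max(pool, key=lambda c: score_worst(query, c, pool))
        pool.remove(worst)  # prune the candidate pool
        eliminated.append(worst)
    eliminated.append(pool[0])  # last survivor is the best candidate
    return eliminated[::-1]     # return the ranking best-first
```

For example, with toy integer candidates where larger means more relevant, `iterative_elimination_rank("q", [3, 1, 2], lambda q, c, pool: -c)` eliminates 1, then 2, and returns the best-first ranking `[3, 2, 1]`.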