🤖 AI Summary
In scientific literature question answering, the re-ranking stage of RAG-LLMs is highly sensitive to subtle term variations, yet its robustness and factual consistency in scientific contexts have not been systematically assessed. To address this gap, we introduce SciRerankBench, the first domain-specific re-ranking benchmark for scientific text, spanning five disciplines and comprising three challenging sample categories: noisy inputs, semantically similar but logically irrelevant passages, and counterfactual statements. We evaluate 13 state-of-the-art re-rankers across five large language model backbones along three dimensions: noise robustness, relevance disambiguation, and factual consistency. Our analysis uncovers critical weaknesses of existing methods in handling scientific terminology, logical coherence, and factual grounding. SciRerankBench establishes a novel evaluation paradigm for re-ranking in RAG pipelines and provides empirically grounded directions for targeted improvement.
📝 Abstract
Scientific literature question answering is a pivotal step towards new scientific discoveries. Recently, *two-stage* retrieval-augmented generation large language models (RAG-LLMs) have shown impressive advancements in this domain. Such a two-stage framework, and especially its second stage (the reranker), is particularly essential in the scientific domain, where subtle differences in terminology can severely degrade the final factual-oriented or knowledge-intensive answers. Despite this significant progress, the potential and limitations of these systems remain unexplored. In this work, we present the Scientific Rerank-oriented RAG Benchmark (SciRerankBench) for evaluating rerankers within RAG-LLM systems, spanning five scientific subjects. To rigorously assess reranker performance in terms of noise resilience, relevance disambiguation, and factual consistency, we develop three types of question-context-answer (Q-C-A) pairs, i.e., Noisy Contexts (NC), Semantically Similar but Logically Irrelevant Contexts (SSLI), and Counterfactual Contexts (CC). Through systematic evaluation of 13 widely used rerankers on five families of LLMs, we provide detailed insights into their relative strengths and limitations. To the best of our knowledge, SciRerankBench is the first benchmark specifically developed to evaluate rerankers within RAG-LLMs, providing valuable observations and guidance for their future development.
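The two-stage pipeline the abstract refers to can be sketched as follows. This is a toy illustration only: the corpus, query, and both scoring functions are invented stand-ins (a real system would use a dense retriever for stage one and a learned cross-encoder reranker for stage two), not the paper's implementation or any of its 13 evaluated rerankers.

```python
def retrieve(query, corpus, k=3):
    """Stage 1: cheap, recall-oriented retrieval.
    Here scored by simple word overlap with the query (toy stand-in)."""
    q_words = set(query.lower().split())
    scored = [(len(q_words & set(doc.lower().split())), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]

def rerank(query, candidates):
    """Stage 2: a finer-grained scorer reorders the candidates.
    A real reranker would score each (query, passage) pair with a
    cross-encoder; here we use a toy term-match heuristic."""
    def score(doc):
        doc_lower = doc.lower()
        return sum(1 for term in query.lower().split() if term in doc_lower)
    return sorted(candidates, key=score, reverse=True)

# Hypothetical mini-corpus: two relevant passages, one passage that shares
# surface terms but is off-topic, and one clearly irrelevant passage.
corpus = [
    "CRISPR-Cas9 enables targeted genome editing in eukaryotic cells.",
    "Cas9 is a protein; this passage discusses unrelated protein folding.",
    "Genome editing with CRISPR-Cas9 has off-target effects.",
    "Weather patterns over the Atlantic influence hurricane formation.",
]
query = "CRISPR-Cas9 genome editing"

candidates = retrieve(query, corpus, k=3)  # recall-oriented shortlist
top = rerank(query, candidates)[0]         # precision-oriented final pick
```

The benchmark's three context types slot into this picture naturally: NC, SSLI, and CC passages are injected into the stage-one candidate list, and a reranker is judged on whether it still places the genuinely relevant passage first.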