SciRerankBench: Benchmarking Rerankers Towards Scientific Retrieval-Augmented Generated LLMs

📅 2025-08-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
In scientific literature question answering, the re-ranking stage of RAG-LLMs is highly sensitive to subtle term variations, yet its robustness and factual consistency in scientific contexts have not been systematically assessed. To address this gap, we introduce SciRerankBench, the first domain-specific re-ranking benchmark for scientific text, spanning five disciplines and comprising three challenging sample categories: noisy inputs, semantically similar but logically unrelated passages, and counterfactual statements. We evaluate 13 state-of-the-art re-rankers across five large language model backbones along three dimensions: noise robustness, relevance disambiguation, and factual consistency. Our analysis uncovers critical weaknesses of existing methods in handling scientific terminology, logical coherence, and factual grounding. SciRerankBench establishes a novel evaluation paradigm for re-ranking in RAG pipelines and provides empirically grounded directions for targeted improvement.

📝 Abstract
Scientific literature question answering is a pivotal step towards new scientific discoveries. Recently, two-stage retrieval-augmented generation large language models (RAG-LLMs) have shown impressive advancements in this domain. Such a two-stage framework, and especially its second stage (the reranker), is particularly essential in the scientific domain, where subtle differences in terminology can greatly degrade the final factual-oriented or knowledge-intensive answers. Despite this significant progress, the potential and limitations of these methods remain unexplored. In this work, we present the Scientific Rerank-oriented RAG Benchmark (SciRerankBench) for evaluating rerankers within RAG-LLM systems, spanning five scientific subjects. To rigorously assess reranker performance in terms of noise resilience, relevance disambiguation, and factual consistency, we develop three types of question-context-answer (Q-C-A) pairs: Noisy Contexts (NC), Semantically Similar but Logically Irrelevant Contexts (SSLI), and Counterfactual Contexts (CC). Through systematic evaluation of 13 widely used rerankers on five families of LLMs, we provide detailed insights into their relative strengths and limitations. To the best of our knowledge, SciRerankBench is the first benchmark specifically developed to evaluate rerankers within RAG-LLMs, and it provides valuable observations and guidance for their future development.
Problem

Research questions and friction points this paper is trying to address.

Evaluating rerankers in scientific RAG-LLMs for noise resilience
Assessing relevance disambiguation in scientific literature retrieval
Testing factual consistency against counterfactual scientific contexts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed SciRerankBench benchmark for rerankers
Created three Q-C-A pair types for evaluation
Evaluated 13 rerankers across five scientific subjects
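The benchmark's core loop, scoring a question against candidate contexts, reranking them, and checking whether the gold context surfaces, can be sketched minimally. The lexical-overlap scorer, the `hit_at_k` metric, and the sample Q-C-A pair below are illustrative assumptions standing in for the paper's actual cross-encoder rerankers and curated data:

```python
# Minimal sketch of a reranker evaluation on a Q-C-A pair.
# The overlap scorer is a hypothetical stand-in for a real reranker model.

def overlap_score(query: str, passage: str) -> float:
    """Toy relevance score: fraction of query tokens found in the passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0

def rerank(query, passages, scorer=overlap_score):
    """Stage two of the RAG pipeline: reorder candidates, best first."""
    return sorted(passages, key=lambda p: scorer(query, p), reverse=True)

def hit_at_k(query, passages, gold, k=1, scorer=overlap_score):
    """1.0 if the gold context appears in the top-k after reranking."""
    return float(gold in rerank(query, passages, scorer)[:k])

# Illustrative pair in the spirit of the Noisy Contexts (NC) category:
query = "What enzyme unwinds the DNA double helix during replication?"
gold = "DNA helicase unwinds the DNA double helix during replication."
candidates = [
    "The double helix model was proposed in 1953.",  # noisy distractor
    "Replication forks move bidirectionally.",       # noisy distractor
    gold,
]
print(hit_at_k(query, candidates, gold, k=1))  # → 1.0
```

A real harness would swap `overlap_score` for a learned reranker and aggregate hit rates separately over the NC, SSLI, and CC pair types to get the three robustness dimensions.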
👥 Authors
- Haotian Chen (University of California, Los Angeles)
- Qingqing Long (Computer Network Information Center, Chinese Academy of Sciences)
- Meng Xiao (Computer Network Information Center, Chinese Academy of Sciences)
- Xiao Luo (Peking University)
- Wei Ju (Peking University)
- Chengrui Wang (Alibaba Group)
- Xuezhi Wang (Research Scientist, Google DeepMind)
- Yuanchun Zhou (Computer Network Information Center, Chinese Academy of Sciences)
- Hengshu Zhu (Computer Network Information Center, Chinese Academy of Sciences)