🤖 AI Summary
This work addresses the vulnerability of existing retrieval and reranking methods in scientific question answering to semantically similar but logically irrelevant (SSLI) passages, which often lead to factual errors and hallucinations. We present the first systematic identification and empirical validation of the SSLI problem across both stages of retrieval-augmented generation (RAG). To mitigate this issue, we propose DeepEra, a deep evidence reranking framework that leverages stepwise reasoning to move beyond superficial semantic matching and assess the logical consistency of candidate evidence. We introduce SciRAG-SSLI, a large-scale evaluation benchmark spanning ten scientific disciplines with 300,000 question-answer pairs, along with an evaluation protocol incorporating human-crafted distractors. Experimental results demonstrate that DeepEra significantly outperforms state-of-the-art rerankers in both logical relevance and factual reliability.
📝 Abstract
With the rapid growth of scientific literature, scientific question answering (SciQA) has become increasingly critical for exploring and utilizing scientific knowledge. Retrieval-Augmented Generation (RAG) enhances LLMs by incorporating knowledge from external sources, thereby providing credible evidence for scientific question answering. However, existing retrieval and reranking methods remain vulnerable to passages that are semantically similar but logically irrelevant, often reducing factual reliability and amplifying hallucinations. To address this challenge, we propose a Deep Evidence Reranking Agent (DeepEra) that integrates step-by-step reasoning, enabling more precise evaluation of candidate passages beyond surface-level semantics. To support systematic evaluation, we construct SciRAG-SSLI (Scientific RAG - Semantically Similar but Logically Irrelevant), a large-scale dataset comprising about 300K SciQA instances across 10 subjects, built from a corpus of 10M scientific documents. The dataset combines naturally retrieved contexts with systematically generated distractors to test logical robustness and factual grounding. Comprehensive evaluations confirm that our approach achieves superior retrieval performance compared to leading rerankers. To our knowledge, this work is the first to comprehensively study and empirically validate the non-negligible SSLI problem in two-stage RAG frameworks.
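The two-stage setup described above can be sketched as follows. This is a minimal illustration, not the paper's method: `reason_score` is a hypothetical stand-in for an LLM's step-by-step logical judgment, and the keyword heuristic inside it exists only to make the SSLI failure mode concrete (a high-similarity passage that cannot actually answer the question gets demoted).

```python
# Sketch of reasoning-aware evidence reranking in a two-stage RAG pipeline:
# stage 1 retrieves candidates by semantic similarity; stage 2 reranks them
# by a logical-relevance judgment, so "semantically similar but logically
# irrelevant" (SSLI) passages fall below genuinely supportive evidence.
from dataclasses import dataclass


@dataclass
class Passage:
    text: str
    sim: float  # first-stage semantic similarity score


def reason_score(question: str, passage: Passage) -> float:
    # Hypothetical placeholder for stepwise LLM reasoning: a real system
    # would prompt the model to check whether the passage logically entails
    # an answer. Here we just test whether the question's key term appears.
    key = question.split()[-1].strip("?").lower()
    return 1.0 if key in passage.text.lower() else 0.0


def rerank(question: str, candidates: list[Passage], k: int = 2) -> list[Passage]:
    # Logical judgment dominates; semantic similarity only breaks ties,
    # so an SSLI distractor cannot outrank relevant evidence on similarity alone.
    ranked = sorted(
        candidates,
        key=lambda p: (reason_score(question, p), p.sim),
        reverse=True,
    )
    return ranked[:k]


question = "What causes aurora?"
candidates = [
    Passage("Sunsets paint polar skies with similar colors.", sim=0.93),  # SSLI distractor
    Passage("Charged solar particles cause aurora displays.", sim=0.88),  # true evidence
]
top = rerank(question, candidates, k=1)
```

Despite its higher first-stage similarity (0.93 vs. 0.88), the distractor is rejected because the reasoning judgment scores it as logically irrelevant.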