DeepEra: A Deep Evidence Reranking Agent for Scientific Retrieval-Augmented Generation Question Answering

📅 2026-01-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the vulnerability of existing retrieval and reranking methods in scientific question answering to semantically similar but logically irrelevant (SSLI) passages, which often lead to factual errors and hallucinations. We present the first systematic identification and empirical validation of the SSLI problem across both stages of retrieval-augmented generation (RAG). To mitigate this issue, we propose DeepEra, a deep evidence reranking framework that leverages stepwise reasoning to move beyond superficial semantic matching and assess the logical consistency of candidate evidence. We introduce SciRAG-SSLI, a large-scale evaluation benchmark spanning ten scientific disciplines with 300,000 question-answer pairs, along with an evaluation protocol incorporating human-crafted distractors. Experimental results demonstrate that DeepEra significantly outperforms state-of-the-art rerankers in both logical relevance and factual reliability.

📝 Abstract
With the rapid growth of scientific literature, scientific question answering (SciQA) has become increasingly critical for exploring and utilizing scientific knowledge. Retrieval-Augmented Generation (RAG) enhances LLMs by incorporating knowledge from external sources, thereby providing credible evidence for scientific question answering. However, existing retrieval and reranking methods remain vulnerable to passages that are semantically similar but logically irrelevant, often reducing factual reliability and amplifying hallucinations. To address this challenge, we propose a Deep Evidence Reranking Agent (DeepEra) that integrates step-by-step reasoning, enabling more precise evaluation of candidate passages beyond surface-level semantics. To support systematic evaluation, we construct SciRAG-SSLI (Scientific RAG - Semantically Similar but Logically Irrelevant), a large-scale dataset comprising about 300K SciQA instances across 10 subjects, built from a corpus of 10M scientific documents. The dataset combines naturally retrieved contexts with systematically generated distractors to test logical robustness and factual grounding. Comprehensive evaluations confirm that our approach achieves superior retrieval performance compared to leading rerankers. To our knowledge, this work is the first to comprehensively study and empirically validate the non-negligible SSLI issue in two-stage RAG frameworks.
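The abstract describes a two-stage pipeline: stage-one retrieval ranks passages by semantic similarity, and a stage-two reranker applies stepwise reasoning to judge logical relevance. The sketch below illustrates that general idea only; the class names, scoring functions, and mixing weight are hypothetical, and the lexical-overlap "reasoning" step is a runnable stand-in for the LLM judgment the paper actually uses.

```python
# Hypothetical sketch of reasoning-based evidence reranking in a two-stage
# RAG pipeline. NOT the paper's implementation: the reasoning judgment is a
# toy lexical-overlap proxy so the example runs without an LLM.
from dataclasses import dataclass


@dataclass
class Passage:
    text: str
    sim: float  # stage-1 semantic-similarity score in [0, 1]


def reasoning_judgment(question: str, passage: Passage) -> float:
    """Placeholder for a stepwise judgment of logical relevance.

    A real reranker would prompt an LLM to reason step by step about
    whether the passage actually supports answering the question; here
    we approximate that with term overlap purely for illustration.
    """
    q_terms = set(question.lower().split())
    p_terms = set(passage.text.lower().split())
    return len(q_terms & p_terms) / max(len(q_terms), 1)


def rerank(question: str, candidates: list[Passage], alpha: float = 0.5) -> list[Passage]:
    """Rescore candidates by mixing similarity with the reasoning judgment."""
    def score(p: Passage) -> float:
        return alpha * p.sim + (1 - alpha) * reasoning_judgment(question, p)
    return sorted(candidates, key=score, reverse=True)


if __name__ == "__main__":
    q = "what enzyme unwinds DNA during replication"
    cands = [
        # SSLI-style distractor: semantically close, logically off-target
        Passage("DNA polymerase synthesizes new strands during replication", 0.92),
        # Logically relevant evidence with a lower similarity score
        Passage("helicase is the enzyme that unwinds DNA during replication", 0.88),
    ]
    print(rerank(q, cands)[0].text)  # the helicase passage wins after reranking
```

Under this toy scoring, the distractor that wins on raw similarity is demoted once the logical-relevance term is mixed in, which is the behavior the paper attributes to reasoning-based reranking.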
Problem

Research questions and friction points this paper is trying to address.

Scientific Question Answering
Retrieval-Augmented Generation
Semantically Similar but Logically Irrelevant
Evidence Reranking
Factual Reliability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval-Augmented Generation
Evidence Reranking
Logical Relevance
Scientific Question Answering
Hallucination Mitigation
👥 Authors
Haotian Chen (University of California, Los Angeles): Political Economy, Non-market Strategy, American Politics
Qingqing Long (Computer Network Information Center, Chinese Academy of Sciences)
Siyu Pu (Computer Network Information Center, Chinese Academy of Sciences)
Xiao Luo (University of Wisconsin–Madison): Machine Learning, LLM, ML for Science, Statistical Modeling, Bioinformatics
Wei Ju (Peking University)
Meng Xiao (Computer Network Information Center, Chinese Academy of Sciences)
Yuanchun Zhou (Computer Network Information Center, Chinese Academy of Sciences): Data Mining, Big Data Analysis
Jianghua Zhao (Computer Network Information Center, Chinese Academy of Sciences)
Xuezhi Wang (Research Scientist, Google DeepMind): Machine Learning, Natural Language Processing