🤖 AI Summary
This work addresses the vulnerability of existing retrieval and reranking methods in scientific question answering to semantically similar but logically irrelevant (SSLI) passages, which often lead to factual errors and hallucinations. We present the first systematic identification and empirical validation of the SSLI problem across both stages of retrieval-augmented generation (RAG). To mitigate this issue, we propose DeepEra, a deep evidence reranking framework that leverages stepwise reasoning to move beyond superficial semantic matching and assess the logical consistency of candidate evidence. We introduce SciRAG-SSLI, a large-scale evaluation benchmark spanning ten scientific disciplines with 300,000 question-answer pairs, along with an evaluation protocol incorporating human-crafted distractors. Experimental results demonstrate that DeepEra significantly outperforms state-of-the-art rerankers in both logical relevance and factual reliability.
📝 Abstract
With the rapid growth of scientific literature, scientific question answering (SciQA) has become increasingly critical for exploring and utilizing scientific knowledge. Retrieval-Augmented Generation (RAG) enhances LLMs by incorporating knowledge from external sources, thereby providing credible evidence for scientific question answering. However, existing retrieval and reranking methods remain vulnerable to passages that are semantically similar but logically irrelevant, often reducing factual reliability and amplifying hallucinations. To address this challenge, we propose a Deep Evidence Reranking Agent (DeepEra) that integrates step-by-step reasoning, enabling more precise evaluation of candidate passages beyond surface-level semantics. To support systematic evaluation, we construct SciRAG-SSLI (Scientific RAG - Semantically Similar but Logically Irrelevant), a large-scale dataset comprising about 300K SciQA instances across 10 subjects, built from a corpus of 10M scientific documents. The dataset combines naturally retrieved contexts with systematically generated distractors to test logical robustness and factual grounding. Comprehensive evaluations confirm that our approach achieves superior retrieval performance compared to leading rerankers. To our knowledge, this work is the first to comprehensively study and empirically validate the non-negligible SSLI problem in two-stage RAG frameworks.
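The two-stage setup described above can be sketched as follows. This is a minimal illustration, not the paper's method: `reason_score` is a hypothetical stand-in for an LLM's step-by-step logical judgment, and the keyword heuristic inside it exists only to make the SSLI failure mode concrete (a high-similarity passage that cannot actually answer the question gets demoted).

```python
# Sketch of reasoning-aware evidence reranking in a two-stage RAG pipeline:
# stage 1 retrieves candidates by semantic similarity; stage 2 reranks them
# by a logical-relevance judgment, so "semantically similar but logically
# irrelevant" (SSLI) passages fall below genuinely supportive evidence.
from dataclasses import dataclass


@dataclass
class Passage:
    text: str
    sim: float  # first-stage semantic similarity score


def reason_score(question: str, passage: Passage) -> float:
    # Hypothetical placeholder for stepwise LLM reasoning: a real system
    # would prompt the model to check whether the passage logically entails
    # an answer. Here we just test whether the question's key term appears.
    key = question.split()[-1].strip("?").lower()
    return 1.0 if key in passage.text.lower() else 0.0


def rerank(question: str, candidates: list[Passage], k: int = 2) -> list[Passage]:
    # Logical judgment dominates; semantic similarity only breaks ties,
    # so an SSLI distractor cannot outrank relevant evidence on similarity alone.
    ranked = sorted(
        candidates,
        key=lambda p: (reason_score(question, p), p.sim),
        reverse=True,
    )
    return ranked[:k]


question = "What causes aurora?"
candidates = [
    Passage("Sunsets paint polar skies with similar colors.", sim=0.93),  # SSLI distractor
    Passage("Charged solar particles cause aurora displays.", sim=0.88),  # true evidence
]
top = rerank(question, candidates, k=1)
```

Despite its higher first-stage similarity (0.93 vs. 0.88), the distractor is rejected because the reasoning judgment scores it as logically irrelevant.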