Retrieval-Infused Reasoning Sandbox: A Benchmark for Decoupling Retrieval and Reasoning Capabilities

📅 2026-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current evaluations struggle to disentangle large language models' reasoning from their retrieval capabilities on novel scientific questions, and are susceptible to interference from parametric memorization and data distribution shifts. To address this, this work introduces the DeR2 sandbox benchmark, which decouples retrieval and reasoning through four evidence-access conditions (instruction-only, concepts-only, relevant documents only, and full document set) while preserving the core difficulties of deep search: multi-step synthesis, noise robustness, and evidence-driven reasoning. DeR2 employs a frozen corpus of theoretical papers (2023-2025), expert-annotated concepts, and verifiable reasoning chains, complemented by a two-phase validation protocol that prevents parametric leakage. It further introduces regime-gap metrics that quantify retrieval loss and reasoning loss separately. Experiments reveal significant deficiencies in mainstream models' structured reasoning, including mode-switch fragility and conceptual misuse, highlighting substantial room for improvement.
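The two-phase validation described above can be sketched as a simple filter predicate. This is an illustrative reconstruction, not the paper's implementation; the function and argument names are assumptions.

```python
# Hedged sketch of DeR2's two-phase instance validation, as described in the
# summary. The predicate and its inputs are illustrative assumptions: in
# practice each flag would come from running a validator model on the instance.

def passes_two_phase_validation(solves_without_evidence: bool,
                                solves_with_gold_concepts: bool) -> bool:
    # Phase 1: the model must FAIL under Instruction-only access, i.e. the
    # answer cannot be recovered from parametric memory alone.
    if solves_without_evidence:
        return False  # parametric leakage: discard the instance
    # Phase 2: the instance must be solvable once gold (oracle) concepts are
    # supplied, ruling out unanswerable or under-specified questions.
    return solves_with_gold_concepts

# A kept instance fails from memory but succeeds with oracle concepts:
keep = passes_two_phase_validation(False, True)
# A leaked instance is rejected even if it is otherwise solvable:
leaked = passes_two_phase_validation(True, True)
```

Only instances where `keep` is true would enter the benchmark under this reading.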

📝 Abstract
Despite strong performance on existing benchmarks, it remains unclear whether large language models can reason over genuinely novel scientific information. Most evaluations score end-to-end RAG pipelines, where reasoning is confounded with retrieval and toolchain choices, and the signal is further contaminated by parametric memorization and open-web volatility. We introduce DeR2, a controlled deep-research sandbox that isolates document-grounded reasoning while preserving core difficulties of deep search: multi-step synthesis, denoising, and evidence-based conclusion making. DeR2 decouples evidence access from reasoning via four regimes--Instruction-only, Concepts (gold concepts without documents), Related-only (only relevant documents), and Full-set (relevant documents plus topically related distractors)--yielding interpretable regime gaps that operationalize retrieval loss vs. reasoning loss and enable fine-grained error attribution. To prevent parametric leakage, we apply a two-phase validation that requires parametric failure without evidence while ensuring oracle-concept solvability. To ensure reproducibility, each instance provides a frozen document library (drawn from 2023-2025 theoretical papers) with expert-annotated concepts and validated rationales. Experiments across a diverse set of state-of-the-art foundation models reveal substantial variation and significant headroom: some models exhibit mode-switch fragility, performing worse with the Full-set than with Instruction-only, while others show structural concept misuse, correctly naming concepts but failing to execute them as procedures.
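The regime gaps mentioned in the abstract can be illustrated with a short sketch. The exact gap definitions are the paper's; the pairings and accuracy numbers below are assumptions chosen to match the abstract's description (retrieval loss from adding distractors, reasoning loss from replacing gold concepts with raw documents, and the mode-switch fragility observation).

```python
# Hedged sketch: one plausible operationalization of DeR2's regime gaps.
# The accuracy values are invented for illustration, not results from the paper.

def regime_gaps(acc: dict) -> dict:
    """acc maps each evidence regime to a model's accuracy under it."""
    return {
        # Drop from Related-only to Full-set: cost of filtering topically
        # related distractors, i.e. retrieval loss.
        "retrieval_loss": acc["related_only"] - acc["full_set"],
        # Drop from gold Concepts to Related-only: cost of extracting and
        # applying concepts from raw documents, i.e. reasoning loss.
        "reasoning_loss": acc["concepts"] - acc["related_only"],
        # Mode-switch fragility: positive when Full-set scores BELOW
        # Instruction-only, as the abstract reports for some models.
        "mode_switch_fragility": acc["instruction_only"] - acc["full_set"],
    }

acc = {"instruction_only": 0.22, "concepts": 0.61,
       "related_only": 0.48, "full_set": 0.19}
gaps = regime_gaps(acc)
```

With these illustrative numbers, a positive `mode_switch_fragility` reproduces the reported failure pattern: the model does worse with the full document set than with no evidence at all.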
Problem

Research questions and friction points this paper is trying to address.

retrieval
reasoning
large language models
scientific reasoning
benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

retrieval-reasoning decoupling
controlled benchmark
document-grounded reasoning
parametric leakage prevention
fine-grained error attribution
👥 Authors
Shuangshuang Ying
Zheyu Wang
Yunjian Peng
Jin Chen
Yuhao Wu
Hongbin Lin
Dingyu He
Siyi Liu · Hong Kong University of Science and Technology (Guangzhou) · Recommender Systems, Information Retrieval
Gengchen Yu
Yinzhu Piao
Yuchen Wu
Xin Gui
Zhongyuan Peng · Fudan University · LLM
Xin Li
Xeron Du
Libo Qin
YiXin Cao
Ge Zhang
Stephen Huang