🤖 AI Summary
Current evaluations struggle to disentangle large language models' reasoning from their retrieval capabilities on novel scientific questions, and are susceptible to interference from parametric memorization and data distribution shifts. To address this, this work introduces the DeR2 sandbox benchmark, which decouples retrieval from reasoning through four evidence-access conditions—instruction-only, concept-only, relevant documents, and full document set—targeting core challenges such as multi-step synthesis, noise robustness, and evidence-driven reasoning. DeR2 employs a frozen corpus of theoretical papers (2023–2025), expert-annotated concepts, and verifiable reasoning chains, complemented by a two-stage validation protocol to prevent parametric leakage. It further introduces, for the first time, a mechanism-gap metric that quantifies retrieval loss and reasoning loss separately. Experiments reveal significant deficiencies in mainstream models' structured reasoning, including mode-switch fragility and concept misuse, highlighting substantial room for improvement.
📝 Abstract
Despite strong performance on existing benchmarks, it remains unclear whether large language models can reason over genuinely novel scientific information. Most evaluations score end-to-end RAG pipelines, where reasoning is confounded with retrieval and toolchain choices, and the signal is further contaminated by parametric memorization and open-web volatility. We introduce DeR2, a controlled deep-research sandbox that isolates document-grounded reasoning while preserving the core difficulties of deep search: multi-step synthesis, denoising, and drawing evidence-based conclusions. DeR2 decouples evidence access from reasoning via four regimes--Instruction-only, Concepts (gold concepts without documents), Related-only (only relevant documents), and Full-set (relevant documents plus topically related distractors)--yielding interpretable regime gaps that operationalize retrieval loss vs. reasoning loss and enable fine-grained error attribution. To prevent parametric leakage, we apply a two-phase validation that requires parametric failure without evidence while ensuring oracle-concept solvability. To ensure reproducibility, each instance provides a frozen document library (drawn from 2023-2025 theoretical papers) with expert-annotated concepts and validated rationales. Experiments across a diverse set of state-of-the-art foundation models reveal substantial variation and significant headroom: some models exhibit mode-switch fragility, performing worse with the Full-set than with Instruction-only, while others show structural concept misuse, correctly naming concepts but failing to execute them as procedures.
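To make the regime-gap idea concrete, the sketch below shows one plausible way to turn per-regime accuracies into interpretable gaps. This is an illustrative operationalization, not the paper's exact metric definitions: the regime names, the `regime_gaps` helper, and the specific gap formulas (evidence gain, retrieval loss as the clean-vs-distractor drop, reasoning loss as the shortfall under oracle concepts) are all assumptions for exposition.

```python
def regime_gaps(acc: dict[str, float]) -> dict[str, float]:
    """Derive interpretable gaps from per-regime accuracies in [0, 1].

    `acc` maps an evidence-access regime to a model's accuracy under it.
    NOTE: these gap definitions are an assumed illustration of the
    paper's retrieval-loss vs. reasoning-loss decomposition, not the
    benchmark's official formulas.
    """
    return {
        # How much clean, relevant evidence helps over no evidence at all.
        "evidence_gain": acc["related_only"] - acc["instruction_only"],
        # Retrieval/denoising loss: drop caused by topically related
        # distractors (Full-set) relative to only-relevant documents.
        "retrieval_loss": acc["related_only"] - acc["full_set"],
        # Reasoning loss: shortfall that remains even when gold
        # concepts are handed to the model directly.
        "reasoning_loss": 1.0 - acc["concepts"],
    }


# Hypothetical accuracies for a single model across the four regimes.
example = {
    "instruction_only": 0.20,
    "concepts": 0.70,
    "related_only": 0.60,
    "full_set": 0.45,
}
gaps = regime_gaps(example)
```

A negative `retrieval_loss` under this scheme would flag the mode-switch fragility the abstract describes: the model does worse with more (noisier) evidence than with less.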