🤖 AI Summary
Existing benchmarks struggle to evaluate large language models’ cross-disciplinary compositional reasoning in authentic interactive scientific settings. To address this gap, this work introduces XDomainBench, the first diagnostic benchmark specifically designed for interactive cross-disciplinary scientific reasoning, which spans 20 disciplines, 4 task types, and 8 interaction trajectory patterns. By formalizing composition order and domain-mixture structure, XDomainBench enables systematic stress testing of models’ knowledge synthesis abilities. Experiments across 8,598 interactive sessions reveal that as composition order increases, models commonly suffer from error accumulation, reasoning breaks, and domain confusion, culminating in session-level reasoning collapse. The study further identifies, for the first time, two underlying causes of this collapse: the direct increase in difficulty induced by domain composition, and indirect, interaction-amplified failures triggered by trajectory patterns.
📝 Abstract
Large Language Models (LLMs) are increasingly deployed for knowledge synthesis, yet their capacity for compositional generalization over scientific knowledge remains under-characterized. Existing benchmarks focus primarily on restricted single-turn scenarios and fail to capture the capability boundaries exposed by real-world interactive scientific workflows. To address this, we introduce XDomainBench, a diagnostic benchmark for interactive interdisciplinary scientific reasoning. We formalize composition order and mixture structure to enable systematic stress-testing from single-discipline to interdisciplinary settings. The benchmark comprises 8,598 interactive sessions across 20 domains and 4 task categories, with 8 realistic trajectory patterns that cover difficulty and domain-mixture dynamics and simulate real AI for Science (AI4S) scenarios. Large-scale evaluation of LLMs reveals a systematic reasoning collapse as composition order increases, stemming from two root causes: (i) direct increases in difficulty induced by domain composition, and (ii) indirect, interaction-amplified failures in which trajectory patterns trigger error accumulation, reasoning breaks, and domain confusion, ultimately leading to session collapse.