ARCHE: A Novel Task to Evaluate LLMs on Latent Reasoning Chain Extraction

📅 2025-11-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates whether large language models (LLMs) genuinely internalize the core logical paradigms underpinning scientific reasoning: deduction, induction, and abduction. Method: We introduce ARCHE, a novel task requiring models to parse scientific arguments into standardized Reasoning Logic Trees (RLTs) and explicitly annotate each inference step with its Peircean type. To support evaluation, we construct ARCHE Bench, a logic-aware benchmark for scientific text featuring two logic-sensitive metrics: Entity Coverage (EC) for content completeness and Reasoning Edge Accuracy (REA) for step-by-step logical validity. The benchmark is grounded in a high-quality, manually annotated dataset derived from 70 *Nature Communications* papers, comprising more than 1,900 references and 38,000 argumentative claims. Results: Evaluation across 10 state-of-the-art LLMs reveals a pervasive trade-off between logical validity and content completeness; none achieves both full structural fidelity and accurate logical classification of RLTs, exposing a fundamental gap between current LLM capabilities and authentic scientific reasoning.

📝 Abstract
Large language models (LLMs) are increasingly used in scientific domains. While they can produce reasoning-like content via methods such as chain-of-thought prompting, these outputs are typically unstructured and informal, obscuring whether models truly understand the fundamental reasoning paradigms that underpin scientific inference. To address this, we introduce a novel task named Latent Reasoning Chain Extraction (ARCHE), in which models must decompose complex reasoning arguments into combinations of standard reasoning paradigms in the form of a Reasoning Logic Tree (RLT). In an RLT, every reasoning step is explicitly categorized as one of three variants of Peirce's fundamental inference modes: deduction, induction, or abduction. To facilitate this task, we release ARCHE Bench, a new benchmark derived from 70 Nature Communications articles, including more than 1,900 references and 38,000 viewpoints. We propose two logic-aware evaluation metrics: Entity Coverage (EC) for content completeness and Reasoning Edge Accuracy (REA) for step-by-step logical validity. Evaluations of 10 leading LLMs on ARCHE Bench reveal that models exhibit a trade-off between REA and EC, and none is yet able to extract a complete and standard reasoning chain. These findings highlight a substantial gap between the abilities of current reasoning models and the rigor required for scientific argumentation.
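
The RLT format described above lends itself to a simple tree representation. Below is a minimal sketch of how such a tree and its Peircean step labels might be encoded; the class and field names (`InferenceMode`, `RLTNode`, `claim`, `premises`) are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class InferenceMode(Enum):
    """Peirce's three fundamental inference modes used to label RLT steps."""
    DEDUCTION = "deduction"
    INDUCTION = "induction"
    ABDUCTION = "abduction"


@dataclass
class RLTNode:
    """One claim (viewpoint) in a Reasoning Logic Tree (hypothetical schema).

    A node stores a claim, the premises it is inferred from, and the
    Peircean mode of the inference step; leaf premises carry no mode.
    """
    claim: str
    mode: Optional[InferenceMode] = None
    premises: List["RLTNode"] = field(default_factory=list)


# Toy example: a hypothesis abduced from an observation and a general rule.
observation = RLTNode(claim="Enzyme activity drops sharply above 45 °C.")
rule = RLTNode(claim="Proteins lose function when they denature.")
hypothesis = RLTNode(
    claim="The enzyme likely denatures near 45 °C.",
    mode=InferenceMode.ABDUCTION,
    premises=[observation, rule],
)
```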
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' ability to extract structured reasoning chains from scientific arguments
Assessing whether models understand fundamental inference modes like deduction and induction
Measuring the gap between current reasoning models and scientific argumentation rigor
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces the Latent Reasoning Chain Extraction (ARCHE) task
Proposes the Reasoning Logic Tree (RLT) annotated with Peircean inference modes
Develops the logic-aware metrics Entity Coverage (EC) and Reasoning Edge Accuracy (REA), sketched below
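
As a rough illustration of these two metrics, the sketch below scores a predicted tree against a gold annotation: EC as the fraction of gold entities recovered, and REA as the fraction of gold reasoning edges reproduced with the correct Peircean label. The paper's exact definitions may differ; everything here, including treating an edge as a (premise, conclusion, mode) triple, is an assumption.

```python
from typing import Set, Tuple

# Hypothetical encoding: a reasoning edge is (premise, conclusion, mode label).
Edge = Tuple[str, str, str]


def entity_coverage(gold_entities: Set[str], pred_entities: Set[str]) -> float:
    """Assumed EC: share of gold entities that appear in the prediction."""
    if not gold_entities:
        return 1.0
    return len(gold_entities & pred_entities) / len(gold_entities)


def reasoning_edge_accuracy(gold_edges: Set[Edge], pred_edges: Set[Edge]) -> float:
    """Assumed REA: share of gold edges recovered with the correct premise,
    conclusion, and Peircean mode label."""
    if not gold_edges:
        return 1.0
    return len(gold_edges & pred_edges) / len(gold_edges)


gold = {("observation A", "hypothesis H", "abduction")}
pred = {("observation A", "hypothesis H", "deduction")}  # right edge, wrong mode
print(reasoning_edge_accuracy(gold, pred))  # 0.0: mislabeled steps get no credit
```

Under this reading, a model can score high EC by copying every entity into a flat tree while still scoring low REA, which is consistent with the trade-off the evaluation reports.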
👥 Authors
Pengze Li
Artificial Intelligence Innovation and Incubation Institute of Fudan University

Jiaqi Liu
Shanghai Artificial Intelligence Laboratory

Junchi Yu
University of Oxford
information theory, foundation models, graph learning

Lihao Liu
Amazon
LLM-based Agent, Healthcare AI

Mingyu Ding
Assistant Professor, UNC Chapel Hill
Robotics, Embodied AI, Computer Vision

Wanli Ouyang
Shanghai Artificial Intelligence Laboratory, The Chinese University of Hong Kong

Shixiang Tang
Shanghai Artificial Intelligence Laboratory, The Chinese University of Hong Kong

Xi Chen
Artificial Intelligence Innovation and Incubation Institute of Fudan University, Shanghai Academy of AI for Science