The Knowledge-Reasoning Dissociation: Fundamental Limitations of LLMs in Clinical Natural Language Inference

📅 2025-08-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work identifies a critical “knowledge–reasoning dissociation” in large language models (LLMs) for clinical natural language inference (NLI): the same models achieve near-ceiling factual knowledge accuracy (mean 0.918) while their structured logical reasoning remains poor (mean accuracy 0.25). Method: The authors introduce GKMRV (Ground Knowledge and Meta-Level Reasoning Verification), a decoupled evaluation framework for clinical NLI that explicitly separates failures of factual access from failures of logical inference. Each item in a new clinical trial NLI benchmark, spanning four reasoning categories, is paired with a GKMRV probe, and six state-of-the-art LLMs are evaluated under both direct and chain-of-thought prompting. Contribution/Results: The dissociation holds across all models and prompting strategies, and model outputs are highly consistent across samples (mean 0.87), pointing to systematic heuristics rather than random error. GKMRV thus provides a reliability-assessment paradigm for high-stakes domains, offering both a theoretical foundation and empirical pathways toward more interpretable, clinically deployable LLMs.
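To make the decoupled setup concrete, here is a minimal sketch (not the authors' code) of how a GKMRV-style evaluation could be scored: each benchmark item carries both a reasoning task and a paired ground-knowledge probe, and the two accuracies are computed separately for the same model. The `Item` fields, the `ask` callable, and `decoupled_eval` are illustrative names, not part of the paper.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Item:
    """One benchmark item: a reasoning task plus its paired knowledge probe."""
    reasoning_prompt: str   # the main clinical NLI question
    knowledge_prompt: str   # the GKMRV probe testing only the required facts
    reasoning_gold: str     # e.g. "entailment" / "contradiction"
    knowledge_gold: str     # gold answer for the probe

def decoupled_eval(items: list[Item], ask: Callable[[str], str]) -> dict[str, float]:
    """Score knowledge access and reasoning separately for the same model.

    `ask` stands in for whatever LLM call is used (direct or
    chain-of-thought prompting); it maps a prompt to a parsed answer.
    """
    k_hits = sum(ask(it.knowledge_prompt) == it.knowledge_gold for it in items)
    r_hits = sum(ask(it.reasoning_prompt) == it.reasoning_gold for it in items)
    n = len(items)
    return {
        "knowledge_accuracy": k_hits / n,   # paper reports a mean of ~0.918
        "reasoning_accuracy": r_hits / n,   # paper reports a mean of ~0.25
    }
```

Running this loop once per (model, prompting mode) pair over six models would surface the kind of knowledge/reasoning gap the summary reports.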

📝 Abstract
Large language models are often assumed to acquire increasingly structured, generalizable internal representations simply by scaling data and parameters. We interrogate this assumption by introducing a Clinical Trial Natural Language Inference benchmark comprising four reasoning families: Causal Attribution, Compositional Grounding, Epistemic Verification, and Risk State Abstraction. Each item is paired with a targeted Ground Knowledge and Meta-Level Reasoning Verification (GKMRV) probe, allowing us to dissociate failures of factual access from failures of inference. We evaluate six contemporary LLMs under both direct and chain-of-thought prompting. Models achieve near-ceiling GKMRV accuracy (mean accuracy 0.918) yet perform poorly on the main reasoning tasks (mean accuracy 0.25). Despite low accuracy, output inferences are highly consistent across samples (mean 0.87), indicating the systematic application of underlying heuristics and shortcuts. These results reveal fundamental structural and representational limitations: current LLMs often possess the relevant clinical knowledge but lack the structured, composable internal representations needed to deploy it reliably (e.g., integrating constraints, weighing evidence, or simulating counterfactuals). Decoupling knowledge from reasoning with GKMRV makes this dissociation explicit and measurable, providing an effective framework for probing the reliability of LLMs in high-stakes domains.
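The abstract's consistency figure (mean 0.87) suggests the models apply their shortcuts systematically rather than guessing at random. Assuming consistency is defined as agreement among repeated samples (the paper's exact definition is not given here), one plausible formalization is the modal-agreement rate sketched below; `self_consistency`, `ask`, and `k` are hypothetical names.

```python
from collections import Counter
from typing import Callable

def self_consistency(prompts: list[str], ask: Callable[[str], str], k: int = 10) -> float:
    """Mean fraction of k sampled answers that match each item's modal answer.

    A value near 1.0 means the model returns the same inference almost every
    time, even when that inference is wrong: the signature of a
    systematically applied heuristic rather than random error.
    """
    scores = []
    for p in prompts:
        answers = [ask(p) for _ in range(k)]            # k independent samples
        _, modal_count = Counter(answers).most_common(1)[0]
        scores.append(modal_count / k)
    return sum(scores) / len(scores)                     # paper reports ~0.87
```
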
Problem

Research questions and friction points this paper is trying to address.

Assessing LLMs' limitations in clinical natural language inference
Dissociating knowledge access failures from reasoning failures in LLMs
Evaluating LLMs' reliability in high-stakes clinical reasoning tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Clinical Trial Natural Language Inference benchmark
Ground Knowledge and Meta-Level Reasoning Verification
Decoupling knowledge from reasoning with GKMRV
Authors

Maël Jullien
The University of Manchester
NLP, NLI

Marco Valentino
University of Sheffield
Natural Language Processing, Neurosymbolic AI, Explanation

André Freitas
Department of Computer Science, University of Manchester, UK; National Biomarker Centre, CRUK-MI, University of Manchester, UK; Idiap Research Institute, Switzerland