Why Chain of Thought Fails in Clinical Text Understanding

📅 2025-09-26

🤖 AI Summary

Problem: Chain-of-thought (CoT) prompting is widely used to improve performance and interpretability, yet it frequently fails on real-world electronic health record (EHR) understanding. Method: We conduct a systematic, large-scale evaluation of 95 LLMs on 87 real-world clinical text tasks spanning 9 languages and 8 task types. Fine-grained analyses cover reasoning length, medical concept alignment, and error profiles, with reliability assessed through both LLM-as-a-judge evaluation and clinical expert evaluation. Contribution/Results: 86.3% of models suffer consistent performance degradation in the CoT setting; more capable models remain relatively robust, while weaker ones decline substantially. The results highlight a paradox: CoT enhances interpretability but may undermine reliability in clinical text tasks. This challenges the default adoption of CoT in high-stakes healthcare settings and provides an empirical basis for clinical reasoning strategies that are both transparent and trustworthy.

📝 Abstract

Large language models (LLMs) are increasingly being applied to clinical care, a domain where both accuracy and transparent reasoning are critical for safe and trustworthy deployment. Chain-of-thought (CoT) prompting, which elicits step-by-step reasoning, has demonstrated improvements in performance and interpretability across a wide range of tasks. However, its effectiveness in clinical contexts remains largely unexplored, particularly in the context of electronic health records (EHRs), the primary source of clinical documentation, which are often lengthy, fragmented, and noisy. In this work, we present the first large-scale systematic study of CoT for clinical text understanding. We assess 95 advanced LLMs on 87 real-world clinical text tasks, covering 9 languages and 8 task types. Contrary to prior findings in other domains, we observe that 86.3% of models suffer consistent performance degradation in the CoT setting. More capable models remain relatively robust, while weaker ones suffer substantial declines. To better characterize these effects, we perform fine-grained analyses of reasoning length, medical concept alignment, and error profiles, leveraging both LLM-as-a-judge evaluation and clinical expert evaluation. Our results uncover systematic patterns in when and why CoT fails in clinical contexts, which highlight a critical paradox: CoT enhances interpretability but may undermine reliability in clinical text tasks. This work provides an empirical basis for clinical reasoning strategies of LLMs, highlighting the need for transparent and trustworthy approaches.
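The headline figure is a share of models whose score drops when CoT is switched on. The sketch below shows one simple way such a share could be computed. It assumes each model has a single mean score per prompting setting and counts any drop as degradation; the paper's actual criterion for "consistent" degradation is not given here and may be stricter. The `ModelScores` type, the `share_degraded` helper, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ModelScores:
    """Mean task score for one model under each prompting setting (hypothetical)."""
    name: str
    direct: float  # score when the model answers directly
    cot: float     # score when the model reasons step by step first


def share_degraded(scores: list[ModelScores]) -> float:
    """Return the fraction of models whose CoT score is below their direct score."""
    if not scores:
        raise ValueError("need at least one model")
    degraded = sum(1 for s in scores if s.cot < s.direct)
    return degraded / len(scores)


# Made-up numbers for illustration only; they are not results from the paper.
toy_scores = [
    ModelScores("model-a", direct=0.71, cot=0.64),
    ModelScores("model-b", direct=0.80, cot=0.81),
    ModelScores("model-c", direct=0.55, cot=0.41),
    ModelScores("model-d", direct=0.62, cot=0.60),
]

print(f"{share_degraded(toy_scores):.1%} of toy models degrade under CoT")
```

If the reported 86.3% is taken over all 95 models, it corresponds to 82 models, since 82/95 ≈ 86.3%.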

Problem

Research questions and friction points this paper is trying to address.

Investigating why chain-of-thought reasoning fails in clinical text understanding

Evaluating CoT performance degradation across 95 LLMs on clinical tasks

Analyzing the paradox between interpretability and reliability in clinical reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluated CoT prompting across 95 LLMs on 87 real-world clinical text tasks covering 9 languages and 8 task types

Analyzed reasoning length, medical concept alignment, and error profiles using LLM-as-a-judge and clinical expert evaluation

Found that 86.3% of models suffer consistent performance degradation under CoT, with weaker models declining most
