Better Accuracies, Worse Reasoning: A Step-Level Audit of Medical Chain-of-Thought Distillation

📅 2026-05-27
🤖 AI Summary
This study addresses a critical discrepancy in medical question answering: gains in answer accuracy do not necessarily reflect better reasoning fidelity. Through fine-grained, step-level auditing, the authors demonstrate, for the first time in the medical domain, that a student model trained via chain-of-thought distillation reaches 84.4% on MedQA-USMLE (SC@64) while its step-level reasoning error rate rises from 30.6% to 50.3%. This counterintuitive pattern persists across model scales, teacher capabilities, and segmentation strategies, and conventional answer-level metrics do not detect it. Combining LLM-based adjudication, blinded clinical expert review, and multidimensional controlled experiments, the work challenges the prevailing practice of assessing model reliability by final-answer correctness alone.
📝 Abstract
Chain-of-thought (CoT) distillation trains a smaller model to imitate a teacher's reasoning trace, but it is typically evaluated by final-answer metrics including accuracy. We ask whether gains in answer quality are accompanied by improvements in the trace. In medical QA, where short answer options can leave a richer clinical justification under-specified, a Qwen3-8B student distilled from a DeepSeek-V3-family teacher improves on MedQA-USMLE answer metrics (SC@64 74.7% to 84.4%; expected calibration error (ECE) 0.096 to 0.034). Yet under a Kimi-K2.6 style-blind LLM-judge audit, its error rate over non-abstained steps rises from 30.6% to 50.3%. In this primary medical setting, answer quality and trace factuality move in opposite directions. This before-after pattern persists across evaluators, teacher strengths, student scales and families, medical benchmarks, and style, segmentation, and answer-correctness controls. A 150-step blinded audit by a clinical expert reproduces the same ordering. Boundary checks narrow the scope of the claim: the risk appears when a compact answer under-constrains the rationale and a capable student can imitate expert-like form without reliably grounding each local claim. Standard answer metrics and aggregate hedging rates do not reveal the shift. When such traces are released or reused, answer-level metrics alone are insufficient.
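The three quantities contrasted in the abstract can be illustrated with a minimal Python sketch. It is not the authors' evaluation code. It assumes that SC@64 denotes a majority vote over 64 sampled answers, that ECE uses equal-width confidence bins, and that each audited step carries one of the hypothetical labels `"ok"`, `"error"`, or `"abstain"`. The function names and the bin count are illustrative choices as well.

```python
from collections import Counter


def self_consistency_answer(samples):
    """Majority vote over sampled final answers (assumed reading of SC@k).

    Ties go to the answer that appeared first in `samples`.
    """
    return Counter(samples).most_common(1)[0][0]


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: weighted mean of |accuracy - confidence|.

    The bin count is an assumption; the abstract does not state one.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, is_correct))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


def step_error_rate(step_labels):
    """Share of erroneous steps among steps the judge did not abstain on.

    The labels "ok", "error", and "abstain" are hypothetical placeholders.
    """
    judged = [label for label in step_labels if label != "abstain"]
    if not judged:
        return float("nan")
    return sum(label == "error" for label in judged) / len(judged)


if __name__ == "__main__":
    print(self_consistency_answer(["A", "B", "A", "C"]))
    print(expected_calibration_error([0.9, 0.9, 0.6, 0.6], [1, 1, 1, 0]))
    print(step_error_rate(["ok", "error", "abstain", "error"]))
```

The first two functions only look at final answers, while the third only looks at individual steps, so a model can improve on the former and degrade on the latter at the same time.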
Problem

Research questions and friction points this paper is trying to address.

Chain-of-thought distillation
Medical QA
Reasoning trace
Factuality
Answer accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Thought Distillation
Step-Level Audit
Medical Reasoning
Trace Factuality
Model Calibration
Zhaoyang Jiang
School of Health & Wellbeing, University of Glasgow, Glasgow, UK
Xuanqi Peng
School of Health & Wellbeing, University of Glasgow, Glasgow, UK
Fei Teng
Reader in Intelligent Energy Systems, Imperial College London
Stability-constrained Optimisation, Cyber-resilient System Operation, Data Privacy and Trading
Zhizhong Fu
School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
Yunsoo Kim
Institute of Health Informatics, University College London, London, UK
Jiacong Mi
School of Health & Wellbeing, University of Glasgow, Glasgow, UK
Zicheng Li
School of Health & Wellbeing, University of Glasgow, Glasgow, UK
Honghan Wu
Professor of Health Informatics and AI, University of Glasgow
AI in medicine, Health Informatics