AI Summary
This study addresses a critical discrepancy in medical question answering: improvements in answer accuracy do not necessarily reflect more faithful reasoning. Through fine-grained, step-level auditing, the authors demonstrate for the first time in the medical domain that a student model trained via chain-of-thought distillation reaches 84.4% on MedQA-USMLE (SC@64, up from 74.7%), yet its reasoning error rate over non-abstained steps rises from 30.6% to 50.3%. This counterintuitive pattern persists across model scales, teacher capabilities, and segmentation strategies, and it goes undetected by conventional evaluation metrics. Combining LLM-based adjudication, blinded clinical expert review, and multidimensional controlled experiments, the work challenges the prevailing practice of assessing model reliability solely by final-answer correctness.
Abstract
Chain-of-thought (CoT) distillation trains a smaller model to imitate a teacher's reasoning trace, but it is typically evaluated by final-answer metrics including accuracy. We ask whether gains in answer quality are accompanied by improvements in the trace. In medical QA, where short answer options can leave a richer clinical justification under-specified, a Qwen3-8B student distilled from a DeepSeek-V3-family teacher improves on MedQA-USMLE answer metrics (SC@64 74.7% to 84.4%; expected calibration error (ECE) 0.096 to 0.034). Yet under a Kimi-K2.6 style-blind LLM-judge audit, its error rate over non-abstained steps rises from 30.6% to 50.3%. In this primary medical setting, answer quality and trace factuality move in opposite directions. This before-after pattern persists across evaluators, teacher strengths, student scales and families, medical benchmarks, and style, segmentation, and answer-correctness controls. A 150-step blinded audit by a clinical expert reproduces the same ordering. Boundary checks narrow the scope of the claim: the risk appears when a compact answer under-constrains the rationale and a capable student can imitate expert-like form without reliably grounding each local claim. Standard answer metrics and aggregate hedging rates do not reveal the shift. When such traces are released or reused, answer-level metrics alone are insufficient.
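To make the contrast between answer-level and trace-level metrics concrete, the following is a minimal Python sketch of the three quantities named in the abstract. It rests on assumptions not stated in the source: that SC@64 denotes a majority vote over 64 sampled answers, that ECE uses equal-width confidence bins, and that the judge labels each step as `"correct"`, `"error"`, or `"abstain"`. All function names and label strings are hypothetical; the authors' actual pipeline may differ.

```python
from collections import Counter


def self_consistency_answer(samples):
    """Majority vote over sampled answers (assumed meaning of SC@k).

    Ties go to the answer that appears first in `samples`.
    """
    return Counter(samples).most_common(1)[0][0]


def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE (an assumed binning scheme).

    Sums, over bins, the bin's share of examples times
    |bin accuracy - bin mean confidence|.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


def step_error_rate(labels):
    """Share of steps labeled "error" among non-abstained steps.

    Returns None when every step was abstained on.
    """
    judged = [label for label in labels if label != "abstain"]
    if not judged:
        return None
    return sum(1 for label in judged if label == "error") / len(judged)
```

Because the first two functions depend only on final answers and the third only on step labels, a model can improve on the former while getting worse on the latter, which is the pattern the abstract reports.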