🤖 AI Summary
This study addresses the poor interpretability and opaque decision logic of large language models (LLMs) in medical reasoning. To this end, we propose a medical-domain enhancement method grounded in inference-time chain-of-thought (CoT) scaling. Methodologically, our approach integrates hypothesis-driven differential diagnosis modeling, journey learning, and lightweight medical fine-tuning, enabling the construction of an interpretable reasoning framework from only 500 supervised examples. We provide empirical evidence that CoT length positively correlates with medical task complexity, and we demonstrate the generation of structured, clinically coherent differential diagnosis lists. On mainstream medical benchmarks, including MedQA, our method achieves absolute accuracy gains of 6%–11% over strong baselines. These results point toward a few-shot paradigm for medical AI reasoning that delivers both improved performance and greater interpretability.
📝 Abstract
Building upon our previous investigations of O1 replication (Part 1: Journey Learning [Qin et al., 2024] and Part 2: Distillation [Huang et al., 2024]), this work explores the potential of inference-time scaling in large language models (LLMs) for medical reasoning tasks, ranging from diagnostic decision-making to treatment planning. Through extensive experiments on medical benchmarks of varying complexity (MedQA, Medbullets, and JAMA Clinical Challenges), our investigation reveals several key insights: (1) Increasing inference time does lead to improved performance: with a modest training set of 500 samples, our model yields substantial performance gains of 6%–11%. (2) Task complexity directly correlates with the required length of the reasoning chain, confirming the necessity of extended thought processes for challenging problems. (3) The differential diagnoses generated by our model adhere to the principles of the hypothetico-deductive method: the model produces a list of potential conditions that may explain a patient's symptoms and systematically narrows these possibilities by evaluating the evidence. These findings demonstrate the promising synergy between inference-time scaling and journey learning in advancing LLMs' real-world clinical reasoning capabilities.
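The hypothetico-deductive loop described in insight (3), generating a broad hypothesis list and then narrowing it against the observed evidence, can be sketched in miniature. This is a toy illustration only, not the paper's method: the condition names, findings, and scoring rule (+1 per supporting finding present, −1 per contradicting finding present) are all hypothetical assumptions chosen for clarity.

```python
# Toy hypothetico-deductive narrowing: score each candidate diagnosis
# against the observed findings, then keep only positively supported
# hypotheses, ranked best-first. All clinical content is illustrative.

FINDINGS = {"fever", "productive cough", "pleuritic chest pain"}

# Each candidate lists findings that support it and findings that
# would argue against it (hypothetical knowledge base).
CANDIDATES = {
    "pneumonia": {
        "supports": {"fever", "productive cough", "pleuritic chest pain"},
        "refutes": set(),
    },
    "pulmonary embolism": {
        "supports": {"pleuritic chest pain"},
        "refutes": {"productive cough"},
    },
    "common cold": {
        "supports": {"fever"},
        "refutes": {"pleuritic chest pain"},
    },
}

def narrow(findings, candidates):
    """Rank hypotheses: +1 per supporting finding observed, -1 per refuting one."""
    scored = [
        (dx, len(ev["supports"] & findings) - len(ev["refutes"] & findings))
        for dx, ev in candidates.items()
    ]
    # Discard hypotheses the evidence does not favor; strongest first.
    return sorted((s for s in scored if s[1] > 0), key=lambda s: -s[1])

ranked = narrow(FINDINGS, CANDIDATES)
print(ranked)  # → [('pneumonia', 3)]
```

Here the contradicted hypotheses cancel out and drop from the list, mirroring how the model's differential narrows as evidence is weighed; a real system would of course score evidence probabilistically rather than with unit weights.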