O1 Replication Journey -- Part 3: Inference-time Scaling for Medical Reasoning

📅 2025-01-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the poor interpretability and opaque decision logic of large language models (LLMs) in medical reasoning. To this end, we propose a medical-domain enhancement method grounded in inference-time chain-of-thought (CoT) scaling. Methodologically, our approach integrates hypothesis-driven differential diagnosis modeling, journey learning, and lightweight medical fine-tuning, enabling construction of an interpretable reasoning framework with only 500 supervised examples. We provide the first empirical evidence that CoT length positively correlates with medical task complexity, and we demonstrate the generation of structured, clinically coherent differential diagnosis lists. On mainstream medical benchmarks, including MedQA, our method achieves absolute accuracy gains of 6-11% over strong baselines. These results establish a novel paradigm for few-shot medical AI reasoning that delivers both enhanced performance and rigorous interpretability.

📝 Abstract
Building upon our previous investigations of O1 replication (Part 1: Journey Learning [Qin et al., 2024] and Part 2: Distillation [Huang et al., 2024]), this work explores the potential of inference-time scaling in large language models (LLMs) for medical reasoning tasks, ranging from diagnostic decision-making to treatment planning. Through extensive experiments on medical benchmarks of varying complexity (MedQA, Medbullets, and JAMA Clinical Challenges), our investigation reveals several key insights: (1) Increasing inference time does lead to improved performance. With a modest training set of 500 samples, our model yields substantial performance improvements of 6%-11%. (2) Task complexity directly correlates with the required length of reasoning chains, confirming the necessity of extended thought processes for challenging problems. (3) The differential diagnoses generated by our model adhere to the principles of the hypothetico-deductive method, producing a list of potential conditions that may explain a patient's symptoms and systematically narrowing these possibilities by evaluating the evidence. These findings demonstrate the promising synergy between inference-time scaling and journey learning in advancing LLMs' real-world clinical reasoning capabilities.
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Medical Decision Making
Diagnostic Reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Enhanced Reasoning
Medical Decision Making
Journey Learning Integration
Authors

Zhongzhen Huang
Shanghai Jiao Tong University
Medical Image Analysis, Vision and Language

Gui Geng
SPIRAL Lab

Shengyi Hua
Shanghai Jiao Tong University, SPIRAL Lab

Zhen Huang
Generative AI Research Lab (GAIR)

Haoyang Zou
Undergrad, Fudan University
Natural Language Processing, Machine Learning, Generative AI, Large Language Models

Shaoting Zhang
Shanghai AI Lab; SenseTime Research
Medical Image Analysis, Computer Vision, Foundation Models

Pengfei Liu
Shanghai Jiao Tong University, SII, Generative AI Research Lab (GAIR)

Xiaofan Zhang
Shanghai Jiao Tong University, SPIRAL Lab