FineMedLM-o1: Enhancing the Medical Reasoning Ability of LLM from Supervised Fine-Tuning to Test-Time Training

📅 2025-01-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large medical foundation models exhibit insufficient reasoning capability in complex clinical scenarios, particularly differential diagnosis and personalized treatment planning. To address this, we propose the first large language model explicitly designed for deep medical reasoning. Our method introduces test-time training (TTT) to the medical domain for the first time and establishes a three-stage training paradigm combining supervised fine-tuning (SFT), direct preference optimization (DPO), and TTT. We further design a high-quality synthesis framework that generates realistic, clinically grounded multi-turn medical dialogues. On key medical benchmarks, our model achieves an average 23% improvement over prior baselines, and integrating TTT yields an additional 14% gain. As a contribution to the community, we open-source the first high-complexity, multi-turn, differential-diagnosis-oriented medical dialogue dataset alongside the corresponding model, providing both a trustworthy AI paradigm for clinical decision support and foundational resources for future research.

📝 Abstract
Recent advancements in large language models (LLMs) have shown promise in medical applications such as disease diagnosis and treatment planning. However, most existing medical LLMs struggle with the advanced reasoning required for complex clinical scenarios, such as differential diagnosis or personalized treatment suggestions. We propose FineMedLM-o1, which leverages high-quality synthetic medical data and long-form reasoning data for Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), enabling advanced dialogue and deep reasoning capabilities. Additionally, we introduce Test-Time Training (TTT) in the medical domain for the first time, facilitating domain adaptation and ensuring reliable, accurate reasoning. Experimental results demonstrate that FineMedLM-o1 achieves a 23% average performance improvement over prior models on key medical benchmarks. Furthermore, the introduction of TTT provides an additional 14% performance boost, highlighting its effectiveness in enhancing medical reasoning capabilities. To support this process, we also propose a novel method for synthesizing medical dialogue. Compared to other open-source datasets, our dataset stands out as superior in both quality and complexity. The project and data will be released on GitHub.
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Medical Diagnosis
Personalized Treatment
Innovation

Methods, ideas, or system contributions that make the work stand out.

FineMedLM-o1
Test-Time Training
Medical Decision-Making
👥 Authors
Hongzhou Yu, School of Computer Science, Fudan University, Shanghai, China
Tianhao Cheng, Fudan University
Ying Cheng, School of Computer Science, Fudan University, Shanghai, China
Rui Feng, School of Computer Science, Fudan University, Shanghai, China