ClinicalGPT-R1: Pushing reasoning capability of generalist disease diagnosis with large language model

📅 2025-04-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Large language models (LLMs) exhibit limited multi-step causal reasoning capabilities for complex clinical diagnosis. Method: We propose a reasoning-enhanced LLM oriented toward general practice. Our approach introduces a multi-stage reasoning reinforcement paradigm specific to clinical diagnosis; performs domain-adaptive fine-tuning and structured prompt optimization on 20,000 real-world clinical records; and constructs MedBench-Hard, a bilingual, high-difficulty evaluation benchmark covering seven medical specialties. Contribution/Results: On MedBench-Hard, our model achieves higher Chinese diagnostic accuracy than GPT-4o and performs comparably to GPT-4 in English. It represents the first systematic improvement of LLMs’ higher-order diagnostic reasoning in authentic clinical settings. Both the model and MedBench-Hard are open-sourced, establishing a new paradigm and foundational infrastructure for trustworthy clinical AI.

📝 Abstract
Recent advances in reasoning with large language models (LLMs) have shown remarkable capabilities in domains such as mathematics and coding, yet their application to clinical diagnosis remains underexplored. Here, we introduce ClinicalGPT-R1, a reasoning-enhanced generalist large language model for disease diagnosis. Trained on a dataset of 20,000 real-world clinical records, ClinicalGPT-R1 leverages diverse training strategies to enhance diagnostic reasoning. To benchmark performance, we curated MedBench-Hard, a challenging dataset spanning seven major medical specialties and representative diseases. Experimental results demonstrate that ClinicalGPT-R1 outperforms GPT-4o in Chinese diagnostic tasks and achieves performance comparable to GPT-4 in English settings. This comparison supports the effectiveness of ClinicalGPT-R1 in disease diagnosis tasks. Resources are available at https://github.com/medfound/medfound.
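
The abstract describes a benchmark comparison across languages and specialties but does not state the scoring rule here. As a minimal sketch of how such a comparison could be tabulated, the snippet below computes accuracy per (language, specialty) group. The normalized exact-match criterion, the record fields, the function name `accuracy_by_group`, and the toy records are all assumptions made for illustration; none of them come from the paper or from MedBench-Hard.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return {(language, specialty): accuracy}.

    A prediction counts as correct when it equals the reference diagnosis
    after trimming whitespace and lowercasing (an assumed scoring rule).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        key = (r["language"], r["specialty"])
        total[key] += 1
        if r["predicted"].strip().lower() == r["reference"].strip().lower():
            correct[key] += 1
    return {key: correct[key] / total[key] for key in total}


# Toy records invented for illustration only (not benchmark data).
records = [
    {"language": "en", "specialty": "specialty_a",
     "predicted": "Diagnosis X", "reference": "diagnosis x"},
    {"language": "en", "specialty": "specialty_a",
     "predicted": "Diagnosis Y", "reference": "Diagnosis Z"},
    {"language": "zh", "specialty": "specialty_b",
     "predicted": "Diagnosis W", "reference": "Diagnosis W"},
]

scores = accuracy_by_group(records)
for (language, specialty), acc in sorted(scores.items()):
    print(f"{language:>2} {specialty:<12} accuracy={acc:.2f}")
```

Grouping by language as well as specialty keeps the Chinese and English results separate, which matters when a model is compared against different baselines in each language.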

Problem

Research questions and friction points this paper is trying to address.

Enhancing clinical diagnosis reasoning with LLMs
Addressing underexplored LLM applications in medicine
Benchmarking diagnostic performance across medical specialties

Innovation

Methods, ideas, or system contributions that make the work stand out.

Reasoning-enhanced large language model for diagnosis
Trained on 20,000 real-world clinical records
Outperforms GPT-4o in Chinese diagnostic tasks

Authors

Wuyang Lan
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications

Wenzheng Wang
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications

Changwei Ji
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications

Guoxing Yang
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications

Yongbo Zhang
South China Hospital, Medical School, Shenzhen University

Xiaohong Liu
South China Hospital, Medical School, Shenzhen University

Song Wu
Southwest University
Computer Vision, Machine Learning, Deep learning, Multimedia

Guangyu Wang
Houston Methodist
Bioinformatics, Computational biology, AI, epigenetics