ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning

📅 2025-06-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limited reasoning capabilities of foundation models in knowledge-intensive medical question answering, this paper introduces ReasonMed, the largest high-quality medical reasoning dataset to date (370K samples), and proposes a multi-agent collaborative verification and refinement framework integrating error localization, correction, and reasoning-path distillation. The authors adopt a joint fine-tuning paradigm combining detailed chain-of-thought (CoT) rationales with concise final answers, significantly enhancing the reasoning performance of small language models. The resulting ReasonMed-7B model outperforms the prior best sub-10B models by 4.17% and even exceeds LLaMA3.1-70B on PubMedQA by 4.60%, establishing a new benchmark for medical reasoning.

📝 Abstract
Though reasoning-based large language models (LLMs) have excelled in mathematics and programming, their capabilities in knowledge-intensive medical question answering remain underexplored. To address this, we introduce ReasonMed, the largest medical reasoning dataset, comprising 370k high-quality examples distilled from 1.7 million initial reasoning paths generated by various LLMs. ReasonMed is constructed through a multi-agent verification and refinement process, where we design an Error Refiner to enhance the reasoning paths by identifying and correcting error-prone steps flagged by a verifier. Leveraging ReasonMed, we systematically investigate best practices for training medical reasoning models and find that combining detailed Chain-of-Thought (CoT) reasoning with concise answer summaries yields the most effective fine-tuning strategy. Based on this strategy, we train ReasonMed-7B, which sets a new benchmark for sub-10B models, outperforming the prior best by 4.17% and even exceeding LLaMA3.1-70B on PubMedQA by 4.60%.
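The verification-and-refinement loop described in the abstract could be sketched roughly as follows. Everything here is an illustrative stand-in, not the paper's actual agents or prompts: `distill` captures the control flow (verify each candidate path, pass flagged paths through an Error Refiner once, discard what still fails), and the toy `toy_verify`/`toy_refine` functions are hypothetical placeholders for the LLM-based verifier and refiner.

```python
# Minimal sketch of the multi-agent distillation loop: candidate reasoning
# paths are checked by a verifier, and flagged paths get one refinement
# pass before being kept or discarded. All agents are illustrative
# stand-ins, not the paper's actual components.

def distill(candidates, gold, verify, refine):
    """Return the first reasoning path that passes verification,
    allowing one refinement pass over error-prone steps."""
    for path in candidates:
        flagged = verify(path, gold)           # indices of error-prone steps
        if not flagged:
            return path                        # clean path: keep as-is
        refined = refine(path, flagged, gold)  # Error Refiner corrects flagged steps
        if not verify(refined, gold):
            return refined
    return None                                # no path survives: drop the example


# Toy stand-ins: a path is a list of steps ending in a final answer.
def toy_verify(path, gold):
    return [] if path[-1] == gold else [len(path) - 1]

def toy_refine(path, flagged, gold):
    fixed = list(path)
    for i in flagged:
        fixed[i] = gold                        # "correct" the flagged step
    return fixed
```

For example, `distill([["recall drug class", "eliminate options", "C"]], "B", toy_verify, toy_refine)` flags the final step, refines it, and returns the corrected path ending in `"B"`.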
Problem

Research questions and friction points this paper is trying to address.

Advancing medical reasoning in large language models
Creating a high-quality medical reasoning dataset
Improving model performance on medical question answering
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent verification and refinement process
Error Refiner identifies and corrects errors
Combines Chain-of-Thought with concise summaries
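The combined fine-tuning format the paper reports as most effective, a detailed CoT rationale followed by a concise final answer, might be serialized like this. The field layout and `Step N:`/`Answer:` labels are assumptions for illustration, not the released dataset's schema.

```python
# Rough sketch of one supervised fine-tuning target pairing a detailed
# chain-of-thought rationale with a concise final answer. The exact
# serialization is an assumption, not the paper's published format.

def format_target(cot_steps, final_answer):
    """Serialize a training target: full rationale, then a terse summary."""
    rationale = "\n".join(f"Step {i + 1}: {step}" for i, step in enumerate(cot_steps))
    return f"{rationale}\nAnswer: {final_answer}"
```

The idea is that the model learns both to reason in full and to commit to a short, checkable answer in a single target sequence.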
Yu Sun
Alibaba DAMO Academy, School of Basic Medical Sciences, Lanzhou University
Xingyu Qian
Alibaba DAMO Academy, Gaoling School of Artificial Intelligence, Renmin University of China, Beijing Key Laboratory of Research on Large Models and Intelligent Governance, Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE
Weiwen Xu
The Chinese University of Hong Kong
Natural Language Processing
Hao Zhang
Alibaba DAMO Academy
Chenghao Xiao
Durham University
Natural Language Processing, Information Retrieval, Representation Learning
Long Li
Research Staff Member, Inspur Group Co., Ltd.
Software Defined Networking, Network Performance Optimization
Yu Rong
Alibaba DAMO Academy, Hupan Lab
Wenbing Huang
Associate Professor, Renmin University of China
Machine Learning, AI for Science
Qifeng Bai
School of Basic Medical Sciences, Lanzhou University
Tingyang Xu
Alibaba DAMO Academy
Machine Learning, Deep Graph Learning, Drug Discovery