🤖 AI Summary
To address the limited reasoning capabilities of foundation models in knowledge-intensive medical question answering, this paper introduces ReasonMed, the largest high-quality medical reasoning dataset to date (370K samples), and proposes a multi-agent collaborative verification and refinement framework integrating error localization, correction, and reasoning-path distillation. Notably, the authors adopt a joint fine-tuning paradigm combining detailed chain-of-thought (CoT) rationales with concise answer summaries, significantly enhancing the reasoning performance of small language models. The resulting ReasonMed-7B model surpasses all prior state-of-the-art models under 10B parameters by 4.17% and even outperforms LLaMA3.1-70B on PubMedQA by 4.60%, establishing a new benchmark for medical reasoning.
📝 Abstract
Though reasoning-based large language models (LLMs) have excelled in mathematics and programming, their capabilities in knowledge-intensive medical question answering remain underexplored. To address this, we introduce ReasonMed, the largest medical reasoning dataset, comprising 370k high-quality examples distilled from 1.7 million initial reasoning paths generated by various LLMs. ReasonMed is constructed through a multi-agent verification and refinement process, where we design an Error Refiner to enhance the reasoning paths by identifying and correcting error-prone steps flagged by a verifier. Leveraging ReasonMed, we systematically investigate best practices for training medical reasoning models and find that combining detailed Chain-of-Thought (CoT) reasoning with concise answer summaries yields the most effective fine-tuning strategy. Based on this strategy, we train ReasonMed-7B, which sets a new benchmark for sub-10B models, outperforming the prior best by 4.17% and even exceeding LLaMA3.1-70B on PubMedQA by 4.60%.
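The verify-then-refine loop described above can be sketched in a few lines. This is a hypothetical toy illustration, not the paper's implementation: the generator outputs, verifier, and Error Refiner would be LLM agents in practice, and all names (`ReasoningPath`, `verify`, `refine`, `curate`) are invented for this sketch.

```python
# Toy sketch of a multi-agent verify-and-refine curation loop.
# In ReasonMed these roles are played by LLM agents; here they are
# simple stand-ins so the control flow is easy to see.

from dataclasses import dataclass, field

@dataclass
class ReasoningPath:
    steps: list                                  # ordered chain-of-thought steps
    flagged: list = field(default_factory=list)  # step indices flagged by the verifier

def verify(path: ReasoningPath) -> bool:
    """Toy verifier: flags any step containing an error marker."""
    path.flagged = [i for i, s in enumerate(path.steps) if "ERROR" in s]
    return not path.flagged

def refine(path: ReasoningPath) -> ReasoningPath:
    """Toy Error Refiner: rewrites only the flagged steps, leaving the rest intact."""
    for i in path.flagged:
        path.steps[i] = path.steps[i].replace("ERROR", "corrected")
    path.flagged = []
    return path

def curate(candidate_paths, max_rounds: int = 2):
    """Keep paths that pass verification, refining failed paths up to max_rounds."""
    kept = []
    for path in candidate_paths:
        for _ in range(max_rounds):
            if verify(path):
                kept.append(path)
                break
            path = refine(path)
    return kept

paths = [ReasoningPath(["recall anatomy", "dose calc with ERROR"]),
         ReasoningPath(["clean reasoning step"])]
dataset = curate(paths)
```

Here the first path fails verification, gets its flagged step rewritten, and passes on the second round; the second path passes immediately, so both survive curation.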