🤖 AI Summary
Small language models (SLMs) significantly underperform large language models (LLMs) on complex clinical reasoning tasks such as diagnosis, treatment planning, and medical knowledge integration. To address this, we propose MedCritical, a two-stage self-collaborative correction framework: (1) a supervised fine-tuning stage in which long chain-of-thought templates extracted from a teacher LLM guide the SLM to generate high-quality reasoning trajectories; and (2) a self-iterative Direct Preference Optimization (DPO) stage, in which the SLM learns from preference pairs built from its own error-correction trajectories, eliminating reliance on continuous LLM-as-judge evaluation. This enables efficient, low-cost knowledge consolidation and removes a key cost bottleneck of conventional knowledge distillation. On the CMExam benchmark, MedCritical-7B achieves 75.21% accuracy, surpassing Taiyi and Huatuo-o1-7B by 3.04 and 10.12 percentage points respectively, and establishes new state-of-the-art performance among 7B-scale models. Its accuracy rivals that of LLM-distilled counterparts while substantially reducing training cost.
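For reference, the self-iterative stage builds on the standard DPO objective (Rafailov et al., 2023). How MedCritical instantiates the chosen/rejected pair, the self-corrected trajectory versus the original erroneous one, is our reading of the abstract below rather than a detail this summary states explicitly:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]
$$

where $x$ is the clinical question, $y_w$ the preferred (self-corrected) trajectory, $y_l$ the dispreferred (original erroneous) trajectory, $\pi_{\mathrm{ref}}$ a frozen reference policy (plausibly the stage-1 SFT model), and $\beta$ a coefficient controlling deviation from the reference.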
📝 Abstract
In the field of medicine, complex reasoning tasks such as clinical diagnosis, treatment planning, and medical knowledge integration pose significant challenges, and small language models often underperform large language models like GPT-4 and DeepSeek on them. Recent knowledge distillation-based methods address these issues through teacher-guided error correction, but the LLM-as-judge approach they rely on remains costly, slow, and inefficient. To circumvent this issue, we propose MedCritical, a novel two-stage framework in which a small language model, fine-tuned with guidance from a large teacher model, plays against itself. In the first stage, we extract high-level and detailed long chain-of-thought templates from the teacher model to guide the student model toward more complex reasoning. In the second stage, we introduce direct preference optimization (DPO) through model self-iteration: during training, the student model plays against the correction trajectories produced by its fine-tuned self, strengthening its reasoning ability. This self-learning DPO approach teaches the student model to consolidate its skills and knowledge from its own error-driven insights when solving complex problems, and it achieves results comparable to traditional teacher-based knowledge distillation at a lower cost. Notably, our MedCritical 7B model outperforms the Taiyi and Huatuo-o1-7B models by 3.04% and 10.12% respectively on the CMExam benchmark, achieving new state-of-the-art performance among 7B-class small models.
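A minimal sketch of how the self-iterative DPO stage could be implemented with a recent version of Hugging Face TRL, assuming preference pairs are formed from the model's own failed and self-corrected trajectories. The checkpoint path, helper functions, re-prompt wording, and answer-checking heuristic are all illustrative placeholders, not the paper's actual code:

```python
# Illustrative sketch of the self-iterative DPO stage (stage 2).
# Assumptions (not from the paper): the stage-1 SFT checkpoint path,
# the correction re-prompt, and the answer-checking heuristic.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

SFT_MODEL = "path/to/stage1-sft-slm"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(SFT_MODEL)
model = AutoModelForCausalLM.from_pretrained(SFT_MODEL)

def generate_trajectory(prompt: str) -> str:
    """Sample one chain-of-thought trajectory from the SLM."""
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.8)
    return tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

def is_correct(trajectory: str, gold: str) -> bool:
    """Toy answer check: does the trajectory end with the gold option (A-E)?"""
    return trajectory.strip().endswith(gold)

# Placeholder items; CMExam entries would supply real question/answer pairs.
exam_items = [{"prompt": "Q: ... Answer with one of A-E.", "gold": "B"}]

pairs = []
for ex in exam_items:
    first = generate_trajectory(ex["prompt"])
    if is_correct(first, ex["gold"]):
        continue  # no error, nothing to correct
    # Self-collaboration: re-prompt the same model to revise its own mistake.
    retry = ex["prompt"] + "\nYour previous answer was wrong; re-examine your reasoning step by step."
    fixed = generate_trajectory(retry)
    if is_correct(fixed, ex["gold"]):
        # The corrected trajectory is preferred over the original erroneous one.
        pairs.append({"prompt": ex["prompt"], "chosen": fixed, "rejected": first})

# DPO on the model's own (corrected, erroneous) pairs; no LLM judge in the loop.
trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="medcritical-dpo", beta=0.1),
    train_dataset=Dataset.from_list(pairs),
    processing_class=tok,
)
trainer.train()
```

Because the rejected trajectory is always the model's own earlier output, the preference signal comes from an answer key rather than a teacher LLM, which is what removes the per-sample LLM-evaluation cost the abstract highlights.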