Beyond Distillation: Pushing the Limits of Medical LLM Reasoning with Minimalist Rule-Based RL

📅 2025-05-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of enhancing large language models' (LLMs) reasoning capability and interpretability in complex clinical tasks without relying on supervised fine-tuning (SFT) or on chain-of-thought (CoT) data distilled from proprietary models (e.g., GPT-4o). The authors propose AlphaMed, the first purely reinforcement learning (RL)-driven medical LLM in which strong reasoning emerges without SFT or CoT supervision. Its contributions are threefold: (1) the first empirical demonstration that pure RL can induce reasoning emergence in medical LLMs; (2) identification of training-data informativeness, rather than sheer volume, as the critical driver of performance gains; and (3) a critical reexamination of prevailing medical question-answering evaluation, highlighting the need for more challenging, reasoning-oriented benchmarks. AlphaMed achieves state-of-the-art results across six medical QA benchmarks, outperforming larger open-weight models (e.g., DeepSeek-V3-671B) and closed-source models (e.g., Claude-3.5-Sonnet), while markedly improving generalization and interpretable reasoning in intricate clinical scenarios.

📝 Abstract
Improving performance on complex tasks and enabling interpretable decision making in large language models (LLMs), especially for clinical applications, requires effective reasoning. Yet this remains challenging without supervised fine-tuning (SFT) on costly chain-of-thought (CoT) data distilled from closed-source models (e.g., GPT-4o). In this work, we present AlphaMed, the first medical LLM to show that reasoning capability can emerge purely through reinforcement learning (RL), using minimalist rule-based rewards on public multiple-choice QA datasets, without relying on SFT or distilled CoT data. AlphaMed achieves state-of-the-art results on six medical QA benchmarks, outperforming models trained with conventional SFT+RL pipelines. On challenging benchmarks (e.g., MedXpert), AlphaMed even surpasses larger or closed-source models such as DeepSeek-V3-671B and Claude-3.5-Sonnet. To understand the factors behind this success, we conduct a comprehensive data-centric analysis guided by three questions: (i) Can minimalist rule-based RL incentivize reasoning without distilled CoT supervision? (ii) How do dataset quantity and diversity impact reasoning? (iii) How does question difficulty shape the emergence and generalization of reasoning? Our findings show that dataset informativeness is a key driver of reasoning performance, and that minimalist RL on informative, multiple-choice QA data is effective at inducing reasoning without CoT supervision. We also observe divergent trends across benchmarks, underscoring limitations in current evaluation and the need for more challenging, reasoning-oriented medical QA benchmarks.
Problem

Research questions and friction points this paper is trying to address.

Enhancing medical LLM reasoning without costly supervised fine-tuning
Achieving state-of-the-art performance using rule-based RL on public datasets
Understanding dataset impact on reasoning emergence in medical QA
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses minimalist rule-based RL rewards
Eliminates need for supervised fine-tuning
Leverages public multiple-choice QA datasets