Med-RLVR: Emerging Medical Reasoning from a 3B Base Model via Reinforcement Learning

📅 2025-02-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates whether medical reasoning capabilities can emerge in foundation language models without explicit reasoning-step annotations, leveraging reinforcement learning from verifiable rewards (RLVR). Method: We introduce RLVR to the medical domain for the first time, constructing sparse, verifiable reward signals from deterministic answers to medical multiple-choice questions (MCQs) to guide autonomous reasoning-path evolution in a 3B-parameter language model. Contributions/Results: (1) Reasoning capability emerges spontaneously—without any supervision on intermediate reasoning steps; (2) the model achieves performance on in-distribution tasks comparable to supervised fine-tuning; and (3) it demonstrates significantly improved out-of-distribution generalization, with an absolute accuracy gain of 8 percentage points. Collectively, this work establishes a novel paradigm for low-supervision, high-credibility medical AI reasoning—relying solely on answer-level verification rather than step-by-step annotation—thereby advancing trustworthy, scalable reasoning in resource-constrained medical AI settings.
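The core mechanism is answer-level verification: the reward checks only the final MCQ choice against the deterministic gold answer, with no supervision on intermediate reasoning. The paper does not specify an exact output format, so the sketch below assumes a hypothetical convention in which the model ends each rollout with a line like "Answer: B"; the function name and format are illustrative, not from the paper.

```python
import re

def mcq_reward(completion: str, gold_choice: str) -> float:
    """Sparse, verifiable reward for one medical MCQ rollout.

    Only the final answer letter is checked; the reasoning path
    preceding it is never graded. (Hypothetical "Answer: X" format,
    assumed for illustration.)
    """
    match = re.search(r"Answer:\s*([A-E])\b", completion, re.IGNORECASE)
    if match is None:
        return 0.0  # unparseable output earns no reward
    return 1.0 if match.group(1).upper() == gold_choice.upper() else 0.0
```

A reward of this shape is what makes the signal "verifiable": it is computed deterministically from the dataset label, so it cannot be gamed by plausible-sounding but wrong reasoning.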

📝 Abstract
Reinforcement learning from verifiable rewards (RLVR) has recently gained attention for its ability to elicit self-evolved reasoning capabilities from base language models without explicit reasoning supervision, as demonstrated by DeepSeek-R1. While prior work on RLVR has primarily focused on mathematical and coding domains, its applicability to other tasks and domains remains unexplored. In this work, we investigate whether medical reasoning can emerge from RLVR. We introduce Med-RLVR as an initial study of RLVR in the medical domain, leveraging medical multiple-choice question answering (MCQA) data as verifiable labels. Our results demonstrate that RLVR is not only effective for math and coding but also extends successfully to medical question answering. Notably, Med-RLVR achieves performance comparable to traditional supervised fine-tuning (SFT) on in-distribution tasks while significantly improving out-of-distribution generalization, with an 8-point accuracy gain. Further analysis of training dynamics reveals that, with no explicit reasoning supervision, reasoning emerges from the 3B-parameter base model. These findings underscore the potential of RLVR in domains beyond math and coding, opening new avenues for its application in knowledge-intensive fields such as medicine.
Problem

Research questions and friction points this paper is trying to address.

Can medical reasoning emerge via RLVR without explicit reasoning-step supervision?
Prior RLVR work targets math and coding; its applicability to other domains is unexplored.
Can answer-level rewards from medical MCQA labels match SFT in-distribution while generalizing better out of distribution?
Innovation

Methods, ideas, or system contributions that make the work stand out.

First application of reinforcement learning from verifiable rewards (RLVR) to the medical domain
Medical multiple-choice question answering (MCQA) answers as verifiable reward labels
Emergent reasoning from a 3B-parameter base model without step-by-step annotation