MedGR$^2$: Breaking the Data Barrier for Medical Reasoning via Generative Reward Learning

📅 2025-08-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Medical vision-language models (VLMs) suffer from poor generalization in supervised fine-tuning (SFT) due to scarcity of high-quality expert annotations, while reinforcement learning (RL) remains impractical owing to the absence of reliable reward signals. To address this, we propose MedGR²—the first generative reward learning framework tailored for medical AI. MedGR² establishes a self-enhancing closed loop—“generate → evaluate → optimize → regenerate”—by jointly training a data generator and a reward model, enabling continuous autonomous generation of high-fidelity multimodal medical data. It integrates generative reward modeling, SFT, and group-relative policy optimization (GRPO), facilitating efficient training of compact models. Experiments show that SFT using solely generated data already surpasses baselines trained on large-scale human-annotated datasets; further incorporating GRPO achieves state-of-the-art performance across cross-modal and cross-task benchmarks, with small models matching the performance of foundation models ten times larger in parameter count.
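The GRPO step mentioned above scores a group of sampled responses with the reward model and normalizes each reward against the group's statistics. The following is a minimal illustrative sketch of that group-relative advantage computation, not the paper's implementation; the reward values and function name are hypothetical.

```python
# Illustrative sketch of group-relative advantage computation (GRPO-style):
# for each prompt, sample a group of responses, score them with a reward
# model, and normalize each reward against the group mean and std.

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize per-response rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards assigned by a reward model to 4 sampled answers
advantages = group_relative_advantages([0.9, 0.4, 0.1, 0.6])
```

Responses scoring above the group mean receive positive advantages and are reinforced; those below the mean are penalized, so no separate value network is needed.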

📝 Abstract
The application of Vision-Language Models (VLMs) in medicine is critically hampered by the scarcity of high-quality, expert-annotated data. Supervised Fine-Tuning (SFT) on existing datasets often leads to poor generalization on unseen modalities and tasks, while Reinforcement Learning (RL), a promising alternative, is stymied by the lack of reliable reward signals in this data-scarce domain. To break this impasse, we introduce Generative Reward Learning for Medical Reasoning (MedGR$^2$), a novel framework that creates a self-improving virtuous cycle. MedGR$^2$ co-develops a data generator and a reward model, enabling the automated, continuous creation of high-quality, multi-modal medical data that serves as a superior training source for both SFT and RL. Our experiments demonstrate that SFT with MedGR$^2$-produced data already surpasses baselines trained on large-scale, human-curated datasets. Crucially, when leveraging this data for RL via Group Relative Policy Optimization (GRPO), our model achieves state-of-the-art cross-modality and cross-task generalization, significantly outperforming specialized RL-based methods. Furthermore, our compact model, empowered by MedGR$^2$, achieves performance competitive with foundation models possessing over 10 times more parameters. MedGR$^2$ presents a new paradigm for data-efficient learning in high-stakes domains, transforming the problem from data scarcity to data generation and unlocking the full potential of RL for building truly generalizable medical AI.
Problem

Research questions and friction points this paper is trying to address.

Addressing scarcity of expert-annotated medical data for VLMs
Overcoming poor generalization in unseen modalities and tasks
Solving lack of reliable reward signals for medical reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative reward learning for medical reasoning
Automated creation of multi-modal medical data
Self-improving cycle with data generator and reward model