🤖 AI Summary
Medical vision-language models (VLMs) generalize poorly after supervised fine-tuning (SFT) because high-quality expert annotations are scarce, while reinforcement learning (RL) remains impractical owing to the absence of reliable reward signals. To address this, the authors propose MedGR², a generative reward learning framework tailored for medical AI. MedGR² establishes a self-improving closed loop ("generate → evaluate → optimize → regenerate") by jointly training a data generator and a reward model, enabling the continuous, autonomous generation of high-fidelity multimodal medical data. It integrates generative reward modeling, SFT, and Group Relative Policy Optimization (GRPO), enabling efficient training of compact models. Experiments show that SFT on generated data alone already surpasses baselines trained on large-scale human-annotated datasets; adding GRPO on top achieves state-of-the-art performance across cross-modality and cross-task benchmarks, with a compact model matching foundation models over ten times its size in parameter count.
📝 Abstract
The application of Vision-Language Models (VLMs) in medicine is critically hampered by the scarcity of high-quality, expert-annotated data. Supervised Fine-Tuning (SFT) on existing datasets often leads to poor generalization on unseen modalities and tasks, while Reinforcement Learning (RL), a promising alternative, is stymied by the lack of reliable reward signals in this data-scarce domain. To break this impasse, we introduce Generative Reward Learning for Medical Reasoning (MedGR$^2$), a novel framework that creates a self-improving virtuous cycle. MedGR$^2$ co-develops a data generator and a reward model, enabling the automated, continuous creation of high-quality, multi-modal medical data that serves as a superior training source for both SFT and RL. Our experiments demonstrate that SFT with MedGR$^2$-produced data already surpasses baselines trained on large-scale, human-curated datasets. Crucially, when leveraging this data for RL via Group Relative Policy Optimization (GRPO), our model achieves state-of-the-art cross-modality and cross-task generalization, significantly outperforming specialized RL-based methods. Furthermore, our compact model, empowered by MedGR$^2$, achieves performance competitive with foundation models possessing over 10 times more parameters. MedGR$^2$ presents a new paradigm for data-efficient learning in high-stakes domains, transforming the problem from data scarcity to data generation and unlocking the full potential of RL for building truly generalizable medical AI.
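The abstract does not spell out GRPO's mechanics, but its core idea (as introduced in prior GRPO work) is to score a group of sampled responses per prompt and normalize each reward against the group's statistics, removing the need for a separate value network. A minimal sketch of that group-relative advantage step, with illustrative reward values (the function name and numbers are assumptions, not from the paper):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: each sampled response's reward is normalized
    against its group's mean and standard deviation, so responses compete
    within the group rather than against a learned critic."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical reward-model scores for 4 responses sampled from one prompt.
advantages = group_relative_advantages([0.9, 0.4, 0.7, 0.2])
```

In MedGR$^2$, the rewards in this sketch would come from the jointly trained reward model; the policy update then weights each response's log-probabilities by its group-relative advantage.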