🤖 AI Summary
Medical visual language models (VLMs) suffer from limited generalization, poor interpretability, and insufficient clinical traceability and regulatory compliance in multimodal medical imaging reasoning. To address these challenges, we propose the first reinforcement learning–enhanced training framework for medical VLMs incorporating Group Relative Policy Optimization (GRPO). Built upon the lightweight Qwen2-VL-2B architecture, our method achieves state-of-the-art performance across eight medical imaging modalities—including CT, MRI, and ultrasound—and five distinct reasoning tasks. It improves accuracy by 29.94% and cross-task generalization by 32.06% over baselines, while substantially outperforming the much larger Qwen2-VL-72B despite using only 1/36 of its parameters. Crucially, the model generates structured, auditable reasoning paths, thereby enhancing both predictive performance and clinical trustworthiness—directly supporting regulatory requirements for transparency and traceability in AI-assisted diagnosis.
📝 Abstract
Vision-language models (VLMs) have advanced reasoning in natural scenes, but their role in medical imaging remains underexplored. Medical reasoning tasks demand robust image analysis and well-justified answers, posing challenges due to the complexity of medical images. Transparency and trustworthiness are essential for clinical adoption and regulatory compliance. We introduce Med-R1, a framework exploring reinforcement learning (RL) to enhance VLMs' generalizability and trustworthiness in medical reasoning. Leveraging the DeepSeek strategy, we employ Group Relative Policy Optimization (GRPO) to guide reasoning paths via reward signals. Unlike supervised fine-tuning (SFT), which often overfits and lacks generalization, RL fosters robust and diverse reasoning. Med-R1 is evaluated across eight medical imaging modalities: CT, MRI, Ultrasound, Dermoscopy, Fundus Photography, Optical Coherence Tomography (OCT), Microscopy, and X-ray Imaging. Compared to its base model, Qwen2-VL-2B, Med-R1 achieves a 29.94% accuracy improvement and outperforms Qwen2-VL-72B, which has 36 times more parameters. Tested across five question types (modality recognition, anatomy identification, disease diagnosis, lesion grading, and biological attribute analysis), Med-R1 demonstrates superior generalization, exceeding Qwen2-VL-2B by 32.06% and surpassing Qwen2-VL-72B in question-type generalization. These findings show that RL improves medical reasoning and enables parameter-efficient models to outperform significantly larger ones. With interpretable reasoning outputs, Med-R1 represents a promising step toward generalizable, trustworthy, and clinically viable medical VLMs.
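The core idea behind GRPO, as used here, is to score each sampled reasoning path relative to the other samples in its group, removing the need for a learned value critic. A minimal sketch of that group-relative advantage computation is below; `grpo_advantages` is an illustrative helper, not the authors' implementation, and the scalar rewards are assumed to come from task-specific reward signals such as answer correctness.

```python
def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: normalize each sampled response's
    reward by the mean and std of its own sampling group.
    Illustrative sketch of the GRPO idea, not the Med-R1 code."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: rewards for four sampled answers to one question
# (1.0 = correct, 0.0 = incorrect)
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

Responses scoring above the group mean receive positive advantages (and are reinforced), while below-average responses receive negative ones, so the policy is pushed toward higher-reward reasoning paths without a separate critic network.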