AI Summary
Current medical vision-language models (VLMs) for radiology tasks produce only final answers without interpretable reasoning, undermining clinical trust and regulatory adoption. To address this, we propose a reinforcement learning-driven, explainable medical VLM framework. Our method introduces reference-free, self-supervised reasoning reward modeling, eliminating the need for human-annotated rationale chains, combined with multimodal alignment fine-tuning and domain-adaptive reasoning strategies. Remarkably, it achieves state-of-the-art performance using only 600 training samples, outperforming large models trained on million-scale datasets. Built upon a 2-billion-parameter VLM architecture, our approach elevates accuracy on cross-modal radiological visual question answering (VQA) across MRI, CT, and X-ray from 55.11% to 78.22%, while significantly improving out-of-distribution generalization. The core contribution is the first demonstration of natural language reasoning generation in medical VLMs without manual rationale annotation, achieving a principled balance among efficiency, interpretability, and clinical utility.
Abstract
Reasoning is a critical frontier for advancing medical image analysis, where transparency and trustworthiness play a central role in both clinician trust and regulatory approval. Although Medical Visual Language Models (VLMs) show promise for radiological tasks, most existing VLMs merely produce final answers without revealing the underlying reasoning. To address this gap, we introduce MedVLM-R1, a medical VLM that explicitly generates natural language reasoning to enhance transparency and trustworthiness. Instead of relying on supervised fine-tuning (SFT), which often suffers from overfitting to training distributions and fails to foster genuine reasoning, MedVLM-R1 employs a reinforcement learning framework that incentivizes the model to discover human-interpretable reasoning paths without using any reasoning references. Despite limited training data (600 visual question answering samples) and model parameters (2B), MedVLM-R1 boosts accuracy from 55.11% to 78.22% across MRI, CT, and X-ray benchmarks, outperforming larger models trained on over a million samples. It also demonstrates robust domain generalization under out-of-distribution tasks. By unifying medical image analysis with explicit reasoning, MedVLM-R1 marks a pivotal step toward trustworthy and interpretable AI in clinical practice.
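To make the "reference-free" reward idea concrete, the sketch below shows one plausible rule-based reward for VQA completions, in the style of R1-like reinforcement learning: a format reward that checks whether the model emits its reasoning and answer in distinct tags, plus an accuracy reward that compares only the final choice against the ground truth. The `<think>`/`<answer>` tag names, the 0/1 reward values, and the helper functions are illustrative assumptions, not the paper's exact implementation; the point is that no reference reasoning chain is ever scored.

```python
import re

def format_reward(completion: str) -> float:
    """1.0 if the output wraps reasoning and answer in the expected tags.

    Note: the <think>/<answer> tag convention is an assumption for illustration.
    """
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, completion.strip(), re.DOTALL) else 0.0

def accuracy_reward(completion: str, correct_choice: str) -> float:
    """1.0 if the extracted final answer matches the ground-truth choice letter.

    Only the answer is checked; the reasoning text itself is never compared
    against any reference, which is what makes the reward reference-free.
    """
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip().upper().startswith(correct_choice.upper()) else 0.0

def total_reward(completion: str, correct_choice: str) -> float:
    """Combined scalar reward used to rank sampled completions in RL."""
    return format_reward(completion) + accuracy_reward(completion, correct_choice)
```

Because both signals are computed by simple rules over the model's own outputs, the policy can be optimized with a group-based RL objective without any human-written rationales in the training data.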