When Does RL Help Medical VLMs? Disentangling Vision, SFT, and RL Gains

📅 2026-03-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the true source of performance gains from reinforcement learning (RL) in medical vision-language models (VLMs). Through controlled experiments on the MedMNIST multimodal benchmark, the work systematically disentangles the contributions of visual perception, supervised fine-tuning (SFT) support, and RL-based output-distribution optimization. The findings reveal that RL primarily sharpens the output distribution under high-support conditions rather than enhancing reasoning capabilities. Building on this insight, the authors propose a boundary-aware RL training strategy. Evaluation across six medical visual question answering (VQA) benchmarks demonstrates that this approach significantly improves Accuracy@1 and sampling efficiency. Notably, RL yields benefits only when the model already possesses sufficiently high support established by SFT.

📝 Abstract
Reinforcement learning (RL) is increasingly used to post-train medical Vision-Language Models (VLMs), yet it remains unclear whether RL improves medical visual reasoning or mainly sharpens behaviors already induced by supervised fine-tuning (SFT). We present a controlled study that disentangles these effects along three axes: vision, SFT, and RL. Using MedMNIST as a multi-modality testbed, we probe visual perception by benchmarking VLM vision towers against vision-only baselines, quantify reasoning support and sampling efficiency via Accuracy@1 versus Pass@K, and evaluate when RL closes the support gap and how gains transfer across modalities. We find that RL is most effective when the model already has non-trivial support (high Pass@K): it primarily sharpens the output distribution, improving Acc@1 and sampling efficiency, while SFT expands support and makes RL effective. Based on these findings, we propose a boundary-aware recipe and instantiate it by RL post-training an OctoMed-initialized model on a small, balanced subset of PMC multiple-choice VQA, achieving strong average performance across six medical VQA benchmarks.
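The abstract contrasts Accuracy@1 (is the first sampled answer correct?) with Pass@K (is any of K sampled answers correct?, used here as a proxy for "support"). A minimal sketch of how these two metrics can be computed from per-question correctness records; the function names and data layout are illustrative assumptions, not taken from the paper:

```python
def acc_at_1(samples_per_q):
    """Fraction of questions whose first sampled answer is correct.

    samples_per_q: list of lists of booleans, one inner list per question,
    each entry marking whether that sampled answer was correct.
    """
    return sum(s[0] for s in samples_per_q) / len(samples_per_q)


def pass_at_k(samples_per_q, k):
    """Fraction of questions with at least one correct answer among the
    first k samples (a simple empirical Pass@K)."""
    return sum(any(s[:k]) for s in samples_per_q) / len(samples_per_q)
```

In the paper's framing, a model with high Pass@K but low Accuracy@1 has non-trivial support that RL can sharpen into higher first-sample accuracy, whereas a model with low Pass@K gives RL little to work with, which is where SFT is needed to expand support first.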
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning
Medical Vision-Language Models
Supervised Fine-Tuning
Visual Reasoning
Post-training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement Learning
Supervised Fine-Tuning
Medical Vision-Language Models
Visual Reasoning
Sampling Efficiency
Ahmadreza Jeddi
University of Toronto, Canada; Vector Institute, Canada; KITE Research Institute, University Health Network, Canada
Kimia Shaban
University of Toronto, Canada; Vector Institute, Canada; KITE Research Institute, University Health Network, Canada
Negin Baghbanzadeh
Graduate Student, York University, Vector Institute
computer vision, multimodal representation learning
Natasha Sharan
University of Toronto, Canada
Abhishek Moturu
University of Toronto, Canada; Vector Institute, Canada; KITE Research Institute, University Health Network, Canada
Elham Dolatabadi
York University; Vector Institute; University of Toronto
Artificial Intelligence, machine learning, HealthCare, Data Science
Babak Taati
KITE Research Institute | Toronto Rehab - UHN & Department of Computer Science, University of Toronto
Computer Vision, Health Monitoring, Ambient Intelligence