🤖 AI Summary
This study investigates the true source of performance gains from reinforcement learning (RL) in medical vision-language models (VLMs). Through controlled experiments on the MedMNIST multimodal benchmark, the work systematically disentangles the effects of visual perception, supervised fine-tuning (SFT), and RL on performance. The findings reveal that, under high-support conditions, RL primarily refines the output distribution rather than enhancing reasoning capabilities. Building on this insight, the authors propose a boundary-aware RL training strategy. Evaluation across six medical visual question answering (VQA) benchmarks demonstrates that this approach significantly improves Accuracy@1 and Pass@K. Notably, RL yields benefits only when the model already possesses a sufficiently high level of initial support, which SFT provides.
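Since the summary and abstract both hinge on the Accuracy@1 vs. Pass@K distinction, here is a minimal sketch of how the two metrics are typically estimated from K sampled answers per question. The `sample_answers` callable is a hypothetical stand-in for the VLM's decoding loop, not an API from the paper.

```python
from typing import Callable, Dict, List

def evaluate_support(
    questions: List[dict],                             # each: {"prompt": str, "answer": str}
    sample_answers: Callable[[str, int], List[str]],   # hypothetical: draws K answers from the VLM
    k: int = 8,
) -> Dict[str, float]:
    """Estimate Accuracy@1 and Pass@K over a VQA set.

    Acc@1: the first sampled answer is correct (a single draw).
    Pass@K: at least one of K sampled answers is correct, i.e. the
    question lies within the model's "support".
    """
    acc1_hits, passk_hits = 0, 0
    for q in questions:
        samples = sample_answers(q["prompt"], k)                # K decoded answers
        acc1_hits += samples[0] == q["answer"]                  # one-shot accuracy
        passk_hits += any(s == q["answer"] for s in samples)    # support under sampling
    n = len(questions)
    return {"acc@1": acc1_hits / n, f"pass@{k}": passk_hits / n}

# Toy usage with a dummy sampler standing in for the model:
dummy = lambda prompt, k: ["B"] + ["A"] * (k - 1)
print(evaluate_support([{"prompt": "…", "answer": "A"}], dummy, k=8))
# -> {'acc@1': 0.0, 'pass@8': 1.0}: the question is supported but not sharpened.
```

A large gap between Pass@K and Acc@1, as in the toy output above, is exactly the regime where the paper finds RL to be effective.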
📝 Abstract
Reinforcement learning (RL) is increasingly used to post-train medical Vision-Language Models (VLMs), yet it remains unclear whether RL improves medical visual reasoning or mainly sharpens behaviors already induced by supervised fine-tuning (SFT). We present a controlled study that disentangles these effects along three axes: vision, SFT, and RL. Using MedMNIST as a multi-modality testbed, we probe visual perception by benchmarking VLM vision towers against vision-only baselines, quantify reasoning support and sampling efficiency via Accuracy@1 versus Pass@K, and evaluate when RL closes the support gap and how gains transfer across modalities. We find that RL is most effective when the model already has non-trivial support (high Pass@K): it primarily sharpens the output distribution, improving Acc@1 and sampling efficiency, while SFT expands support, which in turn makes RL effective. Based on these findings, we propose a boundary-aware recipe and instantiate it by RL post-training an OctoMed-initialized model on a small, balanced subset of PMC multiple-choice VQA, achieving strong average performance across six medical VQA benchmarks.
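The abstract does not spell out the boundary-aware recipe. As one hedged illustration of the idea, the sketch below selects RL training questions that sit near the support boundary: the base model solves them under sampling (a Pass@K hit) but not reliably in a single draw. The threshold and per-question bookkeeping are assumptions for illustration, not the paper's actual procedure.

```python
from typing import Dict, List

def select_boundary_questions(
    correctness: Dict[str, List[bool]],   # question id -> per-sample correctness over K draws
    p_max: float = 0.5,                   # illustrative threshold, not from the paper
) -> List[str]:
    """Keep questions near the support boundary: the base model answers
    them correctly under sampling (Pass@K hit) but not reliably in one
    draw (low estimated single-sample accuracy). Per the paper's
    finding, RL on such questions sharpens the output distribution;
    zero-support questions are dropped, since RL cannot create support."""
    keep = []
    for qid, hits in correctness.items():
        p_hat = sum(hits) / len(hits)     # estimated single-sample accuracy
        if any(hits) and p_hat <= p_max:  # supported, but under-sharpened
            keep.append(qid)
    return keep
```

Under this reading, SFT's role is to move questions into the supported set in the first place, after which a selection rule like the one above concentrates the RL budget where distribution sharpening can pay off.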