🤖 AI Summary
This study investigates the true source of performance gains from reinforcement learning (RL) in medical vision-language models (VLMs). Through controlled experiments on the MedMNIST multimodal benchmark, the work systematically disentangles the effects of visual perception, supervised fine-tuning (SFT), and RL on performance. The findings reveal that, under high-support conditions, RL primarily refines the output distribution rather than enhancing reasoning capabilities. Building on this insight, the authors propose a boundary-aware RL training strategy. Evaluation across six medical visual question answering (VQA) benchmarks demonstrates that this approach significantly improves Accuracy@1 and Pass@K. Notably, RL yields benefits only when the model already possesses a sufficiently high level of initial support, which SFT provides.
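Since the summary and abstract both hinge on the Accuracy@1 vs. Pass@K distinction, here is a minimal sketch of how the two metrics are typically estimated from K sampled answers per question. The `sample_answers` callable is a hypothetical stand-in for the VLM's decoding loop, not an API from the paper.

```python
from typing import Callable, Dict, List

def evaluate_support(
    questions: List[dict],                             # each: {"prompt": str, "answer": str}
    sample_answers: Callable[[str, int], List[str]],   # hypothetical: draws K answers from the VLM
    k: int = 8,
) -> Dict[str, float]:
    """Estimate Accuracy@1 and Pass@K over a VQA set.

    Acc@1: the first sampled answer is correct (a single draw).
    Pass@K: at least one of K sampled answers is correct, i.e. the
    question lies within the model's "support".
    """
    acc1_hits, passk_hits = 0, 0
    for q in questions:
        samples = sample_answers(q["prompt"], k)                # K decoded answers
        acc1_hits += samples[0] == q["answer"]                  # one-shot accuracy
        passk_hits += any(s == q["answer"] for s in samples)    # support under sampling
    n = len(questions)
    return {"acc@1": acc1_hits / n, f"pass@{k}": passk_hits / n}

# Toy usage with a dummy sampler standing in for the model:
dummy = lambda prompt, k: ["B"] + ["A"] * (k - 1)
print(evaluate_support([{"prompt": "…", "answer": "A"}], dummy, k=8))
# -> {'acc@1': 0.0, 'pass@8': 1.0}: the question is supported but not sharpened.
```

A large gap between Pass@K and Acc@1, as in the toy output above, is exactly the regime where the paper finds RL to be effective.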
📝 Abstract
Reinforcement learning (RL) is increasingly used to post-train medical Vision-Language Models (VLMs), yet it remains unclear whether RL improves medical visual reasoning or mainly sharpens behaviors already induced by supervised fine-tuning (SFT). We present a controlled study that disentangles these effects along three axes: vision, SFT, and RL. Using MedMNIST as a multi-modality testbed, we probe visual perception by benchmarking VLM vision towers against vision-only baselines, quantify reasoning support and sampling efficiency via Accuracy@1 versus Pass@K, and evaluate when RL closes the support gap and how gains transfer across modalities. We find that RL is most effective when the model already has non-trivial support (high Pass@K): it primarily sharpens the output distribution, improving Acc@1 and sampling efficiency, while SFT expands support, which in turn makes RL effective. Based on these findings, we propose a boundary-aware recipe and instantiate it by RL post-training an OctoMed-initialized model on a small, balanced subset of PMC multiple-choice VQA, achieving strong average performance across six medical VQA benchmarks.
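The abstract does not spell out the boundary-aware recipe. As one hedged illustration of the idea, the sketch below selects RL training questions that sit near the support boundary: the base model solves them under sampling (a Pass@K hit) but not reliably in a single draw. The threshold and per-question bookkeeping are assumptions for illustration, not the paper's actual procedure.

```python
from typing import Dict, List

def select_boundary_questions(
    correctness: Dict[str, List[bool]],   # question id -> per-sample correctness over K draws
    p_max: float = 0.5,                   # illustrative threshold, not from the paper
) -> List[str]:
    """Keep questions near the support boundary: the base model answers
    them correctly under sampling (Pass@K hit) but not reliably in one
    draw (low estimated single-sample accuracy). Per the paper's
    finding, RL on such questions sharpens the output distribution;
    zero-support questions are dropped, since RL cannot create support."""
    keep = []
    for qid, hits in correctness.items():
        p_hat = sum(hits) / len(hits)     # estimated single-sample accuracy
        if any(hits) and p_hat <= p_max:  # supported, but under-sharpened
            keep.append(qid)
    return keep
```

Under this reading, SFT's role is to move questions into the supported set in the first place, after which a selection rule like the one above concentrates the RL budget where distribution sharpening can pay off.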