🤖 AI Summary
This work exposes the “benchmark success, clinical failure” paradox of reinforcement learning (RL) in medical vision-language models (VLMs): while RL methods such as GRPO raise macro-F1 on CheXpert to 0.346 (a 23% improvement), they degrade performance on the NIH ChestX-ray dataset by 19%, severely compromising cross-institutional generalization. We provide the first systematic empirical evidence that RL optimization induces generalization degradation in medical VLMs, and we find that supervised fine-tuning (SFT) checkpoints inherently exhibit superior cross-dataset robustness. Using an R1-style training paradigm and the ChexReason model, we run all experiments efficiently on a single A100 GPU and confirm that structured reasoning yields limited gains over standard medical pretraining. Our core contribution is a practical guideline for clinical deployment: favor lightweight SFT over aggressive RL, which offers a transferable, trustworthy, and resource-efficient optimization pathway for medical AI under computational constraints.
📝 Abstract
Recent Reinforcement Learning (RL) advances for Large Language Models (LLMs) have improved reasoning tasks, yet their application to medical imaging under resource constraints remains underexplored. We introduce ChexReason, a vision-language model trained via R1-style methodology (SFT followed by GRPO) using only 2,000 SFT samples, 1,000 RL samples, and a single A100 GPU. Evaluations on the CheXpert and NIH benchmarks reveal a fundamental tension: GRPO recovers in-distribution performance (a 23% improvement on CheXpert, macro-F1 = 0.346) but degrades cross-dataset transferability (a 19% drop on NIH). This pattern mirrors high-resource models such as NV-Reason-CXR-3B, suggesting the issue stems from the RL paradigm rather than model scale. We identify a generalization paradox: the SFT checkpoint uniquely improves on NIH before RL optimization, indicating that teacher-guided reasoning captures more institution-agnostic features. Furthermore, cross-model comparisons show that structured reasoning scaffolds benefit general-purpose VLMs but offer minimal gain for medically pre-trained models. Consequently, curated supervised fine-tuning may outperform aggressive RL for clinical deployments that require robustness across diverse patient populations.
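Since the headline numbers above are macro-F1 scores on multi-label chest X-ray classification, here is a minimal sketch of how macro-F1 is computed: an unweighted average of per-finding F1 scores, so rare findings count as much as common ones. The label names and predictions below are illustrative toy data, not results from the paper.

```python
# Macro-F1: compute F1 independently per finding, then average unweighted.

def f1(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred, n_labels):
    """y_true / y_pred: lists of binary label vectors (multi-label setting)."""
    scores = []
    for k in range(n_labels):
        tp = sum(t[k] and p[k] for t, p in zip(y_true, y_pred))
        fp = sum((not t[k]) and p[k] for t, p in zip(y_true, y_pred))
        fn = sum(t[k] and (not p[k]) for t, p in zip(y_true, y_pred))
        scores.append(f1(tp, fp, fn))
    return sum(scores) / n_labels

# Toy example with 3 hypothetical findings (e.g., Cardiomegaly, Edema, Effusion).
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0]]
print(round(macro_f1(y_true, y_pred, 3), 3))  # → 0.556
```

Because the average is unweighted across labels, a model that collapses on out-of-distribution findings (as reported for the post-GRPO checkpoint on NIH) is penalized heavily even if common findings remain accurate.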