Benchmark Success, Clinical Failure: When Reinforcement Learning Optimizes for Benchmarks, Not Patients

📅 2025-12-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work exposes the “benchmark success, clinical failure” paradox of reinforcement learning (RL) in medical vision-language models (VLMs): RL methods such as GRPO lift macro-F1 on CheXpert to 0.346 (a 23% in-distribution gain) but degrade performance on the NIH ChestX-ray dataset by 19%, severely compromising cross-institutional generalization. The paper provides the first systematic empirical evidence that RL optimization induces generalization degradation in medical VLMs, and shows that supervised fine-tuning (SFT) checkpoints inherently exhibit superior cross-dataset robustness. Using an R1-style training paradigm and the ChexReason model, the authors run efficient experiments on a single A100 GPU, finding that structured reasoning scaffolds help general-purpose VLMs but add little on top of medical pretraining. The core contribution is a practical guideline for clinical deployment: favor lightweight SFT over aggressive RL, yielding a transferable, trustworthy, and resource-efficient optimization pathway for medical AI under computational constraints.

📝 Abstract
Recent Reinforcement Learning (RL) advances for Large Language Models (LLMs) have improved reasoning tasks, yet their resource-constrained application to medical imaging remains underexplored. We introduce ChexReason, a vision-language model trained via R1-style methodology (SFT followed by GRPO) using only 2,000 SFT samples, 1,000 RL samples, and a single A100 GPU. Evaluations on CheXpert and NIH benchmarks reveal a fundamental tension: GRPO recovers in-distribution performance (23% improvement on CheXpert, macro-F1 = 0.346) but degrades cross-dataset transferability (19% drop on NIH). This mirrors high-resource models like NV-Reason-CXR-3B, suggesting the issue stems from the RL paradigm rather than scale. We identify a generalization paradox where the SFT checkpoint uniquely improves on NIH before optimization, indicating teacher-guided reasoning captures more institution-agnostic features. Furthermore, cross-model comparisons show structured reasoning scaffolds benefit general-purpose VLMs but offer minimal gain for medically pre-trained models. Consequently, curated supervised fine-tuning may outperform aggressive RL for clinical deployment requiring robustness across diverse populations.
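The headline numbers above are macro-averaged F1 scores, which weight every finding class equally so that rare pathologies count as much as common ones. A minimal stdlib-only sketch of the metric (toy labels, not the paper's data):

```python
from typing import Sequence


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int], labels: Sequence[int]) -> float:
    """Macro-averaged F1: compute per-class F1, then take the unweighted mean,
    so each class contributes equally regardless of its frequency."""
    f1_scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical 3-class example; in the paper the classes are CheXpert findings.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(round(macro_f1(y_true, y_pred, labels=[0, 1, 2]), 3))  # → 0.656
```

Because the average is unweighted, a model that ignores rare findings is penalized even if its overall accuracy is high, which is why macro-F1 is the metric of choice for imbalanced chest X-ray labels.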
Problem

Research questions and friction points this paper is trying to address.

Reinforcement learning improves benchmarks but reduces cross-dataset generalization in medical imaging.
There is a tension between in-distribution performance and transferability to diverse clinical populations.
Supervised fine-tuning may be more robust than aggressive RL for clinical deployment.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-language model trained with R1-style methodology (SFT followed by GRPO)
Uses only 2,000 SFT samples and 1,000 RL samples
Trains on a single A100 GPU for efficiency
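The GRPO stage of the R1-style recipe avoids a learned value critic by standardizing each sampled completion's reward against the group of completions drawn for the same prompt. A minimal sketch of that group-relative advantage computation (reward values illustrative, not from the paper):

```python
import statistics


def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style advantages: standardize each reward against the mean and
    standard deviation of its own sampling group, so no value model is needed."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled reports for one prompt, scored by a rule-based reward
# (e.g. output format plus label match, as in R1-style training).
rewards = [1.0, 0.5, 0.0, 0.5]
print([round(a, 2) for a in group_advantages(rewards)])  # → [1.41, 0.0, -1.41, 0.0]
```

Completions above the group mean get positive advantage and are reinforced; those below are suppressed. Since the baseline comes from the group itself, the method stays cheap enough for the single-GPU budget described above.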
👥 Authors
Armin Berger
Data Scientist, Fraunhofer IAIS & PhD Fellow in Machine Learning, University of Bonn
Machine Learning, Natural Language Processing
Manuela Bergau
Fraunhofer IAIS, Germany; University of Bonn, Germany; Lamarr Institute, Germany
Helen Schneider
Fraunhofer IAIS, Germany
Saad Ahmad
Fraunhofer IAIS, Germany
Tom Anglim Lagones
Department of Health Queensland, Australia; Griffith University, Australia
Gianluca Brugnara
University Hospital Bonn, Germany
Martha Foltyn-Dumitru
University Hospital Bonn, Germany
Kai Schlamp
University Hospital Bonn, Germany
Philipp Vollmuth
Prof. for AI in Medical Imaging | Division Head Computational Radiology & Clinical AI | CCIBonn.ai
Machine Learning, Deep Learning, Healthcare, Radiology
Rafet Sifa
Professor of Machine Learning at University of Bonn and Fraunhofer IAIS
Machine Learning, Representation Learning, Game Analytics, Text Mining, Medical Informatics