Benchmark Success, Clinical Failure: When Reinforcement Learning Optimizes for Benchmarks, Not Patients

📅 2025-12-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work exposes the “benchmark success, clinical failure” paradox of reinforcement learning (RL) in medical vision-language models (VLMs): RL methods such as GRPO lift macro-F1 on CheXpert to 0.346 (a 23% in-distribution gain) but degrade performance on the NIH ChestX-ray dataset by 19%, severely compromising cross-institutional generalization. The paper provides the first systematic empirical evidence that RL optimization induces generalization degradation in medical VLMs, and shows that supervised fine-tuning (SFT) checkpoints inherently exhibit superior cross-dataset robustness. Using an R1-style training paradigm and the ChexReason model, the authors run efficient experiments on a single A100 GPU, finding that structured reasoning scaffolds help general-purpose VLMs but add little on top of medical pretraining. The core contribution is a practical guideline for clinical deployment: favor lightweight SFT over aggressive RL, yielding a transferable, trustworthy, and resource-efficient optimization pathway for medical AI under computational constraints.

📝 Abstract
Recent Reinforcement Learning (RL) advances for Large Language Models (LLMs) have improved reasoning tasks, yet their resource-constrained application to medical imaging remains underexplored. We introduce ChexReason, a vision-language model trained via R1-style methodology (SFT followed by GRPO) using only 2,000 SFT samples, 1,000 RL samples, and a single A100 GPU. Evaluations on CheXpert and NIH benchmarks reveal a fundamental tension: GRPO recovers in-distribution performance (23% improvement on CheXpert, macro-F1 = 0.346) but degrades cross-dataset transferability (19% drop on NIH). This mirrors high-resource models like NV-Reason-CXR-3B, suggesting the issue stems from the RL paradigm rather than scale. We identify a generalization paradox where the SFT checkpoint uniquely improves on NIH before optimization, indicating teacher-guided reasoning captures more institution-agnostic features. Furthermore, cross-model comparisons show structured reasoning scaffolds benefit general-purpose VLMs but offer minimal gain for medically pre-trained models. Consequently, curated supervised fine-tuning may outperform aggressive RL for clinical deployment requiring robustness across diverse populations.
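The headline numbers above are macro-averaged F1 scores, which weight every finding class equally so that rare pathologies count as much as common ones. A minimal stdlib-only sketch of the metric (toy labels, not the paper's data):

```python
from typing import Sequence


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int], labels: Sequence[int]) -> float:
    """Macro-averaged F1: compute per-class F1, then take the unweighted mean,
    so each class contributes equally regardless of its frequency."""
    f1_scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical 3-class example; in the paper the classes are CheXpert findings.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(round(macro_f1(y_true, y_pred, labels=[0, 1, 2]), 3))  # → 0.656
```

Because the average is unweighted, a model that ignores rare findings is penalized even if its overall accuracy is high, which is why macro-F1 is the metric of choice for imbalanced chest X-ray labels.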
Problem

Research questions and friction points this paper is trying to address.

Reinforcement learning improves benchmarks but reduces cross-dataset generalization in medical imaging.
There is a tension between in-distribution performance and transferability to diverse clinical populations.
Supervised fine-tuning may be more robust than aggressive RL for clinical deployment.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-language model trained with R1-style methodology (SFT followed by GRPO)
Uses only 2,000 SFT samples and 1,000 RL samples
Trains on a single A100 GPU for efficiency
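The GRPO stage of the R1-style recipe avoids a learned value critic by standardizing each sampled completion's reward against the group of completions drawn for the same prompt. A minimal sketch of that group-relative advantage computation (reward values illustrative, not from the paper):

```python
import statistics


def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style advantages: standardize each reward against the mean and
    standard deviation of its own sampling group, so no value model is needed."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled reports for one prompt, scored by a rule-based reward
# (e.g. output format plus label match, as in R1-style training).
rewards = [1.0, 0.5, 0.0, 0.5]
print([round(a, 2) for a in group_advantages(rewards)])  # → [1.41, 0.0, -1.41, 0.0]
```

Completions above the group mean get positive advantage and are reinforced; those below are suppressed. Since the baseline comes from the group itself, the method stays cheap enough for the single-GPU budget described above.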
👥 Authors
Armin Berger
Data Scientist, Fraunhofer IAIS & PhD Fellow in Machine Learning, University of Bonn
Machine Learning, Natural Language Processing
Manuela Bergau
Fraunhofer IAIS, Germany; University of Bonn, Germany; Lamarr Institute, Germany
Helen Schneider
Fraunhofer IAIS, Germany
Saad Ahmad
Fraunhofer IAIS, Germany
Tom Anglim Lagones
Department of Health Queensland, Australia; Griffith University, Australia
Gianluca Brugnara
University Hospital Bonn, Germany
Martha Foltyn-Dumitru
University Hospital Bonn, Germany
Kai Schlamp
University Hospital Bonn, Germany
Philipp Vollmuth
Prof. for AI in Medical Imaging | Division Head Computational Radiology & Clinical AI | CCIBonn.ai
Machine Learning, Deep Learning, Healthcare, Radiology
Rafet Sifa
Professor of Machine Learning at University of Bonn and Fraunhofer IAIS
Machine Learning, Representation Learning, Game Analytics, Text Mining, Medical Informatics