🤖 AI Summary
Medical vision-language models (VLMs) face practical deployment bottlenecks, including poor generalization, opaque reasoning, and high computational overhead.
Method: We propose RARL, a reasoning-aware reinforcement learning framework that jointly optimizes diagnostic accuracy and reasoning quality through custom reward functions. RARL combines Low-Rank Adaptation (LoRA) with reinforcement learning to fine-tune a medical VLM efficiently, and relies on diverse prompts during training and reasoning prompts at inference to improve performance. The lightweight implementation, built on Qwen2-VL-2B-Instruct, is trained on a single NVIDIA A100-PCIE-40GB GPU.
Results: RARL outperforms supervised fine-tuning (SFT) on reasoning-focused tasks by approximately 7.78%. On unseen datasets it improves performance by around 27% over SFT and about 4% over conventional RL fine-tuning. An LLM-as-judge evaluation scores both answer correctness and explanation quality.
📝 Abstract
The growing integration of vision-language models (VLMs) in medical applications offers promising support for diagnostic reasoning. However, current medical VLMs often face limitations in generalization, transparency, and computational efficiency, barriers that hinder deployment in real-world, resource-constrained settings. To address these challenges, we propose a Reasoning-Aware Reinforcement Learning framework, RARL, that enhances the reasoning capabilities of medical VLMs while remaining efficient and adaptable to low-resource environments. Our approach fine-tunes a lightweight base model, Qwen2-VL-2B-Instruct, using Low-Rank Adaptation and custom reward functions that jointly consider diagnostic accuracy and reasoning quality. Training is performed on a single NVIDIA A100-PCIE-40GB GPU, demonstrating the feasibility of deploying such models in constrained environments. We evaluate the model using an LLM-as-judge framework that scores both correctness and explanation quality. Experimental results show that RARL significantly improves VLM performance in medical image analysis and clinical reasoning, outperforming supervised fine-tuning on reasoning-focused tasks by approximately 7.78%, while requiring fewer computational resources. Additionally, we demonstrate the generalization capabilities of our approach on unseen datasets, achieving around 27% improved performance compared to supervised fine-tuning and about 4% over traditional RL fine-tuning. Our experiments also illustrate that diversity prompting during training and reasoning prompting during inference are crucial for enhancing VLM performance. Our findings highlight the potential of reasoning-guided learning and reasoning prompting to steer medical VLMs toward more transparent, accurate, and resource-efficient clinical decision-making. Code and data are publicly available.
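
To make the idea of a reward that "jointly considers diagnostic accuracy and reasoning quality" more concrete, the following is a minimal sketch and not the paper's actual reward. The abstract does not specify the output format, the scoring rules, or the weights, so everything here is an illustrative assumption: the `<think>`/`<answer>` tags, the word-count proxy for reasoning quality, the 0.7/0.3 weights, and all function names.

```python
import re

# Hypothetical output format: <think>reasoning</think><answer>choice</answer>.
# The tag names are an assumption; the abstract does not specify a format.
PATTERN = re.compile(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL)


def accuracy_reward(answer: str, gold: str) -> float:
    """Return 1.0 if the predicted answer matches the reference, else 0.0."""
    return 1.0 if answer.strip().lower() == gold.strip().lower() else 0.0


def reasoning_reward(reasoning: str, min_words: int = 5,
                     target_words: int = 40) -> float:
    """Crude stand-in for a reasoning-quality score in [0, 1].

    A real system would likely use a stronger scorer; this placeholder
    only grows linearly with word count up to target_words.
    """
    n = len(reasoning.split())
    if n < min_words:
        return 0.0
    return min(n / target_words, 1.0)


def combined_reward(completion: str, gold: str,
                    w_acc: float = 0.7, w_reason: float = 0.3) -> float:
    """Weighted sum of both scores; 0.0 if the assumed format is missing."""
    match = PATTERN.search(completion)
    if match is None:
        return 0.0
    reasoning, answer = match.group(1), match.group(2)
    return (w_acc * accuracy_reward(answer, gold)
            + w_reason * reasoning_reward(reasoning))
```

Under these assumptions, a completion with a correct answer but thin reasoning earns less than one with a correct answer and a fuller explanation, which is the kind of pressure a reasoning-aware reward is meant to apply. The weights control how much a correct answer with weak reasoning is still rewarded.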