🤖 AI Summary
In medical visual question answering (Med-VQA), general-purpose vision-language models (VLMs) suffer from two critical bottlenecks: misalignment between perception and reasoning, and inconsistency between reasoning and the final answer, both exacerbated by the scarcity of high-quality, diverse medical data. To address these challenges, we propose Consistency-Aware Preference Optimization (CAPO), a novel reinforcement learning (RL) framework. We introduce Med-Zero-17K, the first medical dataset curated for pure RL training, comprising 17K samples spanning 30+ imaging modalities and 24 clinical tasks. CAPO employs a multi-stage consistency reward that jointly optimizes perception-reasoning alignment, reasoning-answer coherence, and rule-based accuracy verification. Extensive experiments demonstrate that CAPO significantly outperforms strong baselines on both in-domain and out-of-domain Med-VQA benchmarks. Notably, it is the first to generalize effectively to 3D Med-VQA evaluation, while remaining fully compatible with R1-style training paradigms.
📝 Abstract
In medical visual question answering (Med-VQA), achieving accurate responses relies on three critical steps: precise perception of medical imaging data, logical reasoning grounded in visual input and textual questions, and coherent answer derivation from the reasoning process. Recent advances in general vision-language models (VLMs) show that large-scale reinforcement learning (RL) can significantly enhance both reasoning capabilities and overall model performance. However, their application in medical domains is hindered by two fundamental challenges: 1) misalignment between perceptual understanding and reasoning stages, and 2) inconsistency between reasoning pathways and answer generation, both compounded by the scarcity of high-quality medical datasets for effective large-scale RL. In this paper, we first introduce Med-Zero-17K, a curated dataset for pure RL-based training, encompassing over 30 medical image modalities and 24 clinical tasks. Moreover, we propose a novel large-scale RL framework for Med-VLMs, Consistency-Aware Preference Optimization (CAPO), which integrates rewards to ensure fidelity between perception and reasoning, consistency in reasoning-to-answer derivation, and rule-based accuracy for final responses. Extensive experiments on both in-domain and out-of-domain scenarios demonstrate the superiority of our method over strong VLM baselines, showcasing strong generalization to 3D Med-VQA benchmarks and compatibility with R1-like training paradigms.
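The abstract describes a reward that combines three terms: perception-reasoning fidelity, reasoning-to-answer consistency, and rule-based accuracy of the final response. A minimal sketch of such a composite reward is shown below; the function names, the weights, and the token-overlap heuristics are illustrative assumptions for exposition, not the paper's actual reward implementation.

```python
def _overlap(a: str, b: str) -> float:
    """Toy proxy for consistency: fraction of tokens in `a` that also
    appear in `b`. A real system would use a learned or rule-based scorer."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta) if ta else 0.0

def capo_style_reward(perception: str, reasoning: str, answer: str,
                      gold_answer: str,
                      w_align: float = 0.3,   # assumed weight: perception-reasoning alignment
                      w_cons: float = 0.3,    # assumed weight: reasoning-answer consistency
                      w_acc: float = 0.4      # assumed weight: rule-based accuracy
                      ) -> float:
    """Combine three reward terms in the spirit of the abstract:
    1) alignment between the perception description and the reasoning trace,
    2) consistency between the reasoning trace and the emitted answer,
    3) rule-based (exact-match) accuracy against the reference answer."""
    r_align = _overlap(perception, reasoning)
    r_cons = _overlap(answer, reasoning)
    r_acc = 1.0 if answer.strip().lower() == gold_answer.strip().lower() else 0.0
    return w_align * r_align + w_cons * r_cons + w_acc * r_acc
```

With the weights summing to 1, the reward stays in [0, 1]; a policy that answers correctly but contradicts its own reasoning trace is penalized relative to one whose answer is entailed by the trace, which is the consistency pressure the abstract motivates.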