🤖 AI Summary
In medical visual question answering (Med-VQA), general-purpose vision-language models (VLMs) suffer from two critical bottlenecks: misalignment between perception and reasoning, and inconsistency between reasoning and the final answer, both exacerbated by the scarcity of high-quality, diverse medical data. To address these challenges, we propose Consistency-Aware Preference Optimization (CAPO), a novel reinforcement learning (RL) framework. We introduce Med-Zero-17K, the first medical dataset curated for pure RL training, comprising 17K samples spanning 30+ imaging modalities and 24 clinical tasks. CAPO employs a multi-stage consistency reward that jointly optimizes perception-reasoning alignment, reasoning-answer coherence, and rule-based accuracy verification. Extensive experiments demonstrate that CAPO significantly outperforms strong baselines on both in-domain and out-of-domain Med-VQA benchmarks. Notably, it is the first to generalize effectively to 3D Med-VQA evaluation, while remaining fully compatible with R1-style training paradigms.
📝 Abstract
In medical visual question answering (Med-VQA), achieving accurate responses relies on three critical steps: precise perception of medical imaging data, logical reasoning grounded in visual input and textual questions, and coherent answer derivation from the reasoning process. Recent advances in general vision-language models (VLMs) show that large-scale reinforcement learning (RL) can significantly enhance both reasoning capabilities and overall model performance. However, their application in medical domains is hindered by two fundamental challenges: 1) misalignment between perceptual understanding and reasoning stages, and 2) inconsistency between reasoning pathways and answer generation, both compounded by the scarcity of high-quality medical datasets for effective large-scale RL. In this paper, we first introduce Med-Zero-17K, a curated dataset for pure RL-based training, encompassing over 30 medical image modalities and 24 clinical tasks. Moreover, we propose a novel large-scale RL framework for Med-VLMs, Consistency-Aware Preference Optimization (CAPO), which integrates rewards to ensure fidelity between perception and reasoning, consistency in reasoning-to-answer derivation, and rule-based accuracy for final responses. Extensive experiments on both in-domain and out-of-domain scenarios demonstrate the superiority of our method over strong VLM baselines, showcasing strong generalization to 3D Med-VQA benchmarks and compatibility with R1-like training paradigms.
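The abstract describes a reward that combines three terms: perception-reasoning fidelity, reasoning-to-answer consistency, and rule-based accuracy of the final response. A minimal sketch of such a composite reward is shown below; the function names, the weights, and the token-overlap heuristics are illustrative assumptions for exposition, not the paper's actual reward implementation.

```python
def _overlap(a: str, b: str) -> float:
    """Toy proxy for consistency: fraction of tokens in `a` that also
    appear in `b`. A real system would use a learned or rule-based scorer."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta) if ta else 0.0

def capo_style_reward(perception: str, reasoning: str, answer: str,
                      gold_answer: str,
                      w_align: float = 0.3,   # assumed weight: perception-reasoning alignment
                      w_cons: float = 0.3,    # assumed weight: reasoning-answer consistency
                      w_acc: float = 0.4      # assumed weight: rule-based accuracy
                      ) -> float:
    """Combine three reward terms in the spirit of the abstract:
    1) alignment between the perception description and the reasoning trace,
    2) consistency between the reasoning trace and the emitted answer,
    3) rule-based (exact-match) accuracy against the reference answer."""
    r_align = _overlap(perception, reasoning)
    r_cons = _overlap(answer, reasoning)
    r_acc = 1.0 if answer.strip().lower() == gold_answer.strip().lower() else 0.0
    return w_align * r_align + w_cons * r_cons + w_acc * r_acc
```

With the weights summing to 1, the reward stays in [0, 1]; a policy that answers correctly but contradicts its own reasoning trace is penalized relative to one whose answer is entailed by the trace, which is the consistency pressure the abstract motivates.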