🤖 AI Summary
Medical vision-language models (VLMs) hallucinate in chest X-ray analysis, which undermines their clinical reliability. Existing reinforcement learning approaches such as GRPO rely on sparse, outcome-based rewards, which yield verbose, unverifiable reasoning that obscures factual errors. To address this, we propose CheXPO-v2, an alignment framework that replaces outcome supervision with *process supervision*: we structure reasoning into “disease–relation–anatomy” triplets, construct a fine-grained knowledge graph, and introduce an *entity–relation matching consistency reward*. We further incorporate hard-sample mining and two atomic-level constraints: logical coherence and factual accuracy. Evaluated on MIMIC-CXR-VQA, the method reaches state-of-the-art accuracy with only 5K training samples, outperforming GRPO and prior models. It reduces hallucination and redundancy, producing concise, verifiable, and clinically trustworthy chain-of-thought explanations.
📝 Abstract
Medical Vision-Language Models (VLMs) are prone to hallucinations, compromising clinical reliability. While reinforcement learning methods like Group Relative Policy Optimization (GRPO) offer a low-cost alignment solution, their reliance on sparse, outcome-based rewards inadvertently encourages models to "overthink": they generate verbose, convoluted, and unverifiable Chain-of-Thought reasoning to justify answers. This focus on outcomes obscures factual errors and poses significant safety risks. To address this, we propose CheXPO-v2, a novel alignment framework that shifts from outcome to process supervision. Our core innovation is a Knowledge Graph Consistency Reward mechanism driven by Entity-Relation Matching. By explicitly parsing reasoning steps into structured "Disease, Relation, Anatomy" triplets, we provide fine-grained supervision that penalizes incoherent logic and hallucinations at the atomic level. Combined with a hard-example mining strategy, our approach significantly outperforms GRPO and state-of-the-art models on benchmarks such as MIMIC-CXR-VQA. Crucially, CheXPO-v2 achieves new state-of-the-art accuracy using only 5k samples, demonstrating exceptional data efficiency while producing clinically sound and verifiable reasoning. The project source code is publicly available at: https://github.com/ecoxial2007/CheX-Phi4MM.
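To make the idea of triplet-level process supervision concrete, here is a minimal Python sketch of an entity-relation matching reward. It is an illustration under stated assumptions, not the CheXPO-v2 implementation: the abstract does not give the reward formula, so the F1-style overlap score, the exact-match rule after lowercasing, and the names `Triplet`, `normalize`, and `consistency_reward` are all hypothetical. The sketch also assumes the triplets have already been parsed out of the reasoning text.

```python
from typing import Iterable, Tuple

# Hypothetical representation: (disease, relation, anatomy)
Triplet = Tuple[str, str, str]


def normalize(triplet: Triplet) -> Triplet:
    """Lowercase and trim each field so trivial surface variants match."""
    return tuple(part.strip().lower() for part in triplet)


def consistency_reward(predicted: Iterable[Triplet],
                       reference: Iterable[Triplet]) -> float:
    """Assumed reward: F1 overlap between predicted and reference triplets.

    Precision penalizes triplets that are not in the reference graph
    (possible hallucinations); recall penalizes reference findings that
    the reasoning never mentions.
    """
    pred = {normalize(t) for t in predicted}
    ref = {normalize(t) for t in reference}
    matched = len(pred & ref)
    if matched == 0:
        return 0.0
    precision = matched / len(pred)
    recall = matched / len(ref)
    return 2 * precision * recall / (precision + recall)


# Illustrative data only; not taken from MIMIC-CXR-VQA.
predicted = [
    ("Pleural effusion", "located_in", "left lung base"),
    ("Pneumothorax", "located_in", "right apex"),  # unsupported by reference
]
reference = [
    ("pleural effusion", "located_in", "left lung base"),
    ("cardiomegaly", "located_in", "heart"),
]
reward = consistency_reward(predicted, reference)
```

A score like this is dense in the sense that every reasoning step contributes to it, in contrast to an outcome reward that only checks the final answer. A real system would likely need softer matching (synonyms, anatomical hierarchy) than the exact string comparison used here.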