MMedPO: Aligning Medical Vision-Language Models with Clinical-Aware Multimodal Preference Optimization

📅 2024-12-09
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Medical large vision-language models (Med-LVLMs) suffer from modality misalignment, leading to clinically implausible hallucinations that contradict the input images; existing preference optimization methods neglect clinical relevance, yielding easily distinguishable negative samples and suboptimal alignment. To address this, the authors propose MMedPO, a clinically aware multimodal preference optimization framework: (1) a clinical-relevance weighting scheme that quantifies each sample's clinical importance via ensemble scores from multiple medical LLMs and visual tools; (2) two clinically grounded dispreference types, plausible hallucinations (generated jointly by GPT-4o and the target model) and lesion-region neglect (induced by localized lesion noising), to strengthen visual understanding of critical regions; and (3) weighted Direct Preference Optimization (DPO) that integrates the clinical-relevance scores as sample weights. Evaluated on Med-VQA and radiology report generation, MMedPO improves factual accuracy by an average of 14.2% and 51.7%, respectively, outperforming prior preference optimization approaches.
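The weighted DPO objective described above can be sketched in simplified scalar form. This is a minimal illustration, not the paper's implementation: the function name, the scalar log-probability inputs, and the default `beta` are assumptions; the only idea taken from the summary is scaling each preference pair's DPO loss by its clinical-relevance weight.

```python
import math

def weighted_dpo_loss(logp_chosen, logp_rejected,
                      ref_logp_chosen, ref_logp_rejected,
                      clinical_weight, beta=0.1):
    """Clinically weighted DPO loss for one preference pair (scalar sketch).

    logp_* are the policy model's log-probabilities of the chosen and
    rejected responses; ref_logp_* are the frozen reference model's.
    clinical_weight scales the pair's contribution by its estimated
    clinical relevance, as in MMedPO's weighting scheme.
    """
    # Implicit reward margin: difference of policy-vs-reference log-ratios
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Standard DPO term is -log sigmoid(margin); weight rescales it
    return -clinical_weight * math.log(1.0 / (1.0 + math.exp(-margin)))
```

A pair the policy already ranks correctly (positive margin) contributes less loss, and a clinically important pair (larger weight) contributes proportionally more gradient signal than a trivial one.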

📝 Abstract
The advancement of Large Vision-Language Models (LVLMs) has propelled their application in the medical field. However, Medical LVLMs (Med-LVLMs) encounter factuality challenges due to modality misalignment, where the models prioritize textual knowledge over visual input, leading to hallucinations that contradict information in medical images. Previous attempts to enhance modality alignment in Med-LVLMs through preference optimization have inadequately accounted for clinical relevance in preference data, making these samples easily distinguishable and reducing alignment effectiveness. To address this challenge, we propose MMedPO, a novel multimodal medical preference optimization approach that considers the clinical relevance of preference samples to enhance Med-LVLM alignment. MMedPO curates multimodal preference data by introducing two types of dispreference: (1) plausible hallucinations injected through target Med-LVLMs or GPT-4o to produce medically inaccurate responses, and (2) lesion region neglect achieved through local lesion-noising, disrupting visual understanding of critical areas. We then calculate clinical relevance for each sample based on scores from multiple Med-LLMs and visual tools, and integrate these scores into the preference optimization process as weights, enabling effective alignment. Our experiments demonstrate that MMedPO significantly enhances factual accuracy in Med-LVLMs, achieving average improvements of 14.2% and 51.7% over existing preference optimization methods on the Med-VQA and report generation tasks, respectively. Our code is available at https://github.com/aiming-lab/MMedPO.
Problem

Research questions and friction points this paper is trying to address.

Addressing modality misalignment in Medical LVLMs causing factual hallucinations
Enhancing clinical relevance in preference data for Med-LVLM alignment
Improving visual understanding of critical lesion areas in medical images
Innovation

Methods, ideas, or system contributions that make the work stand out.

Clinical-aware multimodal preference optimization for Med-LVLMs
Introduces plausible hallucinations and lesion-region neglect as clinically grounded dispreferred samples
Integrates clinical relevance scores as optimization weights
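The lesion-region neglect idea above can be sketched as a local lesion-noising step: corrupt only the lesion bounding box of an image to create the dispreferred visual input. This is a hypothetical illustration; the function name, Gaussian noise model, `sigma`, and the pixel-range clipping are assumptions, as the listing does not specify the perturbation's parameters.

```python
import random

def noise_lesion_region(image, box, sigma=0.5, seed=0):
    """Create a dispreferred image by noising a lesion bounding box.

    image: 2-D grayscale image as nested lists with values in [0, 1].
    box:   (x0, y0, x1, y1) lesion region, half-open on x1/y1.
    Pixels outside the box are left untouched, so only the model's
    understanding of the critical region is disrupted.
    """
    rng = random.Random(seed)
    x0, y0, x1, y1 = box
    noised = [row[:] for row in image]  # copy so the original is preserved
    for y in range(y0, y1):
        for x in range(x0, x1):
            # Add Gaussian noise, then clip back into the valid pixel range
            noised[y][x] = min(1.0, max(0.0, noised[y][x] + rng.gauss(0.0, sigma)))
    return noised
```

Pairing the clean image's response (preferred) with the response produced on the noised image (dispreferred) yields the lesion-neglect preference pairs that the clinical-relevance weights then rescale during optimization.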