🤖 AI Summary
Medical vision-language models (Med-VLMs) suffer from unreliable clinical responses due to modality misalignment. To address this, we propose a hierarchical self-contrastive reward mechanism that enables low-cost generation of high-quality preference data and facilitates fine-grained, context-aware alignment optimization. Our key contributions are: (1) the first implicit alignment reward function triggered by visual token dropout, which quantifies alignment quality without explicit supervision; and (2) a multi-level preference optimization strategy that constructs low-quality responses via hallucination-based replacement to elicit implicit relative quality signals. Evaluated on Med-VQA, medical image captioning, and instruction-following tasks, our method achieves significant improvements in zero-shot performance, cross-modal alignment fidelity, and clinical credibility, despite requiring only 2,000 preference samples. The approach bridges the gap between visual perception and clinical reasoning while preserving computational efficiency and scalability.
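The dropout-triggered reward described above can be sketched in a toy form: score each generated token by how much its log-probability drops when the visual tokens are removed from the context. Everything below (function names, the two-logit-vector interface, the simple log-softmax comparison) is an illustrative assumption, not the paper's exact formulation.

```python
import math

def log_softmax_at(logits, idx):
    """Numerically stable log-softmax of `logits` evaluated at index `idx`."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[idx] - lse

def implicit_alignment_reward(logits_with_image, logits_dropped, token_id):
    """Toy implicit alignment reward (illustrative, not the paper's notation):
    log p(token | full input) - log p(token | visual tokens dropped).
    A large positive value marks a modality-coupled token, i.e. one whose
    probability depends heavily on the image; such tokens are the candidates
    for hallucinated replacement when building dispreferred responses."""
    return (log_softmax_at(logits_with_image, token_id)
            - log_softmax_at(logits_dropped, token_id))
```

For example, a token whose logit falls from 2.0 to 0.5 once the image is dropped receives a positive reward, flagging it as visually grounded; a token whose probability rises without the image receives a negative one.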
📝 Abstract
Medical Vision-Language Models (Med-VLMs) have achieved success across various tasks, yet most existing methods overlook the modality misalignment issue that can lead to untrustworthy responses in clinical settings. In this paper, we propose Hierarchical Self-Contrastive Rewarding (HSCR), a novel approach that addresses two critical challenges in Med-VLM alignment: (1) cost-effective generation of high-quality preference data; and (2) capturing nuanced, context-aware preferences for improved alignment. HSCR first leverages the inherent capability of Med-VLMs to generate dispreferred responses with higher sampling probability. By analyzing output logit shifts after visual token dropout, we identify modality-coupled tokens that induce misalignment and derive an implicit alignment reward function. This function guides token replacement with hallucinated ones during decoding, producing high-quality dispreferred data. Furthermore, HSCR introduces a multi-level preference optimization strategy, which extends beyond traditional adjacent-level optimization by incorporating nuanced implicit preferences, leveraging the relative quality of dispreferred data to capture subtle alignment cues for more precise and context-aware optimization. Extensive experiments across multiple medical tasks, including Med-VQA, medical image captioning, and instruction following, demonstrate that HSCR not only enhances zero-shot performance but also significantly improves modality alignment and trustworthiness with just 2,000 training entries.
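The multi-level preference optimization can be illustrated with a minimal DPO-style sketch: given responses ranked best to worst (preferred, then dispreferred responses of decreasing relative quality), a pairwise logistic loss is summed over every ordered pair, not just adjacent levels, so the relative quality among dispreferred responses also contributes a learning signal. The function name, the log-ratio interface, and the uniform pair weighting are assumptions for illustration, not the paper's exact objective.

```python
import math

def multilevel_preference_loss(logratios, beta=0.1):
    """Sketch of a multi-level DPO-style objective (illustrative).
    `logratios[k]` is the policy-vs-reference log-probability ratio of the
    k-th response, with responses ordered best -> worst. For every ordered
    pair (i, j) with i ranked above j, we add -log(sigmoid(beta * margin)),
    rewarding the model for preferring the higher-ranked response."""
    total, pairs = 0.0, 0
    for i in range(len(logratios)):
        for j in range(i + 1, len(logratios)):
            margin = beta * (logratios[i] - logratios[j])
            total += -math.log(1.0 / (1.0 + math.exp(-margin)))
            pairs += 1
    return total / pairs
```

With three levels (one preferred, two dispreferred), this yields three pairwise terms; the dispreferred-vs-dispreferred pair is the extra "implicit preference" signal that adjacent-level-only optimization would discard.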