🤖 AI Summary
This work addresses zero-shot phrase grounding in medical imaging: localizing lesion regions described by natural language phrases without annotated localization training data. Methodologically, it leverages the cross-attention maps inherent to generative text-to-image diffusion models to produce initial localization maps, and introduces Bimodal Bias Merging (BBM), a plug-and-play post-processing technique that aligns text and image biases to identify high-certainty regions, thereby refining the attention maps. It further demonstrates that fine-tuning the diffusion model with a frozen, domain-specific language model (e.g., CXR-BERT) substantially outperforms domain-agnostic counterparts. Compared to existing discriminative approaches, the method roughly doubles mean Intersection-over-Union (mIoU) in the zero-shot setting, significantly improving localization accuracy. Key contributions include: (i) applying generative diffusion models to medical phrase grounding; (ii) BBM as a lightweight, plug-and-play cross-modal refinement mechanism; and (iii) empirical evidence that keeping a domain-specific text encoder frozen during fine-tuning outperforms domain-agnostic alternatives in low-resource medical vision-language modeling.
📝 Abstract
Phrase grounding, i.e., mapping natural language phrases to specific image regions, holds significant potential for disease localization in medical imaging through clinical reports. While current state-of-the-art methods rely on discriminative, self-supervised contrastive models, we demonstrate that generative text-to-image diffusion models, leveraging cross-attention maps, can achieve superior zero-shot phrase grounding performance. Contrary to prior assumptions, we show that fine-tuning diffusion models with a frozen, domain-specific language model, such as CXR-BERT, substantially outperforms domain-agnostic counterparts. This setup achieves remarkable improvements, with mIoU scores doubling those of current discriminative methods. These findings highlight the underexplored potential of generative models for phrase grounding tasks. To further enhance performance, we introduce Bimodal Bias Merging (BBM), a novel post-processing technique that aligns text and image biases to identify regions of high certainty. BBM refines cross-attention maps, achieving even greater localization accuracy. Our results establish generative approaches as a more effective paradigm for phrase grounding in the medical imaging domain, paving the way for more robust and interpretable applications in clinical practice. The source code and model weights are available at https://github.com/Felix-012/generate_to_ground.
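The core mechanism described in the abstract, reading a cross-attention map out of a text-to-image model and turning it into a localization mask, can be sketched in a few lines. The snippet below is a minimal, self-contained illustration (not the authors' implementation): it assumes hypothetical image-patch query features from a UNet layer and text-token key features from the text encoder, computes a scaled-dot-product attention map for one phrase token, normalizes it, and scores a thresholded mask with IoU. Shapes, layer choices, and the threshold are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_map(img_feats, txt_feats, token_idx, grid=16):
    """Attention of every image patch to one text token.

    img_feats: (grid*grid, d) query features from a UNet layer (hypothetical)
    txt_feats: (num_tokens, d) key features from the text encoder (hypothetical)
    Returns a (grid, grid) map normalized to [0, 1].
    """
    d = img_feats.shape[-1]
    attn = softmax(img_feats @ txt_feats.T / np.sqrt(d), axis=-1)  # (HW, T)
    amap = attn[:, token_idx].reshape(grid, grid)
    # min-max normalize so a single threshold is meaningful
    return (amap - amap.min()) / (amap.max() - amap.min() + 1e-8)

def binarize(amap, thresh=0.5):
    # threshold the attention map into a binary localization mask
    return (amap >= thresh).astype(np.uint8)

def iou(pred, gt):
    # intersection-over-union between two binary masks
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 0.0
```

In the paper's setting the attention maps come from the diffusion model's own cross-attention layers (typically averaged over heads, layers, and denoising timesteps) rather than from features computed this way, and BBM then refines the map before thresholding.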