🤖 AI Summary
This work addresses zero-shot phrase grounding in medical imaging: localizing lesion regions described by natural language phrases without annotated localization training data. Methodologically, it leverages the cross-attention maps inherent to generative text-to-image diffusion models to produce initial localization maps, and introduces Bimodal Bias Merging (BBM), a plug-and-play post-processing technique that aligns text and image biases to identify high-certainty regions, thereby refining the attention maps. It further demonstrates that fine-tuning the diffusion model with a frozen, domain-specific language model (e.g., CXR-BERT) substantially outperforms domain-agnostic counterparts. Compared to existing discriminative approaches, the method roughly doubles mean Intersection-over-Union (mIoU) in the zero-shot setting, significantly improving localization accuracy. Key contributions include: (i) applying generative diffusion models to medical phrase grounding; (ii) BBM as a lightweight, plug-and-play cross-modal refinement mechanism; and (iii) empirical evidence that keeping a domain-specific text encoder frozen during fine-tuning outperforms domain-agnostic alternatives in low-resource medical vision-language modeling.
📝 Abstract
Phrase grounding, i.e., mapping natural language phrases to specific image regions, holds significant potential for disease localization in medical imaging through clinical reports. While current state-of-the-art methods rely on discriminative, self-supervised contrastive models, we demonstrate that generative text-to-image diffusion models, leveraging cross-attention maps, can achieve superior zero-shot phrase grounding performance. Contrary to prior assumptions, we show that fine-tuning diffusion models with a frozen, domain-specific language model, such as CXR-BERT, substantially outperforms domain-agnostic counterparts. This setup achieves remarkable improvements, with mIoU scores doubling those of current discriminative methods. These findings highlight the underexplored potential of generative models for phrase grounding tasks. To further enhance performance, we introduce Bimodal Bias Merging (BBM), a novel post-processing technique that aligns text and image biases to identify regions of high certainty. BBM refines cross-attention maps, achieving even greater localization accuracy. Our results establish generative approaches as a more effective paradigm for phrase grounding in the medical imaging domain, paving the way for more robust and interpretable applications in clinical practice. The source code and model weights are available at https://github.com/Felix-012/generate_to_ground.
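The core mechanism described in the abstract, reading a cross-attention map out of a text-to-image model and turning it into a localization mask, can be sketched in a few lines. The snippet below is a minimal, self-contained illustration (not the authors' implementation): it assumes hypothetical image-patch query features from a UNet layer and text-token key features from the text encoder, computes a scaled-dot-product attention map for one phrase token, normalizes it, and scores a thresholded mask with IoU. Shapes, layer choices, and the threshold are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_map(img_feats, txt_feats, token_idx, grid=16):
    """Attention of every image patch to one text token.

    img_feats: (grid*grid, d) query features from a UNet layer (hypothetical)
    txt_feats: (num_tokens, d) key features from the text encoder (hypothetical)
    Returns a (grid, grid) map normalized to [0, 1].
    """
    d = img_feats.shape[-1]
    attn = softmax(img_feats @ txt_feats.T / np.sqrt(d), axis=-1)  # (HW, T)
    amap = attn[:, token_idx].reshape(grid, grid)
    # min-max normalize so a single threshold is meaningful
    return (amap - amap.min()) / (amap.max() - amap.min() + 1e-8)

def binarize(amap, thresh=0.5):
    # threshold the attention map into a binary localization mask
    return (amap >= thresh).astype(np.uint8)

def iou(pred, gt):
    # intersection-over-union between two binary masks
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 0.0
```

In the paper's setting the attention maps come from the diffusion model's own cross-attention layers (typically averaged over heads, layers, and denoising timesteps) rather than from features computed this way, and BBM then refines the map before thresholding.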