🤖 AI Summary
Medical image grounding faces challenges including narrow modality coverage, coarse-grained annotations, and the absence of a unified framework. To address these, we introduce Med-GLIP-5M, the first large-scale multimodal medical vision-language grounding dataset, comprising seven imaging modalities and 5.3 million fine-grained region-level annotations covering both anatomical and pathological structures. We further propose Med-GLIP, a modality-aware unified grounding framework that integrates region-level vision-language contrastive pretraining, modality-specific prompt learning, and hierarchical label modeling. Without requiring expert-designed modules, Med-GLIP implicitly captures multi-granularity semantics (e.g., organ-lesion relationships) and jointly supports grounding and segmentation. Extensive experiments demonstrate significant improvements over state-of-the-art methods across multiple medical grounding benchmarks. Moreover, Med-GLIP consistently enhances downstream performance in medical visual question answering and report generation, advancing interpretable and fine-grained medical visual understanding.
📝 Abstract
Medical image grounding aims to align natural language phrases with specific regions in medical images, serving as a foundational task for intelligent diagnosis, visual question answering (VQA), and medical report generation (MRG). However, existing research is constrained by limited modality coverage, coarse-grained annotations, and the absence of a unified, generalizable grounding framework. To address these challenges, we construct a large-scale medical grounding dataset, Med-GLIP-5M, comprising over 5.3 million region-level annotations across seven imaging modalities and covering diverse anatomical structures and pathological findings. The dataset supports both segmentation and grounding tasks with hierarchical region labels, ranging from organ-level boundaries to fine-grained lesions. Based on this foundation, we propose Med-GLIP, a modality-aware grounding framework trained on Med-GLIP-5M. Rather than relying on explicitly designed expert modules, Med-GLIP implicitly acquires hierarchical semantic understanding from diverse training data, enabling it to recognize multi-granularity structures, such as distinguishing lungs from pneumonia lesions. Extensive experiments demonstrate that Med-GLIP consistently outperforms state-of-the-art baselines across multiple grounding benchmarks. Furthermore, integrating its spatial outputs into downstream tasks, including medical VQA and report generation, leads to substantial performance gains. Our dataset will be released soon.
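The region-level vision-language contrastive pretraining mentioned above can be illustrated with a minimal sketch. The paper does not specify its exact objective; the code below assumes a standard symmetric InfoNCE loss over matched region-phrase embedding pairs (e.g., a lesion box paired with the phrase describing it), using NumPy for clarity. Function and parameter names are illustrative, not from the paper.

```python
import numpy as np

def region_text_contrastive_loss(region_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss between matched region and phrase embeddings.

    region_feats, text_feats: (N, D) arrays where row i of each is a matched
    region-phrase pair; all other rows serve as in-batch negatives.
    (Illustrative sketch -- not the paper's actual implementation.)
    """
    # L2-normalize so the dot product is cosine similarity
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = r @ t.T / temperature  # (N, N) similarity matrix
    labels = np.arange(len(logits))  # matched pairs lie on the diagonal

    def xent(l):
        # row-wise cross-entropy against the diagonal targets
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # average of region->text and text->region directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Under this kind of objective, region features for "lung" and "pneumonia lesion" are pulled toward their respective phrases and pushed apart from each other, which is one way multi-granularity (organ vs. lesion) distinctions can emerge without hand-designed expert modules.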