Med-GLIP: Advancing Medical Language-Image Pre-training with Large-scale Grounded Dataset

📅 2025-08-14
🤖 AI Summary
Medical image grounding faces challenges including narrow modality coverage, coarse-grained annotations, and the absence of a unified framework. To address these, we introduce Med-GLIP-5M—the first large-scale, multimodal medical vision-language grounding dataset—comprising seven imaging modalities and 5.3 million region-level fine-grained annotations covering both anatomical and pathological structures. We further propose Med-GLIP, a modality-aware unified grounding framework that integrates region-level vision-language contrastive pretraining, modality-specific prompt learning, and hierarchical label modeling. Without requiring expert-designed modules, Med-GLIP implicitly captures multi-granularity semantics (e.g., organ–lesion relationships) and jointly supports grounding and segmentation. Extensive experiments demonstrate significant improvements over state-of-the-art methods across multiple medical grounding benchmarks. Moreover, Med-GLIP consistently enhances downstream performance in medical visual question answering and report generation, advancing interpretable and fine-grained medical visual understanding.

📝 Abstract
Medical image grounding aims to align natural language phrases with specific regions in medical images, serving as a foundational task for intelligent diagnosis, visual question answering (VQA), and automated medical report generation (MRG). However, existing research is constrained by limited modality coverage, coarse-grained annotations, and the absence of a unified, generalizable grounding framework. To address these challenges, we construct a large-scale medical grounding dataset, Med-GLIP-5M, comprising over 5.3 million region-level annotations across seven imaging modalities and covering diverse anatomical structures and pathological findings. The dataset supports both segmentation and grounding tasks with hierarchical region labels, ranging from organ-level boundaries to fine-grained lesions. Based on this foundation, we propose Med-GLIP, a modality-aware grounding framework trained on Med-GLIP-5M. Rather than relying on explicitly designed expert modules, Med-GLIP implicitly acquires hierarchical semantic understanding from diverse training data, enabling it to recognize multi-granularity structures, such as distinguishing lungs from pneumonia lesions. Extensive experiments demonstrate that Med-GLIP consistently outperforms state-of-the-art baselines across multiple grounding benchmarks. Furthermore, integrating its spatial outputs into downstream tasks, including medical VQA and report generation, leads to substantial performance gains. Our dataset will be released soon.
Problem

Research questions and friction points this paper is trying to address.

Aligning natural language phrases with medical image regions
Overcoming limited modality coverage and coarse annotations
Creating a unified framework for medical image grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale dataset with 5.3M annotations
Modality-aware framework for medical grounding
Hierarchical semantic understanding without expert modules
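The region-level vision-language contrastive pretraining mentioned above is not detailed on this page; a minimal sketch, assuming a symmetric InfoNCE objective over paired region features and phrase embeddings (the function name, feature shapes, and temperature are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def region_text_contrastive_loss(region_feats, phrase_feats, temperature=0.07):
    """Hypothetical symmetric InfoNCE loss over matched region/phrase pairs.

    region_feats: (N, D) pooled visual features, one per annotated region
    (e.g. an organ boundary or lesion); phrase_feats: (N, D) text embeddings
    of the corresponding phrases. Row i of each array is assumed to be a
    matched pair.
    """
    # L2-normalize both modalities so logits are scaled cosine similarities
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    p = phrase_feats / np.linalg.norm(phrase_feats, axis=1, keepdims=True)
    logits = r @ p.T / temperature  # (N, N); diagonal holds matched pairs

    def diagonal_cross_entropy(mat):
        # Row-wise log-softmax; the target for row i is column i
        mat = mat - mat.max(axis=1, keepdims=True)
        log_probs = mat - np.log(np.exp(mat).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average region-to-phrase and phrase-to-region directions
    return (diagonal_cross_entropy(logits)
            + diagonal_cross_entropy(logits.T)) / 2
```

Under this formulation, correctly paired region and phrase embeddings drive the loss toward zero, while mismatched pairings increase it, which is the mechanism by which contrastive pretraining aligns phrases with image regions.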
👥 Authors
Ziye Deng, Zhejiang University
Ruihan He, Zhejiang University
Jiaxiang Liu, Zhejiang University (Multimodal Fusion, Medical Image Analysis)
Yuan Wang, Zhejiang University
Zijie Meng, Zhejiang University
Songtao Jiang, Zhejiang University (Vision-Language Models, AI for Bioinformatics and Medical)
Yong Xie, Nanjing University of Posts and Telecommunications
Zuozhu Liu, Assistant Professor, Zhejiang University / University of Illinois Urbana-Champaign (deep learning, vision-language models, medical AI)