🤖 AI Summary
Medical image grounding faces challenges including narrow modality coverage, coarse-grained annotations, and the absence of a unified framework. To address these, we introduce Med-GLIP-5M, the first large-scale multimodal medical vision-language grounding dataset, comprising seven imaging modalities and 5.3 million fine-grained region-level annotations covering both anatomical and pathological structures. We further propose Med-GLIP, a modality-aware unified grounding framework that integrates region-level vision-language contrastive pretraining, modality-specific prompt learning, and hierarchical label modeling. Without requiring expert-designed modules, Med-GLIP implicitly captures multi-granularity semantics (e.g., organ-lesion relationships) and jointly supports grounding and segmentation. Extensive experiments demonstrate significant improvements over state-of-the-art methods across multiple medical grounding benchmarks. Moreover, Med-GLIP consistently enhances downstream performance in medical visual question answering and report generation, advancing interpretable and fine-grained medical visual understanding.
📝 Abstract
Medical image grounding aims to align natural language phrases with specific regions in medical images, serving as a foundational task for intelligent diagnosis, visual question answering (VQA), and medical report generation (MRG). However, existing research is constrained by limited modality coverage, coarse-grained annotations, and the absence of a unified, generalizable grounding framework. To address these challenges, we construct a large-scale medical grounding dataset, Med-GLIP-5M, comprising over 5.3 million region-level annotations across seven imaging modalities and covering diverse anatomical structures and pathological findings. The dataset supports both segmentation and grounding tasks with hierarchical region labels, ranging from organ-level boundaries to fine-grained lesions. Based on this foundation, we propose Med-GLIP, a modality-aware grounding framework trained on Med-GLIP-5M. Rather than relying on explicitly designed expert modules, Med-GLIP implicitly acquires hierarchical semantic understanding from diverse training data, enabling it to recognize multi-granularity structures, such as distinguishing lungs from pneumonia lesions. Extensive experiments demonstrate that Med-GLIP consistently outperforms state-of-the-art baselines across multiple grounding benchmarks. Furthermore, integrating its spatial outputs into downstream tasks, including medical VQA and report generation, leads to substantial performance gains. Our dataset will be released soon.
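The region-level vision-language contrastive pretraining mentioned above can be illustrated with a minimal sketch. The paper does not specify its exact objective; the code below assumes a standard symmetric InfoNCE loss over matched region-phrase embedding pairs (e.g., a lesion box paired with the phrase describing it), using NumPy for clarity. Function and parameter names are illustrative, not from the paper.

```python
import numpy as np

def region_text_contrastive_loss(region_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss between matched region and phrase embeddings.

    region_feats, text_feats: (N, D) arrays where row i of each is a matched
    region-phrase pair; all other rows serve as in-batch negatives.
    (Illustrative sketch -- not the paper's actual implementation.)
    """
    # L2-normalize so the dot product is cosine similarity
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = r @ t.T / temperature  # (N, N) similarity matrix
    labels = np.arange(len(logits))  # matched pairs lie on the diagonal

    def xent(l):
        # row-wise cross-entropy against the diagonal targets
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # average of region->text and text->region directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Under this kind of objective, region features for "lung" and "pneumonia lesion" are pulled toward their respective phrases and pushed apart from each other, which is one way multi-granularity (organ vs. lesion) distinctions can emerge without hand-designed expert modules.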