CDPDNet: Integrating Text Guidance with Hybrid Vision Encoders for Medical Image Segmentation

๐Ÿ“… 2025-05-25
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Medical image segmentation faces two key challenges: (1) sparse, partial annotations in public datasets hinder cross-dataset anatomical representation learning; and (2) purely vision-based models struggle to capture complex anatomical relationships and task-specific variations, limiting accuracy and generalizability. To address these, the authors propose CDPDNet, a text-guided hybrid vision segmentation framework featuring a CLIP-DINO dual-encoder collaboration mechanism and a Text-based Task Prompt Generation (TTPG) module, enabling cross-dataset anatomical semantic alignment and fine-grained task discrimination. The method integrates a DINOv2 self-supervised ViT with a CNN backbone, fuses their features via multi-head cross-modal attention, projects CLIP text embeddings into the visual space, and guides segmentation with learnable task-specific text prompts. Evaluated across multiple medical imaging benchmarks, it outperforms state-of-the-art methods in segmentation accuracy, cross-dataset generalization, and robustness to partial annotations.

๐Ÿ“ Abstract
Most publicly available medical segmentation datasets are only partially labeled, with annotations provided for a subset of anatomical structures. When multiple datasets are combined for training, this incomplete annotation poses challenges, as it limits the model's ability to learn shared anatomical representations among datasets. Furthermore, vision-only frameworks often fail to capture complex anatomical relationships and task-specific distinctions, leading to reduced segmentation accuracy and poor generalizability to unseen datasets. In this study, we proposed a novel CLIP-DINO Prompt-Driven Segmentation Network (CDPDNet), which combined a self-supervised vision transformer with CLIP-based text embedding and introduced task-specific text prompts to tackle these challenges. Specifically, the framework was constructed upon a convolutional neural network (CNN) and incorporated DINOv2 to extract both fine-grained and global visual features, which were then fused using a multi-head cross-attention module to overcome the limited long-range modeling capability of CNNs. In addition, CLIP-derived text embeddings were projected into the visual space to help model complex relationships among organs and tumors. To further address the partial label challenge and enhance inter-task discriminative capability, a Text-based Task Prompt Generation (TTPG) module that generated task-specific prompts was designed to guide the segmentation. Extensive experiments on multiple medical imaging datasets demonstrated that CDPDNet consistently outperformed existing state-of-the-art segmentation methods. Code and pretrained model are available at: https://github.com/wujiong-hub/CDPDNet.git.
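The abstract's fusion step, where CNN features gain long-range context from DINOv2 tokens via multi-head cross-attention, can be sketched as below. This is a minimal illustration, not the paper's implementation: feature dimensions, token counts, the head count, and the random stand-ins for learned projection weights are all assumptions for demonstration. CNN patch features act as queries; DINOv2 tokens supply keys and values.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_cross_attention(q_feats, kv_feats, n_heads, rng):
    """Fuse CNN features (queries) with DINOv2 tokens (keys/values)."""
    # q_feats: (Nq, d) flattened CNN feature map; kv_feats: (Nk, d) ViT tokens
    nq, d = q_feats.shape
    dh = d // n_heads
    # Random matrices stand in for learned projections in this sketch.
    wq, wk, wv, wo = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))
    q = (q_feats @ wq).reshape(nq, n_heads, dh).transpose(1, 0, 2)
    k = (kv_feats @ wk).reshape(-1, n_heads, dh).transpose(1, 0, 2)
    v = (kv_feats @ wv).reshape(-1, n_heads, dh).transpose(1, 0, 2)
    # Scaled dot-product attention per head, then concatenate heads.
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh), axis=-1)
    out = (attn @ v).transpose(1, 0, 2).reshape(nq, d)
    return out @ wo

rng = np.random.default_rng(0)
cnn_feats = rng.standard_normal((196, 64))    # e.g. 14x14 CNN map, flattened
dino_feats = rng.standard_normal((256, 64))   # e.g. 16x16 DINOv2 tokens
fused = multi_head_cross_attention(cnn_feats, dino_feats, n_heads=8, rng=rng)
print(fused.shape)  # (196, 64): one context-enriched vector per CNN location
```

Because queries come from the CNN branch, the output keeps the CNN's spatial resolution while each location attends over all ViT tokens, which is one way to compensate for the limited receptive field of convolutions.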
Problem

Research questions and friction points this paper is trying to address.

Addresses incomplete annotations in medical image datasets
Enhances anatomical relationship modeling with text guidance
Improves segmentation accuracy and cross-dataset generalizability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines CLIP text embedding with DINOv2 vision features
Uses multi-head cross-attention for feature fusion
Generates task-specific text prompts for segmentation
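The last two points, projecting a CLIP text embedding into the visual space and using it to steer the features for a given task, can be sketched as follows. The gating form, the 512-dimensional text embedding (typical of CLIP ViT-B/32), and the random projection weights are assumptions for illustration; the paper's TTPG module is not reproduced here.

```python
import numpy as np

def text_guided_modulation(vis_feats, text_emb, w_proj, eps=1e-6):
    """Channel-wise gating of visual tokens by a projected text embedding."""
    # vis_feats: (N, d) visual tokens; text_emb: (d_text,) CLIP text embedding
    t = text_emb @ w_proj                  # project text into visual space: (d,)
    t = t / (np.linalg.norm(t) + eps)      # normalize the projected prompt
    gate = 1.0 / (1.0 + np.exp(-t))        # sigmoid gate in (0, 1) per channel
    return vis_feats * gate                # emphasize task-relevant channels

rng = np.random.default_rng(1)
vis = rng.standard_normal((196, 64))       # visual tokens from the fused encoder
txt = rng.standard_normal(512)             # assumed CLIP text embedding size
w_proj = rng.standard_normal((512, 64)) / np.sqrt(512)
out = text_guided_modulation(vis, txt, w_proj)
print(out.shape)  # (196, 64)
```

A different text prompt (e.g. "liver" vs. "pancreas tumor") yields a different gate over the same visual features, which is one plausible mechanism for the inter-task discrimination the paper attributes to its text prompts.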
๐Ÿ”Ž Similar Papers
No similar papers found.
Authors

Jiong Wu, University of Florida (medical image analysis)
Yang Xing, J. Crayton Pruitt Family Department of Biomedical Engineering, University of Florida, Gainesville, FL, 32611, USA
Boxiao Yu, University of Florida (Deep Learning, PET)
Wei Shao, Department of Medicine, University of Florida, Gainesville, FL, 32611, USA
Kuang Gong, Assistant Professor of Biomedical Engineering, University of Florida (PET, MRI, CT, Inverse Problem, Machine Learning)