MedP-CLIP: Medical CLIP with Region-Aware Prompt Integration

📅 2026-04-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge that existing vision-language models struggle to accurately capture fine-grained regional semantics in medical images. To this end, the authors propose MedP-CLIP, a region-aware medical vision-language model that integrates medical prior knowledge and supports diverse regional prompts, including points, bounding boxes, and masks. Its core innovation is a feature-level regional prompt fusion mechanism that enables flexible responses to local regions while preserving global context. MedP-CLIP is presented as the first model to achieve fine-grained spatial semantic understanding on large-scale, cross-disease, and cross-modality medical data. Built on a contrastive learning framework and pretrained with regional prompt embeddings and medical priors, it significantly outperforms baseline methods in zero-shot recognition, interactive segmentation, and multimodal large language model enhancement, serving as a plug-and-play visual backbone for medical applications.
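
The page ships no code, so the following is a minimal PyTorch sketch of what a feature-level regional prompt fusion mechanism could look like. The `RegionPromptFusion` module, its dimensions, and the gated-addition fusion are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of feature-level regional prompt fusion; the module
# name, dimensions, and gated-addition design are assumptions, not the
# published MedP-CLIP architecture.
import torch
import torch.nn as nn

class RegionPromptFusion(nn.Module):
    """Fuse a region prompt (point, box, or mask) into patch-level image features."""

    def __init__(self, dim: int = 768, grid: int = 14):
        super().__init__()
        self.grid = grid
        # One lightweight encoder per prompt type (an illustrative choice):
        self.point_enc = nn.Linear(2, dim)           # (x, y), normalized to [0, 1]
        self.box_enc = nn.Linear(4, dim)             # (x1, y1, x2, y2) in [0, 1]
        self.mask_enc = nn.Linear(grid * grid, dim)  # binary mask on the patch grid
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, patch_feats, prompt, kind):
        # patch_feats: (B, N, dim) patch tokens from the image encoder.
        if kind == "point":
            p = self.point_enc(prompt)                    # prompt: (B, 2)
        elif kind == "box":
            p = self.box_enc(prompt)                      # prompt: (B, 4)
        elif kind == "mask":
            p = self.mask_enc(prompt.flatten(1).float())  # prompt: (B, grid, grid)
        else:
            raise ValueError(f"unknown prompt kind: {kind}")
        p = p.unsqueeze(1).expand_as(patch_feats)           # broadcast to every patch
        g = self.gate(torch.cat([patch_feats, p], dim=-1))  # per-patch fusion gate
        # Gated addition keeps global context while injecting the regional signal.
        return patch_feats + g * p
```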

📝 Abstract
Contrastive Language-Image Pre-training (CLIP) has demonstrated outstanding performance in global image understanding and zero-shot transfer through large-scale text-image alignment. However, medical image analysis often hinges on fine-grained understanding of specific anatomical structures or lesion regions, so precisely comprehending region-of-interest (RoI) information provided by medical professionals or perception models becomes crucial. To address this need, we propose MedP-CLIP, a region-aware medical vision-language model (VLM). MedP-CLIP integrates medical prior knowledge and introduces a feature-level region prompt integration mechanism, enabling it to respond flexibly to various prompt forms (e.g., points, bounding boxes, masks) while maintaining global contextual awareness when focusing on local regions. We pre-train the model on a meticulously constructed large-scale dataset (over 6.4 million medical images and 97.3 million region-level annotations), equipping it with cross-disease and cross-modality fine-grained spatial semantic understanding. Experiments demonstrate that MedP-CLIP significantly outperforms baseline methods on various medical tasks, including zero-shot recognition, interactive segmentation, and empowering multimodal large language models. The model provides a scalable, plug-and-play visual backbone for medical AI, combining holistic image understanding with precise regional analysis.
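
As a usage illustration, here is a hedged sketch of CLIP-style zero-shot recognition driven by a bounding-box prompt, reusing the hypothetical `RegionPromptFusion` above. The encoder interfaces, the mean pooling, and the fixed temperature of 100 are assumptions; MedP-CLIP's actual API is not described on this page.

```python
# Hedged sketch: zero-shot region classification with a box prompt.
# image_encoder / text_encoder are placeholder callables, not MedP-CLIP's API.
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_region_classify(image_encoder, text_encoder, fusion,
                              image, box, class_texts):
    """Rank text prompts for one image given a bounding-box region prompt."""
    patch_feats = image_encoder(image)                 # (1, N, dim) patch tokens
    fused = fusion(patch_feats, box, kind="box")       # inject the RoI prompt
    img_emb = F.normalize(fused.mean(dim=1), dim=-1)   # pooled, region-aware (1, dim)
    txt_emb = F.normalize(text_encoder(class_texts), dim=-1)  # (C, dim)
    logits = 100.0 * img_emb @ txt_emb.T               # temperature-scaled cosine sims
    return logits.softmax(dim=-1)                      # (1, C) class probabilities

# Hypothetical usage: score diagnosis prompts for a boxed lesion.
# probs = zero_shot_region_classify(
#     img_enc, txt_enc, RegionPromptFusion(),
#     ct_slice, torch.tensor([[0.30, 0.40, 0.60, 0.70]]),
#     ["a CT scan showing a benign lung nodule",
#      "a CT scan showing a malignant lung tumor"])
```
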
Problem

Research questions and friction points this paper is trying to address.

medical image analysis
region-of-interest
fine-grained understanding
vision-language model
CLIP
Innovation

Methods, ideas, or system contributions that make the work stand out.

region-aware prompting
medical vision-language model
feature-level prompt integration
cross-modality understanding
zero-shot medical recognition