Biomed-DPT: Dual Modality Prompt Tuning for Biomedical Vision-Language Models

📅 2025-05-08
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing prompt-learning methods for few-shot biomedical image classification rely mainly on text prompts and neglect anatomical structures and subtle pathological features in the images. To address this, the paper proposes Biomed-DPT, a knowledge-enhanced dual-modality prompt tuning framework. The text branch combines template-driven clinical prompts with large language model (LLM)-generated domain-adapted prompts and uses knowledge distillation to extract clinical knowledge from the latter. The vision branch introduces a zero vector as a soft prompt, which re-weights attention so that non-diagnostic regions and non-critical pathological features receive less focus. Evaluated on 11 datasets spanning 9 imaging modalities and 10 organs, the method achieves a mean accuracy of 66.14%, with base-class and novel-class accuracies of 78.06% and 75.97%, surpassing CoOp by 6.20, 3.78, and 8.04 percentage points, respectively.

📝 Abstract
Prompt learning is one of the most effective paradigms for adapting pre-trained vision-language models (VLMs) to biomedical image classification in few-shot scenarios. However, most current prompt learning methods use only text prompts and ignore the particular structures in biomedical images, such as complex anatomical structures and subtle pathological features. In this work, we propose Biomed-DPT, a knowledge-enhanced dual-modality prompt tuning technique. For the text prompt, Biomed-DPT constructs a dual prompt consisting of template-driven clinical prompts and large language model (LLM)-driven domain-adapted prompts, then extracts clinical knowledge from the domain-adapted prompts through knowledge distillation. For the vision prompt, Biomed-DPT introduces a zero vector as a soft prompt to leverage attention re-weighting, so that the model avoids focusing on non-diagnostic regions and recognizing non-critical pathological features. Biomed-DPT achieves an average classification accuracy of 66.14% across 11 biomedical image datasets covering 9 modalities and 10 organs, reaching 78.06% on base classes and 75.97% on novel classes, surpassing the Context Optimization (CoOp) method by 6.20%, 3.78%, and 8.04%, respectively. Our code is available at https://github.com/Kanyooo/Biomed-DPT.
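The reported gains over CoOp imply a set of baseline accuracies that the abstract does not state directly. The short Python sketch below recovers them by subtraction. It assumes each gain is an absolute difference in percentage points rather than a relative improvement; the dictionary keys are illustrative labels, not names from the paper.

```python
# Accuracy of Biomed-DPT and its stated gain over CoOp, as given in the abstract.
reported = {
    "average": (66.14, 6.20),
    "base": (78.06, 3.78),
    "novel": (75.97, 8.04),
}

# Assumption: each gain is an absolute difference in percentage points.
implied_coop = {
    split: round(accuracy - gain, 2)
    for split, (accuracy, gain) in reported.items()
}

for split, value in implied_coop.items():
    print(f"{split}: implied CoOp accuracy {value:.2f}%")
```

Under that assumption, the largest gap is on novel classes, which suggests the benefit lies mainly in generalization to unseen categories.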
Problem

Research questions and friction points this paper is trying to address.

Adapting vision-language models to biomedical image classification
Enhancing text and vision prompts for biomedical structures
Improving few-shot learning accuracy in diverse biomedical datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual modality prompt tuning for biomedical VLMs
Knowledge-enhanced text and vision prompt design
Zero vector soft prompt for attention re-weighting
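One way to see why a zero vector can re-weight attention is to look at a single softmax attention row. The sketch below is illustrative only and is not the authors' implementation. It assumes the zero prompt acts as an extra key in scaled dot-product attention; the toy keys and queries are hypothetical. A zero key always yields a logit of 0, so it absorbs little attention from queries that match some patch strongly and a large share from queries that match no patch well.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over a list of keys."""
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(logits)

# Hypothetical toy keys for three image patches (not taken from the paper).
patch_keys = [[2.0, 0.0], [0.0, 1.0], [1.0, 0.5]]
zero_prompt = [0.0, 0.0]  # assumed form of the zero-vector soft prompt

def prompt_share(query):
    """Attention mass that the zero prompt receives for a given query."""
    return attention_weights(query, patch_keys + [zero_prompt])[-1]

strong_query = [3.0, 0.0]    # aligns well with the first patch key
weak_query = [-3.0, -3.0]    # aligns poorly with every patch key

before = attention_weights(strong_query, patch_keys)
after = attention_weights(strong_query, patch_keys + [zero_prompt])

# Every patch weight shrinks by the same factor; the prompt takes the rest.
ratios = [a / b for a, b in zip(after, before)]

print("zero-prompt share, strong query:", round(prompt_share(strong_query), 3))
print("zero-prompt share, weak query:  ", round(prompt_share(weak_query), 3))
```

In this toy setting the relative ordering of patches is unchanged, but the total attention left for patches depends on how strongly the query matches them, which is one plausible reading of "attention re-weighting".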