DA-VPT: Semantic-Guided Visual Prompt Tuning for Vision Transformers

📅 2025-05-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing visual prompt tuning (VPT) methods fail to model the semantic relevance and distributional characteristics between learnable prompts and image tokens. To address this, we propose Semantic-Guided Visual Prompt Tuning (SG-VPT), the first VPT framework to introduce a semantic-aware distribution alignment mechanism. SG-VPT employs metric learning to explicitly model semantic distribution relationships among prompts, image patches, and class tokens, thereby transforming prompts into semantic-sharing bridges. Our method comprises three key components: (1) semantic-driven prompt initialization and update, (2) distribution-aware prompt optimization, and (3) a lightweight fine-tuning paradigm with a frozen ViT backbone. Evaluated on multiple image classification and segmentation benchmarks, SG-VPT achieves consistent improvements—gaining 1.2–2.8% in Top-1 accuracy over state-of-the-art VPT methods—while introducing less than 0.1% additional parameters and accelerating training by 40%. Crucially, it significantly enhances the semantic mediation capability of visual prompts.

Technology Category

Application Category

📝 Abstract
Visual Prompt Tuning (VPT) has become a promising solution for Parameter-Efficient Fine-Tuning (PEFT) approach for Vision Transformer (ViT) models by partially fine-tuning learnable tokens while keeping most model parameters frozen. Recent research has explored modifying the connection structures of the prompts. However, the fundamental correlation and distribution between the prompts and image tokens remain unexplored. In this paper, we leverage metric learning techniques to investigate how the distribution of prompts affects fine-tuning performance. Specifically, we propose a novel framework, Distribution Aware Visual Prompt Tuning (DA-VPT), to guide the distributions of the prompts by learning the distance metric from their class-related semantic data. Our method demonstrates that the prompts can serve as an effective bridge to share semantic information between image patches and the class token. We extensively evaluated our approach on popular benchmarks in both recognition and segmentation tasks. The results demonstrate that our approach enables more effective and efficient fine-tuning of ViT models by leveraging semantic information to guide the learning of the prompts, leading to improved performance on various downstream vision tasks.
Problem

Research questions and friction points this paper is trying to address.

Explores prompt-image token correlation in Vision Transformers
Proposes DA-VPT to guide prompts using semantic data
Improves ViT fine-tuning efficiency and downstream task performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses metric learning for prompt distribution
Guides prompts with semantic class data
Improves ViT fine-tuning efficiency
🔎 Similar Papers
No similar papers found.