🤖 AI Summary
Traditional Visual Prompt Tuning (VPT) employs global, static prompts, limiting adaptability to heterogeneous downstream datasets and impairing generalization. To address this, we propose ViaPT, a novel instance-aware visual prompt tuning framework. ViaPT introduces an instance-feature-driven dynamic prompt generation mechanism that adaptively fuses dataset-level priors with instance-level semantics. We show that VPT-Shallow and VPT-Deep emerge as boundary cases of ViaPT. Furthermore, ViaPT applies principal component analysis (PCA) to reduce prompt dimensionality, significantly decreasing the number of learnable parameters while preserving essential discriminative information. Extensive experiments across 34 diverse downstream datasets demonstrate that ViaPT consistently outperforms state-of-the-art methods, establishing a new prompt tuning paradigm that jointly improves efficiency, generalization, and interpretability without compromising performance.
📝 Abstract
Visual Prompt Tuning (VPT) has emerged as a parameter-efficient fine-tuning paradigm for vision transformers, with conventional approaches using dataset-level prompts that remain fixed across all input instances. We observe that this strategy yields sub-optimal performance due to the high variance within downstream datasets. To address this challenge, we propose Visual Instance-aware Prompt Tuning (ViaPT), which generates an instance-aware prompt from each individual input and fuses it with the dataset-level prompts, leveraging Principal Component Analysis (PCA) to retain the important prompting information. Moreover, from a conceptual perspective we show that VPT-Shallow and VPT-Deep are two corner cases of our framework: both fail to effectively capture instance-specific information, while random dimension reduction of the prompts only yields performance between these two extremes. ViaPT overcomes these limitations by balancing dataset-level and instance-level knowledge, while reducing the number of learnable parameters compared to VPT-Deep. Extensive experiments across 34 diverse datasets demonstrate that our method consistently outperforms state-of-the-art baselines, establishing a new paradigm for analyzing and optimizing visual prompts for vision transformers.
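To make the mechanism in the abstract concrete, here is a minimal NumPy sketch of the core idea: an instance-aware prompt is generated from the input's features, fused with a shared dataset-level prompt, and PCA keeps only the top principal components of the result. All dimensions, weights, and function names below are illustrative assumptions, not the paper's actual architecture or hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration (not taken from the paper)
d_feat, d_prompt, n_tokens, k = 64, 32, 8, 4

def generate_instance_prompt(features, W):
    """Map pooled instance features to an instance-aware prompt."""
    pooled = features.mean(axis=0)                      # (d_feat,)
    return (W @ pooled).reshape(n_tokens, d_prompt)    # (n_tokens, d_prompt)

def pca_reduce(prompt, k):
    """Project the prompt tokens onto their top-k principal components."""
    centered = prompt - prompt.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:k]                                 # (k, d_prompt)
    return centered @ components.T                      # (n_tokens, k)

# Dataset-level prompt shared across inputs; features vary per instance
dataset_prompt = rng.normal(size=(n_tokens, d_prompt))
W = 0.01 * rng.normal(size=(n_tokens * d_prompt, d_feat))
features = rng.normal(size=(196, d_feat))               # e.g. ViT patch embeddings

instance_prompt = generate_instance_prompt(features, W)
fused = 0.5 * dataset_prompt + 0.5 * instance_prompt    # simple convex fusion
reduced = pca_reduce(fused, k)
print(reduced.shape)                                    # (8, 4)
```

In the actual method, the projection weights and prompts would be learned end-to-end and the reduced prompts prepended to the transformer's token sequence; the equal-weight fusion here is just one plausible choice of combination.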