🤖 AI Summary
In Visual Prompt Tuning (VPT), the interaction between image patch embeddings and the Transformer's key and query projectors exhibits "burstiness": the values follow Laplacian and hyper-Laplacian, rather than Gaussian, distributions, which hinders prompt learning. This work uncovers and characterizes that statistical pathology and proposes Bilinear Prompt Tuning (BPT): a whitening matrix, derived from random patch embeddings and the ViT's key and query projectors, de-correlates the data and equalizes their variances, and is multiplied with the learnable prompt in a bilinear manner. A compact low-rank variant learns two smaller matrices whose product yields the final prompts, further reducing parameter count and accelerating convergence. On benchmarks including CUB, BPT gains over 25 accuracy points versus prior VPT approaches while reducing both parameters and computational overhead.
📝 Abstract
Visual Prompt Tuning (VPT) is a parameter-efficient fine-tuning technique that adapts a pre-trained vision Transformer (ViT) by learning a small set of parameters in the input space, known as prompts. In VPT, we uncover ``burstiness'' in the values arising from the interaction of image patch embeddings and the key and query projectors within the Transformer's self-attention module. Furthermore, the values of patch embeddings and the key and query projectors exhibit Laplacian and hyper-Laplacian distributions, respectively. Intuitively, these non-Gaussian distributions pose challenges for learning prompts. To address this, we propose whitening these data, de-correlating them and equalizing their variances to make them more Gaussian, before learning prompts. We derive the whitening matrix over random image patch embeddings and ViT's key and query projectors, and multiply it with the prompt to be learned in a bilinear manner. Surprisingly, this method significantly accelerates prompt tuning and boosts accuracy, e.g., $>$25 accuracy points on the CUB dataset; interestingly, it learns ``bursty prompts''. Extending the bilinear model, which is known to introduce burstiness, we present a compact, low-rank version by learning two smaller matrices whose multiplication yields the final prompts. We call the proposed methods Bilinear Prompt Tuning (BPT). Extensive experiments across multiple benchmark datasets demonstrate that BPT methods not only outperform various VPT methods but also reduce parameter count and computation overhead.
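The whitening-then-bilinear idea above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the ZCA form of the whitening matrix, the synthetic data, and all shapes (embedding dimension, rank, prompt count) are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, r, n_prompts = 64, 2048, 4, 8  # illustrative sizes, not the paper's

# Synthetic stand-in for patch embeddings passed through key/query
# projectors: correlated data with unequal variances.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
L = Q @ np.diag(rng.uniform(0.5, 2.0, size=d))
X = rng.standard_normal((n, d)) @ L.T

# ZCA whitening matrix: de-correlates dimensions and equalizes variances.
cov = np.cov(X, rowvar=False)
eigval, eigvec = np.linalg.eigh(cov)
W = eigvec @ np.diag(1.0 / np.sqrt(eigval + 1e-8)) @ eigvec.T

# Low-rank bilinear prompt: two small learnable factors whose product,
# multiplied by the whitening matrix, yields the final prompts.
A = 0.02 * rng.standard_normal((n_prompts, r))
B = 0.02 * rng.standard_normal((r, d))
prompts = A @ B @ W  # (n_prompts, d), ready to prepend to the patch tokens

# Sanity check: whitened data has (near-)identity covariance.
cov_white = np.cov(X @ W, rowvar=False)
print(np.allclose(cov_white, np.eye(d), atol=1e-3))  # True
```

In an actual ViT, `W` would be computed from real patch embeddings and the pre-trained key/query weights, and only the small factors `A` and `B` would be trained, which is what keeps the method parameter-efficient.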