Attention to Burstiness: Low-Rank Bilinear Prompt Tuning

📅 2025-06-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In Visual Prompt Tuning (VPT), the interaction between image patch embeddings and the Transformer's query/key projectors induces "burstiness": the patch embeddings and the key/query projectors exhibit Laplacian and hyper-Laplacian distributions, respectively, and these non-Gaussian values hinder prompt learning. This work identifies and characterizes that statistical pathology and proposes Bilinear Prompt Tuning (BPT): (i) a whitening transformation, derived jointly over patch embeddings and the key/query projectors, that de-correlates the data and equalizes variances toward a more Gaussian distribution; (ii) a bilinear interaction in which the whitening matrix multiplies the learnable prompt. A compact low-rank variant learns two smaller matrices whose product yields the final prompts, further reducing parameter count and accelerating convergence. Evaluated on benchmarks including CUB, BPT gains over 25 accuracy points versus prior VPT approaches while reducing both parameter count and computational overhead.

📝 Abstract
Visual Prompt Tuning (VPT) is a parameter-efficient fine-tuning technique that adapts a pre-trained vision Transformer (ViT) by learning a small set of parameters in the input space, known as prompts. In VPT, we uncover "burstiness" in the values arising from the interaction of image patch embeddings and the key and query projectors within the Transformer's self-attention module. Furthermore, the values of patch embeddings and the key and query projectors exhibit Laplacian and hyper-Laplacian distributions, respectively. Intuitively, these non-Gaussian distributions pose challenges for learning prompts. To address this, we propose whitening these data, de-correlating them and equalizing their variance toward more Gaussian before learning prompts. We derive the whitening matrix over random image patch embeddings and ViT's key and query projectors, and multiply it with the prompt to be learned in a bilinear manner. Surprisingly, this method significantly accelerates prompt tuning and boosts accuracy, e.g., >25 accuracy points on the CUB dataset; interestingly, it learns "bursty prompts". Extending the bilinear model, which is known to introduce burstiness, we present a compact, low-rank version by learning two smaller matrices whose multiplication yields the final prompts. We call the proposed methods Bilinear Prompt Tuning (BPT). Extensive experiments across multiple benchmark datasets demonstrate that BPT methods not only outperform various VPT methods but also reduce parameter count and computation overhead.
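The whitening-then-bilinear idea described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: it uses ZCA whitening over sample statistics as a stand-in for the matrix the authors derive over patch embeddings and the key/query projectors, and all variable names are hypothetical.

```python
import numpy as np

def whitening_matrix(X, eps=1e-5):
    """ZCA whitening matrix for rows of X (n_samples x d):
    de-correlates dimensions and equalizes their variance."""
    Xc = X - X.mean(axis=0, keepdims=True)
    cov = Xc.T @ Xc / (len(X) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    return eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T

rng = np.random.default_rng(0)
d = 16
# Stand-in for heavy-tailed, correlated values from the interaction of
# patch embeddings with the key/query projectors (Laplacian samples
# mixed through a random linear map).
X = rng.laplace(size=(1024, d)) @ rng.normal(size=(d, d))

W = whitening_matrix(X)
Xw = (X - X.mean(axis=0)) @ W
cov_w = Xw.T @ Xw / (len(Xw) - 1)   # ≈ identity after whitening

# Bilinear prompt: the whitening matrix multiplies the learnable
# prompt P, so the effective prompt fed to the ViT is P @ W.
P = rng.normal(size=(4, d))          # 4 learnable prompt tokens
effective_prompt = P @ W
```

Under this sketch, gradients flow only to `P`; the fixed whitening matrix reshapes the prompt's interaction with the (non-Gaussian) data.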
Problem

Research questions and friction points this paper is trying to address.

Address burstiness in Visual Prompt Tuning (VPT)
Whitening data to improve prompt learning efficiency
Reduce parameters and computation in bilinear prompt tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Whitening data for Gaussian distribution before prompt learning
Low-rank bilinear prompt tuning reduces parameters
Bursty prompts improve accuracy and tuning speed
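The low-rank variant's parameter saving is easy to make concrete. Below is a hedged sketch with illustrative sizes (50 prompt tokens, ViT-Base embedding dimension 768, rank 8; the actual rank used in the paper is not stated here): two small learnable factors replace one dense prompt matrix.

```python
import numpy as np

m, d, r = 50, 768, 8            # prompt tokens, embedding dim, rank (illustrative)
rng = np.random.default_rng(0)
U = rng.normal(size=(m, r))     # learnable factor 1
V = rng.normal(size=(r, d))     # learnable factor 2
prompts = U @ V                 # final m x d prompt matrix

full_params = m * d             # dense prompt: 38,400 parameters
lowrank_params = m * r + r * d  # factored: 6,544 parameters at rank 8
```

The product `U @ V` has the same shape as a full prompt matrix while training roughly 6x fewer parameters at these sizes; the rank trades expressiveness against parameter count.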
Yuzhu Wang
Zhejiang Lab
Manni Duan
Zhejiang Lab
Shu Kong
Texas A&M University
Computer Vision · Machine Learning