🤖 AI Summary
Existing Vision Transformer (ViT) adaptation methods typically require architectural modifications or additional parameters, incurring computational overhead and deployment complexity. Attention Projection Layer Adaptation (APLA) addresses this by identifying, through systematic analysis, the linear projection layer immediately following self-attention as the critical layer for adaptation. APLA fine-tunes only this layer, or even a randomly sampled subset of its weights, without altering the model architecture, adding parameters, or incurring inference-time overhead. Evaluated across 46 cross-domain datasets spanning classification, segmentation, and detection tasks, APLA consistently outperforms 17 state-of-the-art adaptation methods, including full fine-tuning, while reducing GPU memory consumption by up to 52.63% and training time by up to 43.0%.
📝 Abstract
Existing adaptation techniques typically require architectural modifications or added parameters, leading to high computational costs and complexity. We introduce Attention Projection Layer Adaptation (APLA), a simple approach to adapt vision transformers (ViTs) without altering the architecture or adding parameters. Through a systematic analysis, we find that the layer immediately after the attention mechanism is crucial for adaptation. By updating only this projection layer, or even just a random subset of this layer's weights, APLA achieves state-of-the-art performance while reducing GPU memory usage by up to 52.63% and training time by up to 43.0%, with no extra cost at inference. Across 46 datasets covering a variety of tasks including scene classification, medical imaging, satellite imaging, and fine-grained classification, APLA consistently outperforms 17 other leading adaptation methods, including full fine-tuning, on classification, segmentation, and detection tasks. The code is available at https://github.com/MoeinSorkhei/APLA.
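The selection rule described above can be sketched in a few lines: freeze everything except the projection layer directly after self-attention, optionally restricting updates to a random subset of its weight columns. This is a minimal illustration, not the authors' code; the `attn.proj` parameter names and the 768-dimensional sizes are assumptions borrowed from timm-style ViT naming.

```python
import random

def select_apla_params(param_ncols, subset_fraction=1.0, seed=0):
    """Return the parameters APLA would fine-tune.

    param_ncols: dict mapping parameter name -> number of weight columns.
    subset_fraction: fraction of columns of each projection layer to update
                     (1.0 = the whole layer, as in plain APLA).
    Only parameters whose name contains 'attn.proj' (a timm-style naming
    assumption) are selected; everything else stays frozen.
    """
    rng = random.Random(seed)
    trainable = {}
    for name, ncols in param_ncols.items():
        if "attn.proj" not in name:
            continue  # all other layers remain frozen
        k = max(1, int(ncols * subset_fraction))
        trainable[name] = sorted(rng.sample(range(ncols), k))
    return trainable

# Toy parameter table for a 2-block ViT (dimensions are illustrative).
params = {
    "blocks.0.attn.qkv.weight": 768,
    "blocks.0.attn.proj.weight": 768,
    "blocks.0.mlp.fc1.weight": 3072,
    "blocks.1.attn.proj.weight": 768,
    "head.weight": 768,
}
selected = select_apla_params(params, subset_fraction=0.25)
print(selected.keys())
```

In a real training loop the same rule would translate to setting `requires_grad = False` on every frozen parameter and masking gradients for the unselected columns, which is what yields the reported memory and training-time savings.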