🤖 AI Summary
To address the challenge of efficiently fine-tuning large vision models under resource constraints, this paper proposes Serial LoRA, a novel low-rank adaptation method tailored to Vision Transformer (ViT) architectures. Its core innovation is serially composing a shared low-rank matrix with the attention modules, exploiting the intrinsic commonality among adaptation parameters to achieve structured parameter compression. Compared with standard LoRA, Serial LoRA cuts trainable parameters by 75%, markedly lowering storage and GPU memory consumption while maintaining comparable downstream performance across multiple ViT backbones. The method combines low-rank decomposition, attention-module reconstruction, and parameter sharing across heads and layers, without adding inference latency. Serial LoRA thus offers a lightweight, general-purpose, and high-performance paradigm for parameter-efficient fine-tuning (PEFT), well suited to edge-device deployment and large-scale applications.
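The claimed 75% reduction follows from simple parameter counting, assuming standard LoRA attaches a separate adapter pair to each of the four attention projections (Q, K, V, O) while Serial LoRA trains a single shared pair. The dimensions below are illustrative, not taken from the paper:

```python
def lora_params(d: int, r: int, n_proj: int = 4) -> int:
    """Trainable params for standard LoRA: one (A, B) pair per projection.

    Each pair costs d*r (down-projection) + r*d (up-projection).
    n_proj = 4 assumes adapters on Q, K, V, and O.
    """
    return n_proj * 2 * d * r

def serial_lora_params(d: int, r: int) -> int:
    """Trainable params for Serial LoRA: a single shared (A, B) pair
    serially composed with the attention block."""
    return 2 * d * r

# Illustrative ViT-Base-like dimensions: hidden size 768, rank 4.
d, r = 768, 4
standard = lora_params(d, r)          # 24576
serial = serial_lora_params(d, r)     # 6144
print(serial / standard)              # 0.25 -> 1/4 of LoRA's parameters
```

The 1/4 ratio is independent of the hidden size and rank; it comes purely from sharing one adapter pair across the four per-projection adapters.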
📝 Abstract
Fine-tuning large pre-trained vision foundation models in a parameter-efficient manner is critical for downstream vision tasks, given the practical constraints on computational and storage costs. Low-rank adaptation (LoRA) is a well-established technique in this domain, achieving impressive efficiency by restricting the parameter updates to a low-rank form. However, devising more advanced low-rank adaptation methods that further reduce parameter and memory requirements remains a significant challenge in resource-constrained application scenarios. In this study, we build on the commonly used vision transformer and propose Serial LoRA, a novel LoRA variant that introduces a shared low-rank matrix serially composed with the attention mechanism. This design extracts the underlying commonality of the adaptation parameters, significantly reducing redundancy. Notably, Serial LoRA uses only 1/4 of the parameters of LoRA yet achieves comparable performance in most cases. We conduct extensive experiments on a range of transformer-based vision foundation models, and the results confirm the consistent superiority of our method.
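To make the contrast concrete, here is one illustrative reading of the serial composition; the symbols and the exact placement of the shared factor are assumptions, and the paper's actual equations may differ. Standard LoRA adds an independent low-rank update in parallel to each frozen projection, whereas Serial LoRA composes a single shared low-rank factor serially with the attention computation:

```latex
% Standard LoRA: an independent pair (B_i, A_i) per frozen projection W_i
% (e.g. i ranges over Q, K, V, O), added in parallel:
h_i = W_i x + B_i A_i x, \qquad
B_i \in \mathbb{R}^{d \times r},\; A_i \in \mathbb{R}^{r \times d},\; r \ll d

% Serial LoRA (illustrative): one shared pair (B, A) composed serially,
% so every projection reuses the same low-rank adaptation:
h_i = W_i \left( I + B A \right) x
```

Under this reading, the shared factor \((I + BA)\) can be folded into the frozen weights after fine-tuning, which is consistent with the summary's claim of no additional inference latency.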