🤖 AI Summary
To address the excessive GPU memory consumption of orthogonal fine-tuning, which stems from storing the intermediate activations of multiple full-dimensional sparse matrices, this paper proposes a memory-efficient orthogonal fine-tuning method with principal subspace adaptation. Methodologically, it (i) establishes, for the first time, theoretical conditions under which low-rank orthogonal transformations preserve hyperspherical energy; (ii) restricts the orthogonal constraints to the principal subspace spanned by the top-r components of a singular value decomposition (SVD), drastically reducing both parameter and activation memory footprints; and (iii) introduces learnable diagonal scaling vectors that relax strict orthogonality to improve task generalization. Extensive experiments on 37 NLP and computer vision tasks with four large language/vision models demonstrate that MOFT reduces peak GPU memory usage by up to 62% while matching or improving accuracy, outperforming mainstream parameter-efficient fine-tuning (PEFT) baselines including LoRA and AdaLoRA.
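The core idea, rotating a weight matrix only inside its top-r SVD subspace so that singular values (and hence pairwise angular structure) are untouched, can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation; the function names and the Cayley parametrization of the orthogonal factor are assumptions for the sketch.

```python
import numpy as np

def cayley(skew):
    """Cayley map: a skew-symmetric r x r matrix -> an orthogonal r x r matrix."""
    I = np.eye(skew.shape[0])
    return np.linalg.solve(I + skew, I - skew)

def principal_subspace_orthogonal_update(W, skew_params, rank):
    """Rotate only the top-`rank` left singular directions of W.

    The residual (non-principal) subspace is left untouched, so the
    update needs only an r x r parameter block instead of a full
    d x d orthogonal matrix.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank]                # principal left singular vectors
    R = cayley(skew_params)          # r x r orthogonal rotation (hypothetical parametrization)
    # W_new = W + U_r (R - I) S_r V_r^T : replaces the principal part
    # U_r S_r V_r^T with its rotated counterpart (U_r R) S_r V_r^T.
    return W + (U_r @ R - U_r) @ np.diag(S[:rank]) @ Vt[:rank]
```

Because `U_r @ R` still has orthonormal columns orthogonal to the residual subspace, the updated matrix has exactly the same singular values as `W`, which is the memory-cheap analogue of the energy-preservation property the summary describes.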
📝 Abstract
Driven by the relentless growth in model parameters, which renders full fine-tuning prohibitively expensive for large-scale deployment, parameter-efficient fine-tuning (PEFT) has emerged as a crucial approach for rapidly adapting large models to a wide range of downstream tasks. Among the PEFT family, orthogonal fine-tuning and its variants have demonstrated remarkable performance by preserving hyperspherical energy, which encodes pairwise angular similarity between neurons. However, these methods are inherently memory-inefficient due to the need to store intermediate activations from multiple full-dimensional sparse matrices. To address this limitation, we propose Memory-efficient Orthogonal Fine-Tuning (MOFT) with principal subspace adaptation. Specifically, we first establish a theoretical condition under which orthogonal transformations within a low-rank subspace preserve hyperspherical energy. Based on this insight, we constrain orthogonal fine-tuning to the principal subspace defined by the top-r components obtained through singular value decomposition and impose an additional constraint on the projection matrix to satisfy the preservation condition. To enhance MOFT's flexibility across tasks, we relax strict orthogonality by introducing two learnable scaling vectors. Extensive experiments on 37 diverse tasks and four models across NLP and CV demonstrate that MOFT consistently outperforms key baselines while significantly reducing the memory footprint of orthogonal fine-tuning.
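The abstract's final ingredient, relaxing strict orthogonality with two learnable scaling vectors, admits a natural reading: wrap the orthogonal core between two diagonal scaling matrices. The sketch below assumes this reading (the factorization and names are hypothetical, not taken from the paper); with both vectors initialized to ones it reduces exactly to the pure orthogonal transform.

```python
import numpy as np

def relaxed_transform(skew, s_left, s_right):
    """diag(s_left) @ R @ diag(s_right), where R is orthogonal via the Cayley map.

    s_left and s_right are the two learnable scaling vectors; setting
    both to all-ones recovers a strictly orthogonal transform.
    """
    I = np.eye(skew.shape[0])
    R = np.linalg.solve(I + skew, I - skew)   # Cayley map -> orthogonal R
    return np.diag(s_left) @ R @ np.diag(s_right)
```

At initialization (`s_left = s_right = 1`) hyperspherical energy is preserved exactly, and training then trades a controlled amount of that invariance for extra per-task flexibility.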