🤖 AI Summary
To address the excessive GPU memory consumption of orthogonal fine-tuning, which stems from storing the intermediate activations of multiple full-dimensional sparse matrices, this paper proposes a memory-efficient orthogonal fine-tuning method with principal subspace adaptation. Methodologically, it (i) establishes, for the first time, theoretical conditions under which low-rank orthogonal transformations preserve hyperspherical energy; (ii) restricts the orthogonal constraints to the principal subspace spanned by the top-r components of a singular value decomposition (SVD), drastically reducing both parameter and activation memory footprints; and (iii) introduces learnable diagonal scaling vectors that relax strict orthogonality to improve task generalization. Extensive experiments on 37 NLP and computer vision tasks with four large language/vision models demonstrate that MOFT reduces peak GPU memory usage by up to 62% while matching or improving accuracy, outperforming mainstream parameter-efficient fine-tuning (PEFT) baselines including LoRA and AdaLoRA.
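The core idea, rotating a weight matrix only inside its top-r SVD subspace so that singular values (and hence pairwise angular structure) are untouched, can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation; the function names and the Cayley parametrization of the orthogonal factor are assumptions for the sketch.

```python
import numpy as np

def cayley(skew):
    """Cayley map: a skew-symmetric r x r matrix -> an orthogonal r x r matrix."""
    I = np.eye(skew.shape[0])
    return np.linalg.solve(I + skew, I - skew)

def principal_subspace_orthogonal_update(W, skew_params, rank):
    """Rotate only the top-`rank` left singular directions of W.

    The residual (non-principal) subspace is left untouched, so the
    update needs only an r x r parameter block instead of a full
    d x d orthogonal matrix.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank]                # principal left singular vectors
    R = cayley(skew_params)          # r x r orthogonal rotation (hypothetical parametrization)
    # W_new = W + U_r (R - I) S_r V_r^T : replaces the principal part
    # U_r S_r V_r^T with its rotated counterpart (U_r R) S_r V_r^T.
    return W + (U_r @ R - U_r) @ np.diag(S[:rank]) @ Vt[:rank]
```

Because `U_r @ R` still has orthonormal columns orthogonal to the residual subspace, the updated matrix has exactly the same singular values as `W`, which is the memory-cheap analogue of the energy-preservation property the summary describes.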
📝 Abstract
Driven by the relentless growth in model parameters, which renders full fine-tuning prohibitively expensive for large-scale deployment, parameter-efficient fine-tuning (PEFT) has emerged as a crucial approach for rapidly adapting large models to a wide range of downstream tasks. Among the PEFT family, orthogonal fine-tuning and its variants have demonstrated remarkable performance by preserving hyperspherical energy, which encodes pairwise angular similarity between neurons. However, these methods are inherently memory-inefficient due to the need to store intermediate activations from multiple full-dimensional sparse matrices. To address this limitation, we propose Memory-efficient Orthogonal Fine-Tuning (MOFT) with principal subspace adaptation. Specifically, we first establish a theoretical condition under which orthogonal transformations within a low-rank subspace preserve hyperspherical energy. Based on this insight, we constrain orthogonal fine-tuning to the principal subspace defined by the top-r components obtained through singular value decomposition and impose an additional constraint on the projection matrix to satisfy the preservation condition. To enhance MOFT's flexibility across tasks, we relax strict orthogonality by introducing two learnable scaling vectors. Extensive experiments on 37 diverse tasks and four models across NLP and CV demonstrate that MOFT consistently outperforms key baselines while significantly reducing the memory footprint of orthogonal fine-tuning.
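The abstract's final ingredient, relaxing strict orthogonality with two learnable scaling vectors, admits a natural reading: wrap the orthogonal core between two diagonal scaling matrices. The sketch below assumes this reading (the factorization and names are hypothetical, not taken from the paper); with both vectors initialized to ones it reduces exactly to the pure orthogonal transform.

```python
import numpy as np

def relaxed_transform(skew, s_left, s_right):
    """diag(s_left) @ R @ diag(s_right), where R is orthogonal via the Cayley map.

    s_left and s_right are the two learnable scaling vectors; setting
    both to all-ones recovers a strictly orthogonal transform.
    """
    I = np.eye(skew.shape[0])
    R = np.linalg.solve(I + skew, I - skew)   # Cayley map -> orthogonal R
    return np.diag(s_left) @ R @ np.diag(s_right)
```

At initialization (`s_left = s_right = 1`) hyperspherical energy is preserved exactly, and training then trades a controlled amount of that invariance for extra per-task flexibility.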