Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy

📅 2025-07-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the limited generalization capability of pre-trained Vision Transformers (ViTs) during downstream fine-tuning, this paper proposes an Approximately Orthogonal Fine-Tuning (AOFT) strategy for low-rank adaptation. The method is grounded in the observation that ViT backbone weights exhibit an approximately orthogonal structure, which AOFT exploits by constructing near-orthogonal low-rank down/up-projection matrices from a single learnable vector. This design improves adapter generalization and training stability without introducing additional parameters or computational overhead, and integrates seamlessly with standard LoRA-style modules. Evaluated across a range of downstream image classification benchmarks, AOFT achieves competitive performance with few tunable parameters, empirically supporting the role of weight orthogonality in improving generalization.
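The near-orthogonality claim about backbone weights can be probed directly. The sketch below is an illustration, not the paper's code: it measures how far a matrix's rows deviate from mutual orthogonality via the off-diagonal entries of their Gram matrix. High-dimensional random rows, used here as a stand-in for real ViT weights, already sit close to orthogonal.

```python
import numpy as np

def orthogonality_deviation(W):
    """Mean absolute off-diagonal entry of the Gram matrix of W's
    normalized rows; a value near 0 means the rows are approximately
    mutually orthogonal."""
    rows = W / np.linalg.norm(W, axis=1, keepdims=True)
    gram = rows @ rows.T
    off = gram - np.diag(np.diag(gram))
    return np.abs(off).mean()

# Random 768-dim rows (the ViT-Base hidden size) are nearly orthogonal,
# mimicking the structure reported for pre-trained backbone weights.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 768))
print(orthogonality_deviation(W) < 0.1)  # True
```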

📝 Abstract
A prevalent approach in Parameter-Efficient Fine-Tuning (PEFT) of pre-trained Vision Transformers (ViT) involves freezing the majority of the backbone parameters and solely learning low-rank adaptation weight matrices to accommodate downstream tasks. These low-rank matrices are commonly derived through the multiplication structure of down-projection and up-projection matrices, exemplified by methods such as LoRA and Adapter. In this work, we observe an approximate orthogonality among any two row or column vectors within any weight matrix of the backbone parameters; however, this property is absent in the vectors of the down/up-projection matrices. Approximate orthogonality implies a reduction in the upper bound of the model's generalization error, signifying that the model possesses enhanced generalization capability. If the fine-tuned down/up-projection matrices were to exhibit this same property as the pre-trained backbone matrices, could the generalization capability of fine-tuned ViTs be further augmented? To address this question, we propose an Approximately Orthogonal Fine-Tuning (AOFT) strategy for representing the low-rank weight matrices. This strategy employs a single learnable vector to generate a set of approximately orthogonal vectors, which form the down/up-projection matrices, thereby aligning the properties of these matrices with those of the backbone. Extensive experimental results demonstrate that our method achieves competitive performance across a range of downstream image classification tasks, confirming the efficacy of the enhanced generalization capability embedded in the down/up-projection matrices.
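One simple way to realize "a single learnable vector generating a set of approximately orthogonal vectors" is a Householder reflection, whose columns are exactly orthonormal. The sketch below is a hypothetical illustration of that idea under assumed dimensions, not the paper's actual AOFT construction; `householder_projection` is an invented name.

```python
import numpy as np

def householder_projection(v, r):
    """Build a d x r projection whose columns are orthonormal, generated
    from the single vector v via a Householder reflection H = I - 2uu^T.
    (Hypothetical sketch; the paper's exact construction may differ.)"""
    d = v.shape[0]
    u = v / np.linalg.norm(v)
    H = np.eye(d) - 2.0 * np.outer(u, u)  # orthogonal d x d matrix
    return H[:, :r]  # first r columns form an orthonormal d x r frame

# One 768-dim learnable vector yields a rank-8 down-projection whose
# columns are mutually orthogonal, matching the backbone's property.
v = np.random.default_rng(0).standard_normal(768)
B = householder_projection(v, 8)
print(np.allclose(B.T @ B, np.eye(8)))  # True
```

In a LoRA-style module this `B` would replace the usual randomly initialized down-projection, with only `v` trained.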
Problem

Research questions and friction points this paper is trying to address.

Enhancing generalization in Vision Transformer fine-tuning
Addressing lack of orthogonality in low-rank adaptation matrices
Improving downstream task performance via orthogonal weight matrices
Innovation

Methods, ideas, or system contributions that make the work stand out.

Approximately Orthogonal Fine-Tuning for ViTs
Single vector generates orthogonal projection matrices
Enhances generalization in downstream image tasks
Yiting Yang — Xi’an University of Architecture and Technology
Hao Luo — Xi’an University of Architecture and Technology
Yuan Sun — University of Electronic Science and Technology of China
Qingsen Yan — Northwestern Polytechnical University (image processing, image fusion, continual learning)
Haokui Zhang — Northwestern Polytechnical University (approximate nearest neighbor search, neural architecture search, depth estimation, HSI classification)
Wei Dong — Xi’an University of Architecture and Technology
Guoqing Wang — University of Electronic Science and Technology of China
Peng Wang — University of Electronic Science and Technology of China
Yang Yang — University of Electronic Science and Technology of China
Hengtao Shen — Tongji University