Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy

📅 2025-07-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the limited generalization capability of pre-trained Vision Transformers (ViTs) during downstream fine-tuning, this paper proposes an Approximately Orthogonal Fine-Tuning (AOFT) strategy for low-rank adaptation. The method is grounded in the observation that ViT backbone weights exhibit an approximately orthogonal structure, which AOFT exploits by constructing near-orthogonal low-rank down/up-projection matrices from a single learnable vector. This design improves adapter generalization and training stability without introducing additional parameters or computational overhead, and integrates seamlessly with standard LoRA-style modules. Evaluated across a range of downstream image classification benchmarks, AOFT achieves competitive performance with few tunable parameters, empirically supporting the role of weight orthogonality in improving generalization.
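The near-orthogonality claim about backbone weights can be probed directly. The sketch below is an illustration, not the paper's code: it measures how far a matrix's rows deviate from mutual orthogonality via the off-diagonal entries of their Gram matrix. High-dimensional random rows, used here as a stand-in for real ViT weights, already sit close to orthogonal.

```python
import numpy as np

def orthogonality_deviation(W):
    """Mean absolute off-diagonal entry of the Gram matrix of W's
    normalized rows; a value near 0 means the rows are approximately
    mutually orthogonal."""
    rows = W / np.linalg.norm(W, axis=1, keepdims=True)
    gram = rows @ rows.T
    off = gram - np.diag(np.diag(gram))
    return np.abs(off).mean()

# Random 768-dim rows (the ViT-Base hidden size) are nearly orthogonal,
# mimicking the structure reported for pre-trained backbone weights.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 768))
print(orthogonality_deviation(W) < 0.1)  # True
```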

📝 Abstract
A prevalent approach in Parameter-Efficient Fine-Tuning (PEFT) of pre-trained Vision Transformers (ViT) involves freezing the majority of the backbone parameters and solely learning low-rank adaptation weight matrices to accommodate downstream tasks. These low-rank matrices are commonly derived through the multiplication structure of down-projection and up-projection matrices, exemplified by methods such as LoRA and Adapter. In this work, we observe an approximate orthogonality among any two row or column vectors within any weight matrix of the backbone parameters; however, this property is absent in the vectors of the down/up-projection matrices. Approximate orthogonality implies a reduction in the upper bound of the model's generalization error, signifying that the model possesses enhanced generalization capability. If the fine-tuned down/up-projection matrices were to exhibit this same property as the pre-trained backbone matrices, could the generalization capability of fine-tuned ViTs be further augmented? To address this question, we propose an Approximately Orthogonal Fine-Tuning (AOFT) strategy for representing the low-rank weight matrices. This strategy employs a single learnable vector to generate a set of approximately orthogonal vectors, which form the down/up-projection matrices, thereby aligning the properties of these matrices with those of the backbone. Extensive experimental results demonstrate that our method achieves competitive performance across a range of downstream image classification tasks, confirming the efficacy of the enhanced generalization capability embedded in the down/up-projection matrices.
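One simple way to realize "a single learnable vector generating a set of approximately orthogonal vectors" is a Householder reflection, whose columns are exactly orthonormal. The sketch below is a hypothetical illustration of that idea under assumed dimensions, not the paper's actual AOFT construction; `householder_projection` is an invented name.

```python
import numpy as np

def householder_projection(v, r):
    """Build a d x r projection whose columns are orthonormal, generated
    from the single vector v via a Householder reflection H = I - 2uu^T.
    (Hypothetical sketch; the paper's exact construction may differ.)"""
    d = v.shape[0]
    u = v / np.linalg.norm(v)
    H = np.eye(d) - 2.0 * np.outer(u, u)  # orthogonal d x d matrix
    return H[:, :r]  # first r columns form an orthonormal d x r frame

# One 768-dim learnable vector yields a rank-8 down-projection whose
# columns are mutually orthogonal, matching the backbone's property.
v = np.random.default_rng(0).standard_normal(768)
B = householder_projection(v, 8)
print(np.allclose(B.T @ B, np.eye(8)))  # True
```

In a LoRA-style module this `B` would replace the usual randomly initialized down-projection, with only `v` trained.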
Problem

Research questions and friction points this paper is trying to address.

Enhancing generalization in Vision Transformer fine-tuning
Addressing lack of orthogonality in low-rank adaptation matrices
Improving downstream task performance via orthogonal weight matrices
Innovation

Methods, ideas, or system contributions that make the work stand out.

Approximately Orthogonal Fine-Tuning for ViTs
Single vector generates orthogonal projection matrices
Enhances generalization in downstream image tasks
Yiting Yang — Xi’an University of Architecture and Technology
Hao Luo — Xi’an University of Architecture and Technology
Yuan Sun — University of Electronic Science and Technology of China
Qingsen Yan — Northwestern Polytechnical University (image processing, image fusion, continual learning)
Haokui Zhang — Northwestern Polytechnical University (approximate nearest neighbor search, neural architecture search, depth estimation, HSI classification)
Wei Dong — Xi’an University of Architecture and Technology
Guoqing Wang — University of Electronic Science and Technology of China
Peng Wang — University of Electronic Science and Technology of China
Yang Yang — University of Electronic Science and Technology of China
Hengtao Shen — Tongji University