PreLoRA: Hybrid Pre-training of Vision Transformers with Full Training and Low-Rank Adapters

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational cost of pre-training large-scale Vision Transformers (ViTs), this paper proposes PreLoRA, a hybrid scheme that begins with full-parameter training, detects layer-wise partial convergence by monitoring the magnitude of weight updates, and then adaptively switches converged modules to Low-Rank Adaptation (LoRA). Its key innovations are a module-level rank-allocation strategy and a hyperparameter-driven phase-switching mechanism. Evaluated on ViT-Large, the method matches the accuracy of full-parameter training while reducing trainable parameters to 10% of the original, tripling training throughput, cutting average per-epoch training time by 1.5×, and lowering GPU memory consumption by 20%. It outperforms both static LoRA and full-parameter training on these efficiency metrics without sacrificing accuracy.
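The convergence-monitoring and phase-switching mechanism described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the relative-norm criterion, the window of recent updates, and the `threshold` hyperparameter are all assumptions.

```python
import numpy as np

def update_magnitude(prev_w, new_w):
    """Relative Frobenius norm of a layer's weight update between checkpoints."""
    return np.linalg.norm(new_w - prev_w) / (np.linalg.norm(prev_w) + 1e-12)

def should_switch(recent_magnitudes, threshold=0.01):
    """Flag a layer for the switch to LoRA once its recent update magnitudes
    all fall below `threshold` (a hypothetical user-defined hyperparameter;
    the paper's exact criterion may differ)."""
    return all(m < threshold for m in recent_magnitudes)

rng = np.random.default_rng(0)
w = rng.normal(size=(16, 16))
big_step = w + 0.5 * rng.normal(size=(16, 16))    # early training: large update
tiny_step = w + 1e-4 * rng.normal(size=(16, 16))  # near convergence: tiny update

keep_full = should_switch([update_magnitude(w, big_step)])   # stays in full training
go_lora = should_switch([update_magnitude(w, tiny_step)])    # switches to LoRA
```

In a real training loop, `update_magnitude` would be computed per module between periodic checkpoints, so different layers can cross the threshold at different times.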

📝 Abstract
Training large models ranging from millions to billions of parameters is highly resource-intensive, requiring significant time, compute, and memory. It is observed that most of the learning (the largest changes in weights) takes place in the earlier stage of the training loop. These changes stabilize as training continues, enabling them to be captured by matrices of low intrinsic rank. We therefore propose an approach that identifies such states of partial convergence and dynamically switches from full-parameter training to Low-Rank Adaptation (LoRA) on the ViT-Large model. We introduce a flexible scheme that leverages user-defined hyperparameters to determine the switching point and assigns each module layer a rank based on its level of convergence. Experimental results show that this approach preserves model accuracy while reducing the number of trainable parameters to 10% of the original, yielding a 3x improvement in throughput and a 1.5x reduction in average training time per epoch, while also reducing GPU memory consumption by 20%.
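The abstract's layer-specific rank assignment could be realized along these lines. The mapping, the `scale` normalizer, and the rank range below are illustrative assumptions, not values from the paper:

```python
def assign_rank(update_magnitude, min_rank=4, max_rank=64, scale=0.1):
    """Map a module's weight-update magnitude to a LoRA rank: modules that
    are still changing a lot get more capacity, near-converged modules get
    a small rank. All hyperparameters here are illustrative assumptions."""
    frac = min(update_magnitude / scale, 1.0)
    return max(min_rank, int(round(frac * max_rank)))

# Hypothetical per-module update magnitudes measured at the switching point.
ranks = {name: assign_rank(m) for name, m in
         {"attn.qkv": 0.20, "attn.proj": 0.05, "mlp.fc1": 0.001}.items()}
```

A monotone mapping like this is one simple way to honor the paper's observation that convergence is uneven across layers: the adapter budget concentrates where the weights are still moving.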
Problem

Research questions and friction points this paper is trying to address.

Reduces resource-intensive training of large vision transformers
Dynamically switches from full training to low-rank adaptation
Maintains accuracy while cutting parameters and improving efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid pre-training combines full training with LoRA
Dynamic switching based on partial convergence states
Layer-specific rank assignment reduces trainable parameters
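For intuition on how low-rank adapters shrink the trainable-parameter count, here is the standard LoRA bookkeeping for a single dense layer. The rank of 16 and the single-layer view are assumptions for illustration; ViT-Large uses 1024-dimensional hidden states, and the paper's per-module ranks are chosen dynamically:

```python
def full_params(d_in, d_out):
    """Trainable parameters of a dense weight matrix."""
    return d_in * d_out

def lora_params(d_in, d_out, r):
    """LoRA parameterizes the weight update as B @ A with A: (r, d_in) and
    B: (d_out, r), so only r * (d_in + d_out) parameters are trained."""
    return r * (d_in + d_out)

d = 1024  # ViT-Large hidden size
ratio = lora_params(d, d, 16) / full_params(d, d)  # rank 16 is an assumption
# A rank-16 adapter trains only a few percent of the dense layer's parameters.
```

With modest per-module ranks like this, an overall trainable-parameter budget of roughly 10% of the full model, as reported in the abstract, is plausible even after accounting for modules left in full training.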