PreLoRA: Hybrid Pre-training of Vision Transformers with Full Training and Low-Rank Adapters

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational cost of pre-training large-scale Vision Transformers (ViTs), this paper proposes PreLoRA, a hybrid scheme that begins with full-parameter training, detects layer-wise partial convergence by monitoring the magnitude of weight updates, and then adaptively switches converged modules to Low-Rank Adaptation (LoRA). Its key innovations are a module-level rank-allocation strategy and a hyperparameter-driven phase-switching mechanism. Evaluated on ViT-Large, the method matches the accuracy of full-parameter training while reducing trainable parameters to 10% of the original, tripling training throughput, cutting average per-epoch training time by 1.5×, and lowering GPU memory consumption by 20%. It outperforms both static LoRA and full-parameter training on these efficiency metrics without sacrificing accuracy.
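The convergence-monitoring and phase-switching mechanism described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the relative-norm criterion, the window of recent updates, and the `threshold` hyperparameter are all assumptions.

```python
import numpy as np

def update_magnitude(prev_w, new_w):
    """Relative Frobenius norm of a layer's weight update between checkpoints."""
    return np.linalg.norm(new_w - prev_w) / (np.linalg.norm(prev_w) + 1e-12)

def should_switch(recent_magnitudes, threshold=0.01):
    """Flag a layer for the switch to LoRA once its recent update magnitudes
    all fall below `threshold` (a hypothetical user-defined hyperparameter;
    the paper's exact criterion may differ)."""
    return all(m < threshold for m in recent_magnitudes)

rng = np.random.default_rng(0)
w = rng.normal(size=(16, 16))
big_step = w + 0.5 * rng.normal(size=(16, 16))    # early training: large update
tiny_step = w + 1e-4 * rng.normal(size=(16, 16))  # near convergence: tiny update

keep_full = should_switch([update_magnitude(w, big_step)])   # stays in full training
go_lora = should_switch([update_magnitude(w, tiny_step)])    # switches to LoRA
```

In a real training loop, `update_magnitude` would be computed per module between periodic checkpoints, so different layers can cross the threshold at different times.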

📝 Abstract
Training large models ranging from millions to billions of parameters is highly resource-intensive, requiring significant time, compute, and memory. It is observed that most of the learning (the largest changes in weights) takes place in the earlier stage of the training loop. These changes stabilize as training continues, enabling them to be captured by matrices of low intrinsic rank. We therefore propose an approach that identifies such states of partial convergence and dynamically switches from full-parameter training to Low-Rank Adaptation (LoRA) on the ViT-Large model. We introduce a flexible scheme that leverages user-defined hyperparameters to determine the switching point and assigns each module layer a rank based on its level of convergence. Experimental results show that this approach preserves model accuracy while reducing the number of trainable parameters to 10% of the original, yielding a 3x improvement in throughput and a 1.5x reduction in average training time per epoch, while also reducing GPU memory consumption by 20%.
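The abstract's layer-specific rank assignment could be realized along these lines. The mapping, the `scale` normalizer, and the rank range below are illustrative assumptions, not values from the paper:

```python
def assign_rank(update_magnitude, min_rank=4, max_rank=64, scale=0.1):
    """Map a module's weight-update magnitude to a LoRA rank: modules that
    are still changing a lot get more capacity, near-converged modules get
    a small rank. All hyperparameters here are illustrative assumptions."""
    frac = min(update_magnitude / scale, 1.0)
    return max(min_rank, int(round(frac * max_rank)))

# Hypothetical per-module update magnitudes measured at the switching point.
ranks = {name: assign_rank(m) for name, m in
         {"attn.qkv": 0.20, "attn.proj": 0.05, "mlp.fc1": 0.001}.items()}
```

A monotone mapping like this is one simple way to honor the paper's observation that convergence is uneven across layers: the adapter budget concentrates where the weights are still moving.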
Problem

Research questions and friction points this paper is trying to address.

Reduces resource-intensive training of large vision transformers
Dynamically switches from full training to low-rank adaptation
Maintains accuracy while cutting parameters and improving efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid pre-training combines full training with LoRA
Dynamic switching based on partial convergence states
Layer-specific rank assignment reduces trainable parameters
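For intuition on how low-rank adapters shrink the trainable-parameter count, here is the standard LoRA bookkeeping for a single dense layer. The rank of 16 and the single-layer view are assumptions for illustration; ViT-Large uses 1024-dimensional hidden states, and the paper's per-module ranks are chosen dynamically:

```python
def full_params(d_in, d_out):
    """Trainable parameters of a dense weight matrix."""
    return d_in * d_out

def lora_params(d_in, d_out, r):
    """LoRA parameterizes the weight update as B @ A with A: (r, d_in) and
    B: (d_out, r), so only r * (d_in + d_out) parameters are trained."""
    return r * (d_in + d_out)

d = 1024  # ViT-Large hidden size
ratio = lora_params(d, d, 16) / full_params(d, d)  # rank 16 is an assumption
# A rank-16 adapter trains only a few percent of the dense layer's parameters.
```

With modest per-module ranks like this, an overall trainable-parameter budget of roughly 10% of the full model, as reported in the abstract, is plausible even after accounting for modules left in full training.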