🤖 AI Summary
This work identifies that random initialization in Low-Rank Adaptation (LoRA) often leads to convergence toward suboptimal low-rank solutions, degrading generalization. To address this, we propose High-Rank Preheating (HRP): first performing brief fine-tuning with a high-rank LoRA, then extracting its dominant singular vectors via SVD to initialize the low-rank adapter, yielding a well-informed directional initialization. This work is the first to formally prove that random initialization induces this convergence bias, and it establishes a task-agnostic, adaptive initialization paradigm that retains the convergence guarantees of high-rank LoRA while preserving the generalization strengths of low-rank LoRA. Extensive experiments across multiple architectures (e.g., LLaMA, BERT) and tasks (e.g., GLUE, instruction tuning) demonstrate that HRP consistently outperforms existing initialization strategies, including standard random and SVD-based baselines, and achieves performance on par with full-parameter fine-tuning, without requiring task-specific priors or additional inference overhead.
📝 Abstract
This paper studies the crucial impact of initialization on the convergence properties of Low-Rank Adaptation (LoRA). We theoretically demonstrate that random initialization, a widely used scheme, is likely to drive LoRA toward arbitrary low-rank solutions rather than the best low-rank solution. While this issue can be mitigated by adjusting the initialization toward a well-informed direction, doing so requires prior knowledge of the target, which is typically unavailable in real-world scenarios. To approximate this well-informed initial direction, we propose High-Rank Preheating (HRP), which fine-tunes a high-rank LoRA for a few steps and uses the singular value decomposition of the preheated result as a superior initialization. HRP initialization is theoretically supported to combine the convergence strengths of high-rank LoRA with the generalization strengths of low-rank LoRA. Extensive experiments demonstrate that HRP significantly enhances LoRA's effectiveness across various models and tasks, achieving performance comparable to full-parameter fine-tuning and outperforming other initialization strategies.
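The core mechanical step described above (truncating the preheated high-rank update to the target rank via SVD) can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' implementation: `delta_w_high` stands for the product of the high-rank adapter matrices after the brief preheating phase, and the square-root split of the singular values between the two factors is one common (assumed) convention for balancing the low-rank pair.

```python
import numpy as np

def hrp_init(delta_w_high: np.ndarray, r: int):
    """Initialize a rank-r LoRA pair (B, A) from a preheated high-rank update.

    delta_w_high: the accumulated high-rank LoRA update (B_high @ A_high)
                  after a few preheating steps -- hypothetical placeholder.
    Returns (B_init, A_init) with B_init @ A_init equal to the best
    rank-r approximation of delta_w_high (Eckart-Young).
    """
    U, S, Vt = np.linalg.svd(delta_w_high, full_matrices=False)
    # Keep the top-r singular directions; split each singular value
    # evenly (via sqrt) between the two low-rank factors.
    sqrt_s = np.sqrt(S[:r])
    B_init = U[:, :r] * sqrt_s            # shape (d_out, r)
    A_init = sqrt_s[:, None] * Vt[:r]     # shape (r, d_in)
    return B_init, A_init
```

By construction, `B_init @ A_init` equals the optimal rank-`r` approximation of the preheated update in Frobenius norm, which is what makes the extracted directions a well-informed starting point for subsequent low-rank fine-tuning.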