Learning Rate Scaling across LoRA Ranks and Transfer to Full Finetuning

📅 2026-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
In LoRA fine-tuning, the optimal learning rate must be re-tuned for each adapter rank, because no unified scaling law exists and tuned learning rates transfer poorly to full-parameter fine-tuning. To resolve this, the authors propose μA, a theoretical framework grounded in Maximal-Update Parametrization (μP), which reveals two distinct scaling laws governing how the optimal LoRA learning rate depends on adapter rank. μA also establishes a principled mechanism for transferring learning rates from low-rank adaptation to full fine-tuning. Experiments across language, vision, multimodal, image-generation, and reinforcement-learning tasks show that μA significantly reduces hyperparameter-tuning costs and allows learning rates optimized for LoRA to be applied directly to full fine-tuning, substantially improving training efficiency.

📝 Abstract
Low-Rank Adaptation (LoRA) is a standard tool for parameter-efficient finetuning of large models. While it induces a small memory footprint, its training dynamics can be surprisingly complex as they depend on several hyperparameters such as initialization, adapter rank, and learning rate. In particular, it is unclear how the optimal learning rate scales with adapter rank, which forces practitioners to re-tune the learning rate whenever the rank is changed. In this paper, we introduce Maximal-Update Adaptation ($\mu$A), a theoretical framework that characterizes how the "optimal" learning rate should scale with model width and adapter rank to produce stable, non-vanishing feature updates under standard configurations. $\mu$A is inspired by the Maximal-Update Parametrization ($\mu$P) in pretraining. Our analysis leverages techniques from hyperparameter transfer and reveals that the optimal learning rate exhibits different scaling patterns depending on initialization and the LoRA scaling factor. Specifically, we identify two regimes: one where the optimal learning rate remains roughly invariant across ranks, and another where it scales inversely with rank. We further identify a configuration that allows learning rate transfer from LoRA to full finetuning, drastically reducing the cost of learning rate tuning for full finetuning. Experiments across language, vision, vision–language, image generation, and reinforcement learning tasks validate our scaling rules and show that learning rates tuned on LoRA transfer reliably to full finetuning.
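The two regimes named in the abstract can be sketched as a simple rule of thumb. The helper below is an illustration of the claimed scaling behavior only, not the paper's method or API; the function name, the `regime` parameter, and its string values are hypothetical, and which regime applies depends on the initialization and LoRA scaling factor as analyzed in the paper.

```python
def suggest_lora_lr(base_lr: float, base_rank: int, rank: int,
                    regime: str = "rank_invariant") -> float:
    """Extrapolate a learning rate tuned at base_rank to a new rank.

    regime="rank_invariant": the optimal LR stays roughly constant
        across ranks, so the tuned value is reused as-is.
    regime="inverse_rank": the optimal LR scales as 1/rank, so the
        tuned value is rescaled by base_rank / rank.
    (Illustrative sketch; names and regimes are assumptions, not the
    paper's implementation.)
    """
    if regime == "rank_invariant":
        return base_lr
    if regime == "inverse_rank":
        return base_lr * base_rank / rank
    raise ValueError(f"unknown regime: {regime!r}")
```

For example, a learning rate of 1e-3 tuned at rank 8 would be kept at 1e-3 for rank 32 in the invariant regime, but reduced to 2.5e-4 in the inverse-rank regime.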
Problem

Research questions and friction points this paper is trying to address.

LoRA
learning rate scaling
adapter rank
full finetuning
hyperparameter transfer
Innovation

Methods, ideas, or system contributions that make the work stand out.

LoRA
learning rate scaling
parameter-efficient finetuning
hyperparameter transfer
Maximal-Update Adaptation