🤖 AI Summary
While LoRA is parameter-efficient, flat solutions in its low-rank optimization subspace may still correspond to sharp directions in the full parameter space, harming generalization. Method: This paper introduces Flat-LoRA, the first approach to explicitly incorporate a loss-surface flatness objective into LoRA training. It designs a lightweight random weight perturbation scheme grounded in a Bayesian expected loss, avoiding the overhead of sharpness-aware minimization (SAM) and eliminating the need for extra backpropagation or Hessian computation; the perturbation and the low-rank decomposition are optimized jointly. Results: Flat-LoRA consistently improves generalization across diverse NLP and image classification tasks and architectures, with training cost comparable to standard LoRA. Its core contribution is establishing a link between LoRA optimization and flatness in the full parameter space, yielding a gradient-free, perturbation-based PEFT method that achieves both efficiency and strong generalization.
📄 Abstract
Fine-tuning large-scale pre-trained models is prohibitively expensive in terms of computational and memory costs. Low-Rank Adaptation (LoRA), a popular Parameter-Efficient Fine-Tuning (PEFT) method, provides an efficient way to fine-tune models by optimizing only a low-rank matrix. Despite recent progress in improving LoRA's performance, the connection between the LoRA optimization space and the original full parameter space is often overlooked. A solution that appears flat in the LoRA space may still contain sharp directions in the full parameter space, potentially harming generalization performance. In this paper, we propose Flat-LoRA, an efficient approach that seeks a low-rank adaptation located in a flat region of the full parameter space. Instead of relying on the well-established sharpness-aware minimization approach, which can incur significant computational and memory burdens, we utilize random weight perturbation with a Bayesian expectation loss objective to maintain training efficiency, and we design a refined perturbation generation strategy for improved performance. Experiments on natural language processing and image classification tasks with various architectures demonstrate the effectiveness of our approach.
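The core idea above can be sketched in a few lines of PyTorch. This is a minimal illustration under stated assumptions, not the paper's exact algorithm: a LoRA linear layer that, during training, adds a random Gaussian perturbation to the merged weight `W + B @ A`, so that the optimized expected loss reflects flatness in the full parameter space rather than only the low-rank subspace. The noise scale `sigma`, the element-wise Gaussian noise, and the initialization choices are illustrative assumptions; the paper's refined perturbation generation strategy is not reproduced here.

```python
import torch

class PerturbedLoRALinear(torch.nn.Module):
    """Sketch of a LoRA layer with random weight perturbation on the merged weight.

    Only the low-rank factors A and B are trainable; the base weight W is frozen.
    Unlike SAM, the perturbation is sampled randomly (no extra backward pass).
    """

    def __init__(self, in_features, out_features, rank=4, sigma=0.01):
        super().__init__()
        # Frozen pre-trained weight W (random here for illustration).
        self.weight = torch.nn.Parameter(
            torch.randn(out_features, in_features), requires_grad=False
        )
        # Low-rank adapters: standard LoRA initialization (B starts at zero).
        self.A = torch.nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(out_features, rank))
        self.sigma = sigma  # perturbation scale (illustrative choice)

    def forward(self, x):
        merged = self.weight + self.B @ self.A  # effective full-space weight
        if self.training:
            # Sample a fresh random perturbation of the FULL merged weight each
            # step; detach so gradients flow only through A and B.
            noise = torch.randn_like(merged) * self.sigma
            merged = merged + noise.detach()
        return x @ merged.T

# One training step: gradients reach only the low-rank factors.
layer = PerturbedLoRALinear(8, 4)
x = torch.randn(2, 8)
loss = layer(x).pow(2).mean()
loss.backward()
```

Because the perturbation is applied to the merged weight rather than to `A` and `B` separately, the loss is smoothed along directions of the full parameter space, which is the distinction the abstract draws from flatness measured only in the LoRA subspace.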