🤖 AI Summary
LoRA’s linear adaptation structure inherently limits its representational capacity, creating an expressivity gap relative to nonlinear fine-tuning. To close this gap, we propose Activation Annealing: a training strategy that inserts learnable nonlinear activations (e.g., Sigmoid or GELU) into the adapter during early training to strengthen its modeling capability, then progressively anneals them toward linearity, yielding a strictly mergeable LoRA module at convergence. The method applies across diverse training paradigms, including supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and speculative decoding, while preserving LoRA’s low GPU memory footprint and deployment compatibility. Empirical results show that Activation Annealing significantly narrows the gap between LoRA and full-parameter fine-tuning, reaching near-parity with full-parameter performance on multiple benchmarks. To the best of our knowledge, this is the first LoRA enhancement framework to enable a dynamic “nonlinear training, linear inference” adaptation scheme.
📝 Abstract
Low-Rank Adaptation (LoRA) is a widely adopted parameter-efficient fine-tuning (PEFT) method, but its purely linear adaptation limits its expressive power, leaving a gap between linear adapter training and nonlinear full-parameter training. To bridge this gap, we propose AFA-LoRA, a novel training strategy that brings nonlinear expressivity to LoRA while maintaining its seamless mergeability. Our key innovation is an annealed activation function that transitions from a nonlinear to a linear transformation during training, allowing the adapter to exploit stronger representational capacity early on before converging to a mergeable linear form. We apply our method to supervised fine-tuning, reinforcement learning, and speculative decoding. The results show that AFA-LoRA narrows the performance gap between LoRA and full-parameter training, enabling a more powerful and practical paradigm for parameter-efficient adaptation.
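The annealing idea above can be illustrated with a minimal NumPy sketch. This is a hypothetical rendering, not the paper's exact formulation: it assumes the activation sits between LoRA's down- and up-projections and that annealing is a simple convex blend between GELU and the identity, controlled by a coefficient `alpha` that moves from 0 (fully nonlinear) to 1 (fully linear) over training. The function names and the linear schedule are illustrative assumptions.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def annealed_activation(x, alpha):
    """Blend a nonlinear activation with the identity.

    alpha = 0.0 -> pure GELU (maximum expressivity, not mergeable)
    alpha = 1.0 -> pure identity (linear, hence mergeable)
    """
    return (1.0 - alpha) * gelu(x) + alpha * x

def adapter_delta(x, A, B, alpha):
    """Hypothetical AFA-LoRA-style adapter path: B @ f(A @ x).

    Once alpha reaches 1, the path reduces to B @ A @ x, so the
    update delta_W = B @ A can be merged into the frozen weight
    exactly as in plain LoRA.
    """
    return B @ annealed_activation(A @ x, alpha)

def alpha_schedule(step, total_steps):
    # One possible schedule: linear ramp to full linearity.
    return min(1.0, step / total_steps)
```

At `alpha = 1.0`, `adapter_delta(x, A, B, 1.0)` equals `(B @ A) @ x`, which is why the converged adapter stays mergeable despite the nonlinear early phase.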