🤖 AI Summary
Existing parameter-efficient fine-tuning (PEFT) methods optimize only model weights while keeping activation functions fixed. This work proposes NoRA—the first efficient framework to directly fine-tune nonlinear activation functions in Transformers—by replacing fixed activations with learnable rational functions. NoRA employs a grouped, structured low-rank update to adapt numerator and denominator coefficients, enabling stable adaptation with extremely few parameters. It establishes the novel paradigm of “activation-space optimization,” which complements weight-based tuning and provides implicit regularization. NoRA integrates seamlessly with LoRA to form NoRA++, maintaining compatibility with mainstream architectures. Experiments demonstrate that on CIFAR-10/100, NoRA surpasses full fine-tuning (+0.17% and +0.27% accuracy, respectively) while training only 0.4% of the parameters. When combined with LoRA on LLaMA3-8B, NoRA++ improves average MMLU performance by 0.3–0.8%, with gains of up to 1.6% on STEM tasks.
📝 Abstract
Existing parameter-efficient fine-tuning (PEFT) methods primarily adapt weight matrices while keeping activation functions fixed. We introduce **NoRA**, the first PEFT framework that directly adapts nonlinear activation functions in pretrained transformer-based models. NoRA replaces fixed activations with learnable rational functions and applies structured low-rank updates to numerator and denominator coefficients, with a group-wise design that localizes adaptation and improves stability at minimal cost. On vision transformers trained on CIFAR-10 and CIFAR-100, NoRA matches or exceeds full fine-tuning while updating only 0.4% of parameters (0.02M), achieving accuracy gains of +0.17% and +0.27%. When combined with LoRA (**NoRA++**), it outperforms LoRA and DoRA under matched training budgets while adding fewer trainable parameters. On LLaMA3-8B instruction tuning, NoRA++ consistently improves generation quality, yielding average MMLU gains of +0.3%–0.8%, including +1.6% on STEM (Alpaca) and +1.3% on OpenOrca. We further show that NoRA constrains adaptation to a low-dimensional functional subspace, implicitly regularizing update magnitude and direction. These results establish activation-space tuning as a complementary and highly parameter-efficient alternative to weight-based PEFT, positioning activation functions as first-class objects for model adaptation.
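To make the core idea concrete, below is a minimal numpy sketch of a learnable rational activation with a grouped, structured low-rank update to its coefficients. This is an illustration of the general technique described in the abstract, not the paper's actual implementation: all names (`RationalActivation`, `U`, `V`, `groups`, `rank`) and the safe denominator form `Q = 1 + |b1·x + … + bn·xⁿ|` are our assumptions.

```python
import numpy as np

class RationalActivation:
    """Sketch of a NoRA-style activation: R(x) = P(x) / Q(x).

    P is a degree-m polynomial; Q = 1 + |b1*x + ... + bn*x^n| keeps the
    denominator >= 1 so the division is always well defined (an assumed
    stabilization, not necessarily the paper's). Base coefficients are
    frozen; adaptation flows only through a low-rank update
    delta_g = U_g @ V_g, one factor pair per channel group.
    """

    def __init__(self, m=5, n=4, groups=2, rank=1, seed=0):
        rng = np.random.default_rng(seed)
        self.m, self.n, self.groups = m, n, groups
        # Frozen base coefficients, one set per group: [a_0..a_m, b_1..b_n].
        self.base = rng.normal(scale=0.1, size=(groups, m + 1 + n))
        # Trainable low-rank factors, initialized to zero so the
        # adapted activation starts exactly at the pretrained one.
        self.U = np.zeros((groups, m + 1 + n, rank))
        self.V = np.zeros((groups, rank, 1))

    def coeffs(self, g):
        delta = (self.U[g] @ self.V[g]).ravel()  # structured low-rank update
        c = self.base[g] + delta
        return c[: self.m + 1], c[self.m + 1:]   # numerator, denominator

    def __call__(self, x):
        # x: (batch, channels); channels are split evenly into groups,
        # localizing adaptation as in the group-wise design.
        out = np.empty_like(x)
        for g, cols in enumerate(
            np.array_split(np.arange(x.shape[1]), self.groups)
        ):
            a, b = self.coeffs(g)
            xg = x[:, cols]
            P = sum(ai * xg**i for i, ai in enumerate(a))
            Q = 1.0 + np.abs(sum(bj * xg**(j + 1) for j, bj in enumerate(b)))
            out[:, cols] = P / Q
        return out
```

Only the `U` and `V` factors would be trained during fine-tuning, so the per-activation parameter count is `groups * rank * (m + n + 2)` rather than the full coefficient count, which is how the update stays in a low-dimensional functional subspace.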