Don't Forget the Nonlinearity: Unlocking Activation Functions in Efficient Fine-Tuning

📅 2025-09-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing parameter-efficient fine-tuning (PEFT) methods optimize only model weights while keeping activation functions fixed. This work proposes NoRA—the first efficient framework to directly fine-tune nonlinear activation functions in Transformers—by replacing fixed activations with learnable rational functions. NoRA employs a grouped, structured low-rank update to adapt numerator and denominator coefficients, enabling stable adaptation with extremely few parameters. It establishes the novel paradigm of “activation-space optimization,” which complements weight-based tuning and provides implicit regularization. NoRA seamlessly integrates with LoRA to form NoRA++, maintaining compatibility with mainstream architectures. Experiments demonstrate that on CIFAR-10/100, NoRA surpasses full fine-tuning in accuracy (+0.17–0.27%) while training only 0.4% of the parameters. When combined with LoRA on LLaMA3-8B, NoRA++ improves average MMLU performance by 0.3–0.8%, with gains of up to 1.6% on STEM tasks.

📝 Abstract
Existing parameter-efficient fine-tuning (PEFT) methods primarily adapt weight matrices while keeping activation functions fixed. We introduce NoRA, the first PEFT framework that directly adapts nonlinear activation functions in pretrained transformer-based models. NoRA replaces fixed activations with learnable rational functions and applies structured low-rank updates to numerator and denominator coefficients, with a group-wise design that localizes adaptation and improves stability at minimal cost. On vision transformers trained on CIFAR-10 and CIFAR-100, NoRA matches or exceeds full fine-tuning while updating only 0.4% of parameters (0.02M), achieving accuracy gains of +0.17% and +0.27%. When combined with LoRA (NoRA++), it outperforms LoRA and DoRA under matched training budgets by adding fewer trainable parameters. On LLaMA3-8B instruction tuning, NoRA++ consistently improves generation quality, yielding average MMLU gains of +0.3–0.8%, including +1.6% on STEM (Alpaca) and +1.3% on OpenOrca. We further show that NoRA constrains adaptation to a low-dimensional functional subspace, implicitly regularizing update magnitude and direction. These results establish activation-space tuning as a complementary and highly parameter-efficient alternative to weight-based PEFT, positioning activation functions as first-class objects for model adaptation.
Problem

Research questions and friction points this paper is trying to address.

Adapting nonlinear activation functions in pretrained transformers
Enabling parameter-efficient fine-tuning via learnable rational functions
Improving model performance with minimal parameter updates
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapts nonlinear activation functions directly
Uses learnable rational functions for activations
Applies structured low-rank updates to coefficients