Pay Attention to Small Weights

📅 2025-06-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the high resource overhead, catastrophic forgetting, and degraded generalization common in large-model fine-tuning, this paper proposes a lightweight fine-tuning paradigm based on weight magnitude. The core insight is that small-magnitude weights often carry large gradients during fine-tuning; leveraging this, we design a gradient-free parameter selection mechanism that dynamically updates only the subset of parameters whose magnitudes fall below an adaptive threshold, while freezing all others. Based on this principle, we introduce NANOADAM, a novel optimizer whose adaptive subnet selection requires no gradient computation. NANOADAM supports larger learning rates and preserves pretrained features. Evaluated across NLP and vision-language multimodal tasks, our method reduces memory consumption by up to 47% and significantly cuts computational cost, with no accuracy degradation and an average 2.1% improvement in generalization performance, demonstrating the effectiveness and robustness of updating small-magnitude weights.
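The summary above describes two pieces: a gradient-free selection rule (update only weights whose magnitudes fall below an adaptive threshold) and an Adam-style step restricted to that subset. A minimal NumPy sketch of the idea follows; the function names, the quantile-based threshold, and the `keep_fraction` knob are illustrative assumptions, not the paper's actual formulation:

```python
import numpy as np

def small_weight_mask(weights, keep_fraction=0.1):
    """Gradient-free selection: mark weights whose magnitudes fall below
    an adaptive threshold. Using a magnitude quantile here is an
    illustrative choice; `keep_fraction` is a hypothetical knob."""
    threshold = np.quantile(np.abs(weights), keep_fraction)
    return np.abs(weights) <= threshold

def masked_adam_step(w, g, m, v, mask, lr=1e-3, b1=0.9, b2=0.999,
                     eps=1e-8, t=1):
    """One standard Adam step applied only where `mask` is True; all
    other weights stay frozen at their pretrained values."""
    m = b1 * m + (1 - b1) * g           # first-moment estimate
    v = b2 * v + (1 - b2) * g**2        # second-moment estimate
    m_hat = m / (1 - b1**t)             # bias correction
    v_hat = v / (1 - b2**t)
    update = lr * m_hat / (np.sqrt(v_hat) + eps)
    w = w - np.where(mask, update, 0.0)  # frozen entries get zero update
    return w, m, v
```

Because the mask depends only on weight magnitudes, it can be computed before any backward pass, which is what makes the selection criterion gradient-free.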

📝 Abstract
Finetuning large pretrained neural networks is known to be resource-intensive, both in terms of memory and computational cost. To mitigate this, a common approach is to restrict training to a subset of the model parameters. By analyzing the relationship between gradients and weights during finetuning, we observe a notable pattern: large gradients are often associated with small-magnitude weights. This correlation is more pronounced in finetuning settings than in training from scratch. Motivated by this observation, we propose NANOADAM, which dynamically updates only the small-magnitude weights during finetuning and offers several practical advantages: first, this criterion is gradient-free -- the parameter subset can be determined without gradient computation; second, it preserves large-magnitude weights, which are likely to encode critical features learned during pretraining, thereby reducing the risk of catastrophic forgetting; third, it permits the use of larger learning rates and consistently leads to better generalization performance in experiments. We demonstrate this for both NLP and vision tasks.
Problem

Research questions and friction points this paper is trying to address.

Reducing memory and computational costs in finetuning large neural networks
Addressing catastrophic forgetting by preserving critical pretrained features
Improving generalization performance with dynamic small-weight updates
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic updates target small-magnitude weights only
Gradient-free parameter subset selection method
Enables larger learning rates and better generalization
🔎 Similar Papers
No similar papers found.