House of Cards: Massive Weights in LLMs

📅 2024-10-02
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) contain a small set of "massive weights" in an early feed-forward layer whose abnormally large intermediate activations bias the model toward specific tokens. This work traces massive activations to these weights and quantifies their dominant role: zeroing the massive weights destroys the model's functionality, while zeroing all other weights in the same matrix causes only a relatively minor performance drop, suggesting that pretraining concentrates learning on them. Building on this, the authors propose MacDrop (massive weights curriculum dropout), a simple plug-and-play method for parameter-efficient fine-tuning that applies dropout to the pre-trained massive weights, starting from a high dropout probability and gradually decreasing it as fine-tuning progresses. Experiments on zero-shot downstream tasks and long-context tasks show consistent performance gains and improved robustness.

📝 Abstract
Massive activations, which manifest in specific feature dimensions of hidden states, introduce a significant bias in large language models (LLMs), leading to an overemphasis on the corresponding token. In this paper, we identify that massive activations originate not from the hidden state but from the intermediate state of a feed-forward network module in an early layer. Expanding on the previous observation that massive activations occur only in specific feature dimensions, we dive deep into the weights that cause massive activations. Specifically, we define top-$k$ massive weights as the weights that contribute to the dimensions with the top-$k$ magnitudes in the intermediate state. When these massive weights are set to zero, the functionality of LLMs is entirely disrupted. However, when all weights except for massive weights are set to zero, it results in a relatively minor performance drop, even though a much larger number of weights are set to zero. This implies that during the pre-training process, learning is dominantly focused on massive weights. Building on this observation, we propose a simple plug-and-play method called MacDrop (massive weights curriculum dropout), to rely less on massive weights during parameter-efficient fine-tuning. This method applies dropout to the pre-trained massive weights, starting with a high dropout probability and gradually decreasing it as fine-tuning progresses. Through various experiments, including zero-shot downstream tasks, long-context tasks, and ablation studies, we demonstrate that MacDrop generally improves performance and strengthens robustness.
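As a sketch of the abstract's definition, the top-$k$ massive weight dimensions can be located by ranking the feature dimensions of the FFN intermediate state by their peak magnitude across tokens. The plain-Python function below is a minimal illustration; the name, signature, and list-of-rows tensor representation are assumptions, not the authors' code:

```python
def top_k_massive_dims(intermediate, k=3):
    """Return indices of the k feature dimensions with the largest
    peak magnitude across all tokens.

    `intermediate` is a list of token rows, each a list of d_ff
    floats (the FFN intermediate state). Illustrative only.
    """
    d_ff = len(intermediate[0])
    # Peak absolute activation per feature dimension.
    peak = [max(abs(row[j]) for row in intermediate) for j in range(d_ff)]
    # Dimensions sorted by peak magnitude, largest first.
    return sorted(range(d_ff), key=lambda j: peak[j], reverse=True)[:k]

# Synthetic example: dimension 2 carries a massive activation.
state = [[0.1] * 16 for _ in range(8)]
state[3][2] = 100.0
print(top_k_massive_dims(state, k=1))  # [2]
```

The weights feeding those dimensions would then be the "top-$k$ massive weights" that MacDrop targets.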
Problem

Research questions and friction points this paper is trying to address.

Identify origin of massive activations in LLMs.
Define and analyze top-k massive weights impact.
Propose MacDrop method for efficient fine-tuning.
Innovation

Methods, ideas, or system contributions that make the work stand out.

MacDrop for fine-tuning
Dropout on massive weights
Enhances model robustness
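The curriculum in MacDrop amounts to a dropout probability on the massive weights that starts high and anneals toward zero as fine-tuning proceeds. The sketch below uses a linear schedule and illustrative endpoint values; both are assumptions, not the paper's exact settings:

```python
import random

def macdrop_prob(step, total_steps, p_start=0.9, p_end=0.0):
    """Linearly anneal the dropout probability applied to massive
    weights from p_start down to p_end over fine-tuning.
    Schedule shape and endpoints are illustrative assumptions."""
    frac = min(step / max(total_steps, 1), 1.0)
    return p_start + (p_end - p_start) * frac

def drop_massive_weights(weights, massive_idx, p, rng=random):
    """Zero each massive weight independently with probability p;
    all other weights are left untouched."""
    out = list(weights)
    for i in massive_idx:
        if rng.random() < p:
            out[i] = 0.0
    return out
```

Early in fine-tuning the model is thus trained mostly without its massive weights, and they are gradually reintroduced, which is one way to "rely less" on them without changing the architecture.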
Jaehoon Oh
Samsung Advanced Institute of Technology, Korea
Seungjun Shin
Samsung Advanced Institute of Technology, Korea
Dokwan Oh
Samsung Advanced Institute of Technology, Korea