Supernodes and Halos: Loss-Critical Hubs in LLM Feed-Forward Layers

📅 2026-04-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work shows that feed-forward networks (FFNs) in large language models contain a small number of highly loss-sensitive channels, termed supernodes, surrounded by a weaker halo of related channels. Conventional structured pruning methods do not account for this organization and degrade sharply when they prune many supernodes. The authors propose the SCAR family of structured pruning variants, which uses a Fisher-style loss proxy based on activation-gradient second moments to identify supernodes and explicitly protect them, rather than relying only on activation magnitudes or weight norms. At 50% FFN sparsity on Llama-3.1-8B, SCAR-Prot reaches a perplexity of 54.8, compared with 989.2 for Wanda-channel. The loss-proxy concentration pattern also appears in Mistral-7B, Llama-2-7B, and Qwen2-7B, and remains visible in targeted Llama-3.1-70B experiments.
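The idea of protecting a loss-critical core during structured pruning can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function name `protected_keep_mask`, the `protect_fraction` parameter, and the rule of filling the remaining budget with a baseline score are hypothetical, and the summary does not specify how the SCAR variants rank non-protected channels.

```python
def protected_keep_mask(lp_scores, base_scores, sparsity=0.5, protect_fraction=0.01):
    """Return a per-channel mask (True = keep) for one FFN layer.

    ASSUMPTION: the top `protect_fraction` of channels by loss proxy are always
    kept, and the rest of the keep budget is filled by a baseline importance
    score. This is an illustrative rule, not the authors' exact procedure.
    """
    n = len(lp_scores)
    n_keep = n - round(n * sparsity)
    # Never protect more channels than the keep budget allows.
    n_protect = min(n_keep, max(1, round(n * protect_fraction)))

    by_lp = sorted(range(n), key=lambda j: lp_scores[j], reverse=True)
    protected = set(by_lp[:n_protect])

    # Rank the remaining channels by the baseline score only.
    rest = sorted(
        (j for j in range(n) if j not in protected),
        key=lambda j: base_scores[j],
        reverse=True,
    )
    kept = protected | set(rest[: n_keep - n_protect])
    return [j in kept for j in range(n)]


# Toy layer with 8 channels: channel 1 has a high loss proxy but a low baseline score.
lp = [0.1, 9.0, 0.2, 0.1, 0.3, 0.1, 0.2, 0.1]
base = [5.0, 0.1, 4.0, 3.0, 2.0, 1.0, 0.5, 0.2]
mask = protected_keep_mask(lp, base, sparsity=0.5, protect_fraction=0.125)
```

In this toy layer, ranking by the baseline score alone would drop channel 1, whereas the protected mask keeps it and gives up the weakest baseline-ranked channel instead.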

📝 Abstract

We study the organization of channel-level importance in transformer feed-forward networks (FFNs). Using a Fisher-style loss proxy (LP) based on activation-gradient second moments, we show that loss sensitivity is concentrated in a small set of channels within each layer. In Llama-3.1-8B, the top 1% of channels per layer accounts for a median of 58.7% of LP mass, with a range of 33.0% to 86.1%. We call these loss-critical channels supernodes. Although FFN layers also contain strong activation outliers, LP-defined supernodes overlap only weakly with activation-defined outliers and are not explained by activation power or weight norms alone. Around this core, we find a weaker but consistent halo structure: some non-supernode channels share the supernodes' write support and show stronger redundancy with the protected core. We use one-shot structured FFN pruning as a diagnostic test of this organization. At 50% FFN sparsity, baselines that prune many supernodes degrade sharply, whereas our SCAR variants explicitly protect the supernode core; the strongest variant, SCAR-Prot, reaches perplexity 54.8 compared with 989.2 for Wanda-channel. The LP-concentration pattern appears across Mistral-7B, Llama-2-7B, and Qwen2-7B, remains visible in targeted Llama-3.1-70B experiments, and increases during OLMo-2-7B pretraining. These results suggest that LLM FFNs develop a small learned core of loss-critical channels, and that preserving this core is important for reliable structured pruning.
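To make the concentration statistic concrete, the sketch below computes a per-channel loss proxy and the share of its mass held by the top 1% of channels. The exact form of the proxy is an assumption: the abstract only says it is Fisher-style and based on activation-gradient second moments, so the mean of squared activation-gradient products is used here as one plausible reading. The function names and the synthetic data are hypothetical and do not come from any real model.

```python
import random


def loss_proxy(acts, grads):
    """Per-channel proxy: mean over tokens of (activation * gradient)^2.

    ASSUMPTION: this specific formula is illustrative only.
    acts, grads: lists of rows, one row per token, one column per channel.
    """
    n_tokens = len(acts)
    n_channels = len(acts[0])
    lp = [0.0] * n_channels
    for a_row, g_row in zip(acts, grads):
        for j, (a, g) in enumerate(zip(a_row, g_row)):
            lp[j] += (a * g) ** 2
    return [v / n_tokens for v in lp]


def top_fraction_mass(scores, fraction=0.01):
    """Share of the total score mass held by the top `fraction` of channels."""
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(scores, reverse=True)
    total = sum(ranked)
    return sum(ranked[:k]) / total if total > 0 else 0.0


# Synthetic demo: 200 channels, two of which receive 10x larger gradients.
rng = random.Random(0)
n_tokens, n_channels = 64, 200
hubs = {3, 17}
acts = [[rng.gauss(0, 1) for _ in range(n_channels)] for _ in range(n_tokens)]
grads = [
    [rng.gauss(0, 1) * (10.0 if j in hubs else 1.0) for j in range(n_channels)]
    for _ in range(n_tokens)
]

lp = loss_proxy(acts, grads)
share = top_fraction_mass(lp, fraction=0.01)
top2 = sorted(range(n_channels), key=lambda j: lp[j], reverse=True)[:2]
```

Under this reading, `share` plays the role of the per-layer figure the abstract reports (a median of 58.7% for the top 1% of channels in Llama-3.1-8B), and `top2` recovers the two synthetic hub channels even though their activations are no larger than those of any other channel.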

Problem

Research questions and friction points this paper is trying to address.

supernodes

loss-critical channels

feed-forward networks

structured pruning

channel importance

Innovation

Methods, ideas, or system contributions that make the work stand out.

supernodes

loss proxy

structured pruning