🤖 AI Summary
To address the degradation in downstream generalization caused by distributional bias in few-shot calibration sets during structured pruning of large language models (LLMs), this paper proposes the Function-Aware Neuron Grouping (FANG) framework. Methodologically, FANG introduces four key innovations: (1) semantic-context-driven neuron clustering based on functional roles; (2) context-aware weighted importance estimation; (3) cross-context functional contribution identification and preservation; and (4) block-level adaptive sparsity allocation. FANG is compatible with existing pruning methods such as FLAP and OBC. Under 30% and 40% sparsity ratios, FANG achieves average downstream task accuracy improvements of 1.5–8.5% over FLAP and OBC, significantly enhancing language modeling capability and cross-task generalization. It establishes a new state-of-the-art in structured LLM pruning.
📝 Abstract
Large Language Models (LLMs) demonstrate impressive performance across natural language tasks but incur substantial computational and storage costs due to their scale. Post-training structured pruning offers an efficient solution. However, when few-shot calibration sets fail to adequately reflect the pretraining data distribution, existing methods exhibit limited generalization to downstream tasks. To address this issue, we propose Function-Aware Neuron Grouping (FANG), a post-training pruning framework that alleviates calibration bias by identifying and preserving neurons critical to specific functions. FANG groups neurons with similar functions based on the type of semantic context they process and prunes each group independently. During importance estimation within each group, tokens that strongly correlate with the functional role of the neuron group are given higher weight. FANG also preserves neurons that contribute across multiple context types. To achieve a better trade-off between sparsity and performance, it allocates sparsity to each block adaptively based on its functional complexity. Experiments show that FANG improves downstream accuracy while preserving language modeling performance. It achieves state-of-the-art (SOTA) results when combined with FLAP and OBC, two representative pruning methods. Specifically, FANG outperforms FLAP and OBC by 1.5%--8.5% in average accuracy under 30% and 40% sparsity.
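The core idea of grouped, context-weighted importance estimation can be sketched roughly as follows. This is a minimal illustration with hypothetical function names and a simple weighted-|activation| score; the paper's actual clustering, weighting, and preservation criteria are not reproduced here.

```python
import numpy as np

def context_weighted_importance(acts, token_weights):
    # acts: (n_tokens, n_neurons) activations of one neuron group on calibration tokens.
    # token_weights: (n_tokens,) higher weight for tokens matching the group's
    # functional role (a stand-in for FANG's context-aware weighting).
    w = token_weights / token_weights.sum()
    return (np.abs(acts) * w[:, None]).sum(axis=0)

def prune_within_groups(acts, groups, token_weights_per_group, sparsity):
    # Prune the lowest-importance neurons independently inside each group,
    # returning a boolean keep-mask over all neurons.
    keep = np.ones(acts.shape[1], dtype=bool)
    for name, neuron_idx in groups.items():
        scores = context_weighted_importance(acts[:, neuron_idx],
                                             token_weights_per_group[name])
        n_prune = int(round(sparsity * len(neuron_idx)))
        prune_local = np.argsort(scores)[:n_prune]  # weakest neurons in this group
        keep[np.asarray(neuron_idx)[prune_local]] = False
    return keep
```

A real implementation would additionally exempt neurons found important in several context groups and vary `sparsity` per block, as the abstract describes; this sketch only shows the per-group weighted scoring step.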