🤖 AI Summary
This work investigates why network pruning, though effective on non-generative tasks, often fails in generative settings, a discrepancy whose root cause has remained unclear. By decomposing the internal representations of language models into embedding, logit, and probability spaces, the study shows, for the first time, how pruning-induced perturbations propagate and amplify differently across task types. Through representational-space decomposition, perturbation analysis, and modeling of error accumulation across time steps, supported by empirical experiments, the authors identify the softmax nonlinearity as the key driver of performance degradation in generative tasks. The resulting framework systematically explains why pruning succeeds on retrieval and multiple-choice tasks yet fails on generation, offering both theoretical grounding and practical guidance for applying pruning methods in real-world scenarios.
📝 Abstract
Network pruning, which removes less important parameters or architectures, is often expected to improve efficiency while preserving performance. However, this expectation does not consistently hold across language tasks: pruned models can perform well on non-generative tasks but frequently fail in generative settings. To understand this discrepancy, we analyze network pruning from a representation-hierarchy perspective, decomposing the internal computation of language models into three sequential spaces: embedding (hidden representations), logit (pre-softmax outputs), and probability (post-softmax distributions). We find that representations in the embedding and logit spaces are largely robust to pruning-induced perturbations. However, the nonlinear transformation from logits to probabilities amplifies these deviations, which accumulate across time steps and lead to substantial degradation during generation. In contrast, the stability of the categorical-token probability subspace, together with the robustness of the embedding space, supports the effectiveness of pruning for non-generative tasks such as retrieval and multiple-choice selection. Our analysis disentangles the effects of pruning across tasks and provides practical guidance for its application. Code is available at https://github.com/CASE-Lab-UMD/Pruning-on-Representations.
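The two failure mechanisms described above can be illustrated with a minimal sketch (not code from the paper; the logit values and the toy 3-token transition table are hypothetical): a small logit-space perturbation barely moves the logit vector, yet the softmax decision boundary flips the greedy token when the top two logits are near-tied, and in autoregressive decoding that single flipped token re-conditions every subsequent step.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax (subtract the max before exponentiating).
    z = np.asarray(z, dtype=float) - np.max(z)
    e = np.exp(z)
    return e / e.sum()

# --- Part 1: near-tied logits make the post-softmax argmax fragile. ---
# A pruning-sized shift (|delta| = 0.06) flips the greedy choice even though
# the logit vector itself barely moves. All values are illustrative.
clean  = np.array([2.00, 1.95, -1.0, -3.0])
pruned = clean + np.array([-0.06, 0.06, 0.0, 0.0])
assert np.argmax(softmax(clean)) != np.argmax(softmax(pruned))

# --- Part 2: one flipped token re-conditions all later steps. ---
# Toy 3-token "language model": row t holds the next-token logits emitted
# after token t (a hypothetical stand-in for a real decoder).
T = np.array([
    [0.0, 1.0, 0.9],   # after token 0, token 1 narrowly beats token 2
    [2.0, 0.0, 0.0],   # after token 1, return to token 0
    [0.0, 0.0, 2.0],   # token 2 is absorbing
])

def greedy_decode(start, steps, logit_bias=np.zeros(3)):
    """Greedy decoding; `logit_bias` emulates a constant pruning error."""
    seq = [start]
    for _ in range(steps):
        seq.append(int(np.argmax(T[seq[-1]] + logit_bias)))
    return seq

print(greedy_decode(0, 5))                       # clean:  [0, 1, 0, 1, 0, 1]
print(greedy_decode(0, 5,                        # pruned: [0, 2, 2, 2, 2, 2]
                    logit_bias=np.array([0.0, -0.2, 0.0])))
```

Note that the same logit perturbation is harmless for a non-generative use of the model: ranking or multiple-choice selection reads the logits (or embeddings) once, so a deviation too small to reorder well-separated candidates never gets the chance to compound across time steps.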