A Free Lunch in LLM Compression: Revisiting Retraining after Pruning

📅 2025-10-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
It remains unclear whether full fine-tuning is strictly necessary to recover performance after pruning large language models (LLMs). Method: the paper systematically studies weight reconstruction strategies for GPT architectures and proposes a fine-grained local approach: within each Transformer block, the attention and MLP submodules are reconstructed independently, rather than fine-tuning the model globally. The method combines layer-wise mask selection, matrix reconstruction on small calibration datasets, and lightweight pruning criteria (e.g., Wanda), with memory overhead far below that of full fine-tuning. Results: experiments on mainstream GPT models show Pareto-optimal trade-offs: lower perplexity than full retraining at a fraction of the computational and memory cost. This work provides the first empirical evidence that carefully designed lightweight reconstruction can fully replace full retraining, challenging the longstanding assumption that pruning necessitates retraining, and establishes a new paradigm for efficient model compression.

📝 Abstract
While Neural Network pruning typically requires retraining the model to recover pruning-induced performance degradation, state-of-the-art Large Language Models (LLMs) pruning methods instead solve a layer-wise mask selection and reconstruction problem on a small set of calibration data to avoid full retraining, as it is considered computationally infeasible for LLMs. Reconstructing single matrices in isolation has favorable properties, such as convexity of the objective and significantly reduced memory requirements compared to full retraining. In practice, however, reconstruction is often implemented at coarser granularities, e.g., reconstructing a whole transformer block against its dense activations instead of a single matrix. In this work, we study the key design choices when reconstructing or retraining the remaining weights after pruning. We conduct an extensive computational study on state-of-the-art GPT architectures, and report several surprising findings that challenge common intuitions about retraining after pruning. In particular, we observe a free lunch scenario: reconstructing attention and MLP components separately within each transformer block is nearly the most resource-efficient yet achieves the best perplexity. Most importantly, this Pareto-optimal setup achieves better performance than full retraining, despite requiring only a fraction of the memory. Furthermore, we demonstrate that simple and efficient pruning criteria such as Wanda can outperform much more complex approaches when the reconstruction step is properly executed, highlighting its importance. Our findings challenge the narrative that retraining should be avoided at all costs and provide important insights into post-pruning performance recovery for LLMs.
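The abstract notes that reconstructing a single pruned matrix against its dense activations is a convex least-squares problem. A minimal numpy sketch of that per-row reconstruction is below; the function name, interface, and the small ridge term for numerical stability are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def reconstruct_row(w_dense, mask_row, X, ridge=1e-6):
    """Re-fit the surviving weights of one output row so the sparse row
    reproduces the dense row's outputs on calibration inputs
    X (n_samples x in_dim). Restricted to the kept coordinates, this is
    an ordinary (convex) least-squares problem with a closed-form solution.
    The ridge term is an assumed stabilizer, not part of the paper."""
    target = X @ w_dense                        # dense pre-activations
    Xs = X[:, mask_row]                         # columns of kept weights
    A = Xs.T @ Xs + ridge * np.eye(Xs.shape[1])
    w_new = np.zeros_like(w_dense)
    w_new[mask_row] = np.linalg.solve(A, Xs.T @ target)
    return w_new
```

Each output row can be solved independently, which is what keeps the memory footprint of matrix-level reconstruction so far below that of full retraining.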
Problem

Research questions and friction points this paper is trying to address.

Revisiting retraining versus reconstruction trade-offs in LLM pruning
Identifying optimal granularity for weight reconstruction after pruning
Challenging assumptions about retraining necessity in large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reconstructing attention and MLP components separately after pruning
Using simple pruning criteria with proper reconstruction execution
Achieving better performance than full retraining with less memory
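The Wanda criterion referenced above scores each weight by its magnitude times the l2 norm of the corresponding input activation, then prunes the lowest-scoring weights per output row. A minimal numpy sketch (the function name and interface are illustrative, not from the Wanda paper):

```python
import numpy as np

def wanda_prune(W, X, sparsity=0.5):
    """Prune W (out_dim x in_dim) with the Wanda score
    S_ij = |W_ij| * ||X_:,j||_2, where X (n_samples x in_dim) holds
    calibration activations. The lowest-scoring fraction of weights
    is removed independently within each output row."""
    scores = np.abs(W) * np.linalg.norm(X, axis=0)   # (out, in)
    k = int(W.shape[1] * sparsity)                   # drops per row
    mask = np.ones_like(W, dtype=bool)
    if k > 0:
        lowest = np.argsort(scores, axis=1)[:, :k]   # smallest scores
        np.put_along_axis(mask, lowest, False, axis=1)
    return W * mask, mask
```

The paper's point is that such a cheap criterion can match or beat far more complex ones once the subsequent reconstruction step is executed properly.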
Moritz Wagner
Department for AI in Society, Science, and Technology, Zuse Institute Berlin, Germany
Christophe Roux
TU Berlin, Zuse Institute Berlin
Optimization, Machine Learning
Max Zimmer
Zuse Institute Berlin
Deep Learning, Optimization, Mathematics
Sebastian Pokutta
Department for AI in Society, Science, and Technology, Zuse Institute Berlin, Germany; Institute of Mathematics, Technische Universität Berlin, Germany