🤖 AI Summary
This work proposes GRAIL, a post-compression compensation method that restores model accuracy without fine-tuning or labeled data. Under high compression rates, structurally compressed models often suffer accuracy degradation and typically require labeled data for recovery. In contrast, GRAIL leverages only a small amount of unlabeled calibration data to reconstruct the input-output behavior of pruned modules via block-wise linear regression, entirely avoiding backpropagation and labels. It enables tuning-free recovery by summarizing hidden activations through Gram matrices and performing ridge regression for linear reconstruction, with the resulting mapping absorbed directly into downstream weights. GRAIL is compatible with diverse pruning strategies and naturally degrades to classical pruning when channel correlations are weak. Experiments demonstrate that GRAIL consistently outperforms both data-independent and data-dependent compression methods across ResNets, Vision Transformers, and decoder-only large language models, significantly improving accuracy or reducing perplexity.
📝 Abstract
Structured deep model compression methods are hardware-friendly and substantially reduce memory and inference costs. However, under aggressive compression, the resulting accuracy degradation often necessitates post-compression finetuning, which can be impractical due to missing labeled data or high training cost. We propose GRAIL, a post-hoc blockwise compensation method: a simple, finetuning-free step applied after model compression that restores each block's input-output behavior using a small calibration set. The method summarizes hidden activations via a Gram matrix and applies ridge regression to linearly reconstruct the original hidden representation from the reduced one. The resulting reconstruction map is absorbed into the downstream projection weights, while the upstream layer is compressed. The approach is selector-agnostic (magnitude, Wanda, Gram-based selection, or folding), data-aware (requiring only a few forward passes, without gradients or labels), and recovers classic pruning or folding when the Gram matrix is near identity, indicating weak inter-channel correlations. Across ResNets, ViTs, and decoder-only LLMs, GRAIL consistently improves accuracy or perplexity over data-free and data-aware pruning or folding baselines in practical compression regimes, with manageable overhead and no backpropagation. The code is available at https://github.com/TWWinde/GRAIL.
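The compensation step described above can be sketched on a toy two-layer block. This is a minimal illustration, not the authors' implementation: all names (`ridge_reconstruction`, `lam`, the magnitude-based channel selector) are assumptions for the sketch. It prunes half the hidden channels, fits a ridge-regression map from the surviving activations back to the full hidden representation using only unlabeled calibration inputs, and absorbs that map into the downstream weight matrix.

```python
import numpy as np

def ridge_reconstruction(X_kept, X_full, lam=1e-3):
    """Fit M minimizing ||X_full - X_kept @ M||^2 + lam*||M||^2,
    i.e. M = (X_kept^T X_kept + lam*I)^{-1} X_kept^T X_full.
    X_kept: (n, k) activations of surviving channels.
    X_full: (n, d) original activations. Returns M of shape (k, d)."""
    gram = X_kept.T @ X_kept               # (k, k) Gram-matrix summary
    cross = X_kept.T @ X_full              # (k, d) cross term
    return np.linalg.solve(gram + lam * np.eye(gram.shape[0]), cross)

# Toy block: y = relu(x @ W1) @ W2, with W1 to be pruned.
rng = np.random.default_rng(0)
d_in, d_hid, d_out, n = 8, 16, 4, 256
W1 = rng.normal(size=(d_in, d_hid))
W2 = rng.normal(size=(d_hid, d_out))
X = rng.normal(size=(n, d_in))             # unlabeled calibration inputs
H = np.maximum(X @ W1, 0)                  # full hidden activations

# Selector-agnostic: here, simple magnitude-based channel selection.
keep = np.argsort(-np.abs(W1).sum(axis=0))[: d_hid // 2]
H_k = H[:, keep]                           # reduced activations

M = ridge_reconstruction(H_k, H)           # (k, d_hid) reconstruction map
W2_comp = M @ W2                           # absorb map into downstream weights

Y_full = H @ W2
Y_pruned = H_k @ W2[keep]                  # plain pruning, no compensation
Y_comp = H_k @ W2_comp                     # pruning + blockwise compensation
err_pruned = np.linalg.norm(Y_full - Y_pruned)
err_comp = np.linalg.norm(Y_full - Y_comp)
print(err_pruned, err_comp)
```

Note that inference cost is unchanged relative to plain pruning: `W2_comp` has the same shape as `W2[keep]`, so the compensation is free at runtime, and no gradients or labels are needed to compute it.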