🤖 AI Summary
Post-training pruning is a key approach for efficient large language model (LLM) compression, yet its performance depends critically on calibration data, especially at high sparsity, and existing work lacks a systematic analysis of this dependency. This paper shows that the choice of calibration data can matter more for pruning performance than the design of the pruning algorithm itself. Since the pre-training data of advanced LLMs is usually inaccessible, the authors propose a self-generating calibration data synthesis strategy that exploits the model's autoregressive capability to construct suitable calibration sets. Experiments on strong open-source foundation models (e.g., DCLM, LLaMA-3) show that the synthesized data substantially outperforms commonly used calibration datasets such as WikiText, improving pruning accuracy by up to 15.2% at high sparsity and further enhancing strong pruning methods such as Wanda and OWL.
📝 Abstract
As large language models (LLMs) are widely applied across various fields, model compression has become increasingly important for reducing costs and improving inference efficiency. Post-training pruning is a promising approach that avoids resource-intensive iterative training and requires only a small amount of calibration data to assess parameter importance. Previous research has primarily focused on designing advanced pruning methods, while the impact of different calibration data on pruning performance still lacks systematic exploration. We fill this gap and surprisingly observe that the choice of calibration data matters even more than designing advanced pruning strategies, especially at high sparsity. Our preliminary exploration also reveals that calibration data similar to the training data yields better performance. Since pre-training data is usually inaccessible for advanced LLMs, we further propose a self-generating calibration data synthesis strategy to construct feasible calibration data. We conduct experiments on recent strong open-source LLMs (e.g., DCLM and LLaMA-3), and the results show that the proposed method outperforms commonly used calibration data and effectively enhances strong pruning methods (e.g., Wanda, OWL).
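To make the role of calibration data concrete, here is a minimal NumPy sketch of a Wanda-style importance score, where each weight's importance is |W_ij| · ||X_j||_2 and X is the activation matrix collected by running the calibration set through the model. The function name and shapes are illustrative assumptions, not the paper's implementation; the point is that the pruning decision depends directly on which calibration inputs produced X.

```python
import numpy as np

def wanda_prune_layer(W, X, sparsity=0.5):
    """Prune a linear layer with Wanda-style scores (illustrative sketch).

    W: weight matrix, shape (out_features, in_features)
    X: calibration activations, shape (num_tokens, in_features)
    sparsity: fraction of weights to zero out in each output row
    """
    # Per-input-feature activation norm, estimated from calibration data.
    x_norm = np.linalg.norm(X, axis=0)          # shape (in_features,)
    # Importance of each weight: |W_ij| * ||X_j||_2 (broadcast over rows).
    scores = np.abs(W) * x_norm
    # Zero out the k lowest-scoring weights per output row.
    k = int(W.shape[1] * sparsity)
    mask = np.ones_like(W, dtype=bool)
    lowest = np.argsort(scores, axis=1)[:, :k]  # indices of k smallest per row
    np.put_along_axis(mask, lowest, False, axis=1)
    return W * mask
```

Because `x_norm` is computed from the calibration activations, two different calibration sets can rank the same weights very differently, which is exactly the sensitivity the paper studies.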