🤖 AI Summary
To address the high manual effort in large language model (LLM) pruning and the sharp performance degradation at high sparsity levels caused by weight outliers, this paper proposes a "self-pruning" framework. Methodologically, it pioneers the use of LLMs to autonomously generate pruning algorithms—eliminating reliance on expert-crafted heuristics—and enhances prompt engineering via Graph-driven Chain-of-Thought (GCoT) reasoning to improve algorithm generation quality. Furthermore, it introduces a Skew-aware Dynamic Sparsity Allocation (SDSA) mechanism that adaptively allocates sparsity to mitigate outlier-induced distortion. Evaluated on mainstream LLM benchmarks, the framework achieves superior accuracy–sparsity trade-offs over baselines (e.g., Wanda) at high pruning rates (50%–70%), while preserving model interpretability and computational efficiency. Open-sourced code supports the framework's effectiveness and cross-architecture generalizability.
📝 Abstract
Large language models (LLMs) have achieved remarkable performance on a wide range of tasks, yet their massive size hinders real-world deployment. Existing pruning methods tailored for LLMs (e.g., Wanda) rely heavily on manually designed pruning algorithms, leading to *huge labor costs* and *a dependence on expert knowledge*. Furthermore, we are the first to identify the serious *outlier value issue* behind the dramatic performance degradation that uniform sparsity causes under high pruning ratios, raising an additional concern: how can we design adaptive pruning sparsity ideal for LLMs? Can LLMs prune by themselves? In this work, we give an affirmative answer by proposing a novel pruning method called **AutoPrune**, which overcomes the limits of expert knowledge by leveraging LLMs to automatically design optimal pruning algorithms for themselves. Specifically, to mitigate the black-box nature of LLMs, we propose a Graph-driven Chain-of-Thought (GCoT) to optimize prompts, significantly enhancing the reasoning process in learning the pruning algorithm and enabling the generation of pruning algorithms with superior performance and interpretability in the next generation. Finally, grounded in our insights into the outlier value issue, we introduce Skew-aware Dynamic Sparsity Allocation (SDSA), which mitigates performance degradation under high pruning ratios. We conduct extensive experiments on mainstream LLM benchmarks, demonstrating the superiority of AutoPrune, which consistently outperforms state-of-the-art competitors. The code is available at: https://anonymous.4open.science/r/AutoPrune.
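The abstract does not spell out how skew-aware sparsity allocation works, but the core idea — layers whose weight magnitudes are heavily skewed by outliers should be pruned less aggressively — can be sketched as follows. This is a hypothetical illustration, not the paper's actual SDSA algorithm: the function name `skew_aware_sparsity`, the linear adjustment rule, and the `alpha` strength parameter are all assumptions made for the example.

```python
import numpy as np

def skew_aware_sparsity(layer_weights, target_sparsity=0.6, alpha=0.1):
    """Hypothetical sketch of skew-aware sparsity allocation.

    Layers whose |weights| are more skewed (i.e., dominated by outliers)
    receive a lower pruning ratio; the mean ratio matches the global target.
    """
    skews = []
    for W in layer_weights:
        w = np.abs(np.asarray(W)).ravel()
        mu, sigma = w.mean(), w.std()
        # third standardized moment of |W| as an outlier-sensitivity proxy
        skews.append(((w - mu) ** 3).mean() / (sigma ** 3 + 1e-8))
    skews = np.array(skews)

    # center and scale skewness to [-1, 1] across layers
    centered = skews - skews.mean()
    norm = centered / (np.abs(centered).max() + 1e-8)

    # prune outlier-heavy (high-skew) layers less, others more
    ratios = target_sparsity - alpha * norm

    # rescale so the average sparsity still hits the global budget
    ratios += target_sparsity - ratios.mean()
    return np.clip(ratios, 0.0, 0.95)
```

Under this sketch, a layer with injected large-magnitude outliers is assigned a lower sparsity ratio than a well-behaved layer, while the budget (mean sparsity) is preserved.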