🤖 AI Summary
To address the excessive computational and memory overhead of large language model (LLM) deployment, this paper proposes a structured post-training pruning method grounded in expander graph theory. It is the first to incorporate the strong connectivity property of expander graphs into N:M sparse pruning design, enabling a theoretically principled weight-selection mechanism that preserves critical information-flow paths while ensuring global connectivity and robustness of the pruned network. Compared to conventional structured pruning, the proposed approach significantly improves accuracy retention under high sparsity. Evaluated on Llama-2 and OPT, it achieves an average 2.1× inference speedup and 48% memory reduction, outperforming state-of-the-art methods by 1.3–2.7 percentage points in accuracy. The core contribution lies in establishing a formal theoretical link between expander graph properties and the learnability of sparse structures, thereby introducing a novel paradigm for efficient LLM deployment.
📝 Abstract
As Large Language Models (LLMs) become more widely adopted and scale up in size, the computational and memory challenges involved in deploying these massive foundation models have grown increasingly severe. This underscores the urgent need to develop more efficient model variants. Faced with this challenge, the present work introduces EGGS-PTP: an Expander-Graph Guided Structured Post-training Pruning method. The proposed approach leverages graph theory to guide the design of N:M structured pruning, effectively reducing model size and computational demands. By incorporating concepts from expander graphs, EGGS-PTP ensures information flow within the pruned network, preserving essential model functionality. Extensive numerical experiments demonstrate that EGGS-PTP not only achieves significant acceleration and memory savings due to structured sparsity but also outperforms existing structured pruning techniques in terms of accuracy across various LLMs.
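To make the N:M structured sparsity pattern concrete, the sketch below shows a plain magnitude-based 2:4 pruning baseline: in every group of four consecutive weights, only the two largest-magnitude entries are kept. This is a minimal illustration of the N:M constraint itself, not EGGS-PTP's expander-graph-guided selection; the function name `nm_prune` is hypothetical and the selection criterion (magnitude) is an assumption for demonstration only.

```python
import numpy as np

def nm_prune(weights, n=2, m=4):
    """Baseline magnitude-based N:M pruning (illustrative only, not the
    paper's expander-graph-guided criterion): within each group of M
    consecutive weights along a row, keep the N entries with the largest
    absolute value and zero out the rest."""
    w = np.asarray(weights, dtype=float)
    rows, cols = w.shape
    assert cols % m == 0, "column count must be divisible by M"
    groups = w.reshape(rows, cols // m, m)
    # Indices of the (M - N) smallest-magnitude entries in each group.
    drop = np.argsort(np.abs(groups), axis=-1)[..., : m - n]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=-1)
    return (groups * mask).reshape(rows, cols)

W = np.array([[0.9, -0.1, 0.05, -1.2, 0.3, 0.7, -0.02, 0.4]])
print(nm_prune(W))
# [[ 0.9  0.   0.  -1.2  0.   0.7  0.   0.4]]
```

Because exactly N of every M weights survive, this pattern maps directly onto sparse tensor-core hardware (e.g., 2:4 support on NVIDIA Ampere), which is what yields the structured-sparsity speedups and memory savings the abstract describes; EGGS-PTP's contribution is choosing *which* N weights to keep so that the pruned network retains expander-like connectivity.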