🤖 AI Summary
Deploying general-purpose large language models (LLMs) incurs substantial inference overhead, and existing pruning methods struggle to simultaneously preserve expert-level performance on specialized tasks and retain broad general capabilities.
Method: We propose a customized pruning paradigm that performs fine-grained structured pruning along three dimensions (language, domain, and task), guided by neuron importance analysis, multi-dimensional semantic alignment evaluation, and cross-model-family generalization strategies, enabling expert model generation without any post-training.
Contribution/Results: Evaluated across mainstream model families (Llama, Qwen, Phi), our approach achieves under 0.8% average accuracy degradation on expert tasks while retaining over 96% of general capabilities, significantly outperforming state-of-the-art pruning methods. To our knowledge, this is the first framework to generate high-fidelity expert LLMs with zero post-training and without compromising versatility.
📝 Abstract
Large language models (LLMs) have revolutionized natural language processing, yet their large model sizes demand substantial computational resources. To save computational resources and accelerate inference, it is crucial to prune redundant parameters, especially for experienced users who often need compact expert models tailored to specific downstream scenarios. However, most existing pruning methods focus on preserving the model's general capabilities, and they often require extensive post-training or suffer from degraded performance due to coarse-grained pruning. In this work, we design a $\underline{Cus}$tom $\underline{Prun}$ing method ($\texttt{Cus-Prun}$) to prune a large general model into a smaller lightweight expert model, which is positioned along the "language", "domain" and "task" dimensions. By identifying and pruning irrelevant neurons along each dimension, $\texttt{Cus-Prun}$ creates expert models without any post-training. Our experiments demonstrate that $\texttt{Cus-Prun}$ consistently outperforms other methods, achieving minimal loss in both expert and general capabilities across models from different families and sizes.
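The abstract does not spell out the pruning criterion, but the general recipe it describes (score neurons by their relevance to a target dimension, then drop the irrelevant ones) can be sketched with a common proxy: mean absolute activation over a calibration set drawn from the target language, domain, or task. The functions `neuron_importance` and `prune_mask` below are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

def neuron_importance(activations: np.ndarray) -> np.ndarray:
    """Mean absolute activation per neuron over a calibration set.

    activations: (num_samples, num_neurons) hidden activations recorded
    while running calibration prompts for one dimension (e.g. a task).
    """
    return np.abs(activations).mean(axis=0)

def prune_mask(importance: np.ndarray, keep_ratio: float) -> np.ndarray:
    """Boolean mask that keeps the top `keep_ratio` fraction of neurons."""
    k = max(1, int(round(keep_ratio * importance.size)))
    threshold = np.sort(importance)[-k]  # k-th largest score
    return importance >= threshold

# Toy example: 8 neurons whose activation scales differ, 100 calibration samples.
rng = np.random.default_rng(0)
scales = np.array([0.1, 2.0, 0.2, 1.5, 0.05, 1.0, 0.3, 0.8])
acts = rng.normal(size=(100, 8)) * scales

imp = neuron_importance(acts)
mask = prune_mask(imp, keep_ratio=0.5)  # keep half the neurons
print("kept neurons:", np.flatnonzero(mask))
```

In a real setting the mask would be applied structurally, e.g. by slicing the rows/columns of the MLP weight matrices that correspond to pruned neurons, and per-dimension masks for language, domain, and task would be combined before removing any neuron.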