🤖 AI Summary
Existing LLM pruning methods suffer from poor efficiency under multi-user concurrent inference, exhibiting linear growth in processing latency with increasing request count. To address this, we propose UniCuCo, a universal and customizable compression framework centered on StratNet—a strategy network that maps arbitrary compression requirements (e.g., target sparsity, hardware constraints) to optimal structured pruning policies. Crucially, we introduce a Gaussian process surrogate model to approximate non-differentiable pruning evaluation metrics, enabling end-to-end differentiable training. This marks the first realization of a “train-once, adapt-many” paradigm for generalizable pruning strategy generation. Experiments demonstrate that UniCuCo achieves a 28× inference speedup under 64 concurrent requests while maintaining accuracy competitive with state-of-the-art methods.
📝 Abstract
Existing pruning methods for large language models (LLMs) focus on achieving high compression rates while maintaining model performance. Although these methods perform well when handling a single user's compression request, their processing time grows linearly with the number of requests, making them inefficient for real-world scenarios with multiple simultaneous requests. To address this limitation, we propose a Universal Model for Customized Compression (UniCuCo) for LLMs, which introduces a StratNet that learns to map arbitrary requests to their optimal pruning strategies. The challenge in training StratNet lies in the high computational cost of evaluating pruning strategies and in the non-differentiable nature of the pruning process, which blocks gradient backpropagation for StratNet updates. To overcome these challenges, we leverage a Gaussian process to approximate the evaluation process. Since the gradient of the Gaussian process is computable, we can use it to approximate the gradient of the non-differentiable pruning process, thereby enabling StratNet updates. Experimental results show that UniCuCo is 28 times faster than baselines when processing 64 requests, while maintaining accuracy comparable to the baselines.