🤖 AI Summary
Existing LLM pruning methods suffer from poor efficiency under multi-user concurrent inference, exhibiting linear growth in processing latency with increasing request count. To address this, we propose UniCuCo, a universal and customizable compression framework centered on StratNet—a strategy network that maps arbitrary compression requirements (e.g., target sparsity, hardware constraints) to optimal structured pruning policies. Crucially, we introduce a Gaussian process surrogate model to approximate non-differentiable pruning evaluation metrics, enabling end-to-end differentiable training. This marks the first realization of a “train-once, adapt-many” paradigm for generalizable pruning strategy generation. Experiments demonstrate that UniCuCo achieves a 28× inference speedup under 64 concurrent requests while maintaining accuracy competitive with state-of-the-art methods.
📝 Abstract
Existing pruning methods for large language models (LLMs) focus on achieving high compression rates while maintaining model performance. Although these methods perform well when handling a single user's compression request, their processing time grows linearly with the number of requests, making them inefficient for real-world scenarios with multiple simultaneous requests. To address this limitation, we propose a Universal Model for Customized Compression (UniCuCo) for LLMs, which introduces a StratNet that learns to map arbitrary requests to their optimal pruning strategies. The challenge in training StratNet lies in the high computational cost of evaluating pruning strategies and in the non-differentiable nature of the pruning process, which blocks gradient backpropagation for StratNet updates. To overcome these challenges, we leverage a Gaussian process to approximate the evaluation process. Since the gradient of the Gaussian process is computable, we can use it to approximate the gradient of the non-differentiable pruning process, thereby enabling StratNet updates. Experimental results show that UniCuCo is 28 times faster than baselines when processing 64 requests, while maintaining accuracy comparable to the baselines.