🤖 AI Summary
Existing pruning methods for LLM edge deployment and large-scale inference face critical limitations: width pruning disrupts the standard Transformer architecture or requires custom inference code, while depth pruning removes entire layers and causes sharp accuracy degradation.
Method: We propose the first unified framework that jointly prunes rare vocabulary tokens and FFN intermediate channels. The approach scores token and channel importance using common-token-weighted activations, aligning pruning decisions with the post-pruning token distribution, and is training-free and adaptive at both the vocabulary level and the width dimension. Channel pruning at the FFN intermediate layers requires no fine-tuning or custom inference code and fully preserves compatibility with standard Transformer implementations.
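As a rough illustration of the width-wise step, the sketch below scores each FFN intermediate channel by its activation magnitude weighted by calibration-token frequency, then keeps the top-scoring channels. All names (`prune_ffn_channels`), shapes, and the exact scoring rule are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def prune_ffn_channels(W_up, W_down, activations, token_weights, keep_ratio=0.5):
    """Keep the top-`keep_ratio` FFN channels by token-weighted activation.

    W_up:          (d_model, d_ff) up-projection weights
    W_down:        (d_ff, d_model) down-projection weights
    activations:   (n_tokens, d_ff) FFN hidden activations on calibration tokens
    token_weights: (n_tokens,) per-token weight, e.g. corpus frequency
    """
    # Importance of channel c: sum over tokens t of weight_t * |activation_{t,c}|
    importance = (token_weights[:, None] * np.abs(activations)).sum(axis=0)
    k = max(1, int(keep_ratio * W_up.shape[1]))
    keep = np.sort(np.argsort(importance)[-k:])  # retained channel indices, in order
    # Slice both projections consistently so the pruned FFN stays a standard
    # (smaller) pair of dense layers -- no custom inference code needed.
    return W_up[:, keep], W_down[keep, :], keep

# Toy usage with random calibration data
rng = np.random.default_rng(0)
W_up, W_down = rng.standard_normal((8, 16)), rng.standard_normal((16, 8))
acts, freqs = rng.standard_normal((32, 16)), rng.random(32)
up_p, down_p, kept = prune_ffn_channels(W_up, W_down, acts, freqs, keep_ratio=0.25)
```

Because both projections are sliced along the same channel index, the result is an ordinary smaller FFN, which is what keeps the pruned model compatible with standard Transformer code paths.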
Results: Evaluated on Qwen, LLaMA, and Gemma models (0.5B–70B), our method significantly reduces parameter count, GPU memory footprint, and end-to-end latency while matching or outperforming state-of-the-art pruning methods on downstream tasks, achieving superior deployment efficiency without sacrificing accuracy.
📝 Abstract
Making LLMs more efficient in memory, latency, and serving cost is crucial for edge deployment, interactive applications, and sustainable inference at scale. Pruning is a key technique toward this goal. However, prior pruning methods are limited: width pruning often breaks the standard transformer layout or requires custom inference code, while depth pruning removes entire layers and can cause abrupt accuracy drops. In this work, we propose COMPACT, which jointly (i) prunes rare vocabulary to shrink the embedding/unembedding matrices and (ii) prunes FFN intermediate channels using common-token-weighted activations, aligning importance scores with the post-pruning token distribution. COMPACT combines the merits of depth and width pruning: deployment-friendliness (it keeps a standard transformer architecture), scale-adaptivity (vocabulary and FFN pruning can be traded off), training-free operation with competitive pruning time, and strong memory savings alongside throughput gains. Experiments across the Qwen, LLaMA, and Gemma families (0.5B–70B) show state-of-the-art downstream task performance at similar or higher pruning ratios, with substantial reductions in parameters, GPU memory, and end-to-end latency.
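The vocabulary-pruning half, step (i), can be sketched in the same spirit: drop the rows of the embedding and unembedding matrices for the rarest tokens, keeping only the most frequent ones. This is a minimal sketch under assumed shapes and a hypothetical `prune_vocabulary` helper; a real pipeline would also remap token ids and have the tokenizer fall back to retained subwords for dropped tokens.

```python
import numpy as np

def prune_vocabulary(embed, unembed, token_counts, keep_ratio=0.75):
    """Keep only the most frequent tokens' rows in both vocab matrices.

    embed:        (vocab, d_model) input embedding matrix
    unembed:      (vocab, d_model) output (unembedding) matrix
    token_counts: (vocab,) corpus frequency of each token id
    """
    k = max(1, int(keep_ratio * embed.shape[0]))
    keep = np.sort(np.argsort(token_counts)[-k:])  # most frequent token ids
    # Both matrices are sliced by the same ids, so logits stay aligned with
    # the reduced vocabulary.
    return embed[keep], unembed[keep], keep

# Toy usage: a 10-token vocabulary where id i occurred i times
embed = np.zeros((10, 4))
unembed = np.zeros((10, 4))
counts = np.arange(10)
emb_p, unemb_p, kept = prune_vocabulary(embed, unembed, counts, keep_ratio=0.5)
```

Since embedding and unembedding tables dominate parameter count in small models, this knob matters most at the small end of the 0.5B–70B range, which is one way to read the paper's scale-adaptivity claim.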