🤖 AI Summary
Existing pruning methods for LLM edge deployment and large-scale inference face critical limitations: width pruning disrupts the standard Transformer architecture or requires custom inference code, while depth pruning removes entire layers and causes sharp accuracy degradation.
Method: We propose the first unified framework that jointly prunes rare vocabulary tokens and FFN intermediate channels. The approach scores token and channel importance using common-token-weighted activations, aligning pruning decisions with the post-pruning token distribution, and is training-free and adaptive at both the vocabulary level and the width dimension. Channel pruning at the FFN intermediate layers requires no fine-tuning or custom inference code and fully preserves compatibility with standard Transformer implementations.
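As a rough illustration of the width-wise step, the sketch below scores each FFN intermediate channel by its activation magnitude weighted by calibration-token frequency, then keeps the top-scoring channels. All names (`prune_ffn_channels`), shapes, and the exact scoring rule are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def prune_ffn_channels(W_up, W_down, activations, token_weights, keep_ratio=0.5):
    """Keep the top-`keep_ratio` FFN channels by token-weighted activation.

    W_up:          (d_model, d_ff) up-projection weights
    W_down:        (d_ff, d_model) down-projection weights
    activations:   (n_tokens, d_ff) FFN hidden activations on calibration tokens
    token_weights: (n_tokens,) per-token weight, e.g. corpus frequency
    """
    # Importance of channel c: sum over tokens t of weight_t * |activation_{t,c}|
    importance = (token_weights[:, None] * np.abs(activations)).sum(axis=0)
    k = max(1, int(keep_ratio * W_up.shape[1]))
    keep = np.sort(np.argsort(importance)[-k:])  # retained channel indices, in order
    # Slice both projections consistently so the pruned FFN stays a standard
    # (smaller) pair of dense layers -- no custom inference code needed.
    return W_up[:, keep], W_down[keep, :], keep

# Toy usage with random calibration data
rng = np.random.default_rng(0)
W_up, W_down = rng.standard_normal((8, 16)), rng.standard_normal((16, 8))
acts, freqs = rng.standard_normal((32, 16)), rng.random(32)
up_p, down_p, kept = prune_ffn_channels(W_up, W_down, acts, freqs, keep_ratio=0.25)
```

Because both projections are sliced along the same channel index, the result is an ordinary smaller FFN, which is what keeps the pruned model compatible with standard Transformer code paths.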
Results: Evaluated on Qwen, LLaMA, and Gemma models (0.5B–70B), our method significantly reduces parameter count, GPU memory footprint, and end-to-end latency while matching or outperforming state-of-the-art pruning methods on downstream tasks, achieving superior deployment efficiency without sacrificing accuracy.
📝 Abstract
Making LLMs more efficient in memory, latency, and serving cost is crucial for edge deployment, interactive applications, and sustainable inference at scale. Pruning is a key technique toward this goal. However, prior pruning methods are limited: width pruning often breaks the standard transformer layout or requires custom inference code, while depth pruning removes entire layers and can cause abrupt accuracy drops. In this work, we propose COMPACT, which jointly (i) prunes rare vocabulary to shrink the embedding/unembedding matrices and (ii) prunes FFN intermediate channels using common-token-weighted activations, aligning importance scores with the post-pruning token distribution. COMPACT combines the merits of depth and width pruning: deployment-friendliness (it keeps a standard transformer architecture), scale-adaptivity (vocabulary and FFN pruning can be traded off), training-free operation with competitive pruning time, and strong memory savings alongside throughput gains. Experiments across the Qwen, LLaMA, and Gemma families (0.5B–70B) show state-of-the-art downstream task performance at similar or higher pruning ratios, with substantial reductions in parameters, GPU memory, and end-to-end latency.
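The vocabulary-pruning half, step (i), can be sketched in the same spirit: drop the rows of the embedding and unembedding matrices for the rarest tokens, keeping only the most frequent ones. This is a minimal sketch under assumed shapes and a hypothetical `prune_vocabulary` helper; a real pipeline would also remap token ids and have the tokenizer fall back to retained subwords for dropped tokens.

```python
import numpy as np

def prune_vocabulary(embed, unembed, token_counts, keep_ratio=0.75):
    """Keep only the most frequent tokens' rows in both vocab matrices.

    embed:        (vocab, d_model) input embedding matrix
    unembed:      (vocab, d_model) output (unembedding) matrix
    token_counts: (vocab,) corpus frequency of each token id
    """
    k = max(1, int(keep_ratio * embed.shape[0]))
    keep = np.sort(np.argsort(token_counts)[-k:])  # most frequent token ids
    # Both matrices are sliced by the same ids, so logits stay aligned with
    # the reduced vocabulary.
    return embed[keep], unembed[keep], keep

# Toy usage: a 10-token vocabulary where id i occurred i times
embed = np.zeros((10, 4))
unembed = np.zeros((10, 4))
counts = np.arange(10)
emb_p, unemb_p, kept = prune_vocabulary(embed, unembed, counts, keep_ratio=0.5)
```

Since embedding and unembedding tables dominate parameter count in small models, this knob matters most at the small end of the 0.5B–70B range, which is one way to read the paper's scale-adaptivity claim.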