Leveraging KV Similarity for Online Structured Pruning in LLMs

πŸ“… 2025-12-07
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing LLM structured pruning methods rely on offline calibration datasets and therefore suffer from poor generalization and low stability. To address this, we propose Token Filtering, a calibration-free online dynamic pruning method that measures token redundancy via joint key-value similarity and skips redundant attention computations in real time during inference. We further introduce a variance-aware weighted fusion strategy that adds zero memory overhead and improves robustness. By exploiting features intrinsic to the attention mechanism, our approach enables fine-grained structured pruning. Extensive experiments on LLaMA-2 and Mistral models show that, at a 50% pruning ratio, Token Filtering significantly outperforms state-of-the-art structured pruning methods on benchmarks such as MMLU while delivering superior efficiency, accuracy, and stability.

πŸ“ Abstract
Pruning has emerged as a promising direction for accelerating large language model (LLM) inference, yet existing approaches often suffer from instability because they rely on offline calibration data that may not generalize across inputs. In this work, we introduce Token Filtering, a lightweight online structured pruning technique that makes pruning decisions directly during inference without any calibration data. The key idea is to measure token redundancy via joint key-value similarity and skip redundant attention computations, thereby reducing inference cost while preserving critical information. To further enhance stability, we design a variance-aware fusion strategy that adaptively weights key and value similarity across heads, ensuring that informative tokens are retained even under high pruning ratios. This design introduces no additional memory overhead and provides a more reliable criterion for token importance. Extensive experiments on LLaMA-2 (7B/13B), LLaMA-3 (8B), and Mistral (7B) demonstrate that Token Filtering consistently outperforms prior structured pruning methods, preserving accuracy on commonsense reasoning benchmarks and maintaining strong performance on challenging tasks such as MMLU, even with 50% pruning.
Problem

Research questions and friction points this paper is trying to address.

Offline calibration data may not generalize across inputs, making existing structured pruning unstable
How to measure token redundancy cheaply enough to skip attention computations online, without calibration
How to keep token-importance decisions stable across attention heads, even at high pruning ratios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Online structured pruning without calibration data
Token redundancy measured via key-value similarity
Variance-aware fusion strategy for adaptive weighting
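The variance-aware fusion idea from the bullets above, adaptively weighting key similarity against value similarity across heads, might look like the sketch below. The inverse-style weighting (giving the higher-variance, presumably more discriminative, score channel more weight) is an assumption; the paper's actual weighting rule is not specified here.

```python
import numpy as np

def variance_aware_fusion(key_sims, value_sims):
    """Fuse per-head key and value similarity scores into one
    token-redundancy score per position.
    key_sims, value_sims: (num_heads, seq_len) similarity scores.
    Assumption: the channel with higher variance within a head is
    treated as more discriminative and receives a larger weight.
    """
    var_k = key_sims.var(axis=-1, keepdims=True)    # (num_heads, 1)
    var_v = value_sims.var(axis=-1, keepdims=True)
    w_k = var_k / (var_k + var_v + 1e-8)            # per-head weight in [0, 1)
    fused = w_k * key_sims + (1.0 - w_k) * value_sims
    return fused.mean(axis=0)                       # average heads -> (seq_len,)
```

Because the weights are computed from statistics already present in the similarity scores, a scheme like this needs no stored calibration state, consistent with the zero-additional-memory claim.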
πŸ”Ž Similar Papers
No similar papers found.