Rotation-Aligned Key Channel Pruning for Efficient Vision-Language Model Inference

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the KV cache pressure that arises in vision-language model (VLM) inference because image encoders emit a large number of visual tokens. Existing token pruning methods relieve this pressure, but often at the cost of fine-grained visual information. The authors propose RotateK, a framework that, under a fixed KV cache budget, uses online principal component analysis (PCA) to rotate and align Key channels, so that structured channel-wise pruning can shrink each Key and leave room to retain more visual tokens. RotateK pairs lightweight head-level masks with a custom fused Triton attention kernel that operates directly on the sparse-channel Keys for efficient decoding. In experiments on two representative VLM backbones, RotateK consistently outperforms prior Key channel pruning methods in both accuracy and decoding latency, and joint token-channel pruning improves over token-only pruning at matched KV cache budgets.
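The fixed-budget argument can be illustrated with simple arithmetic. The sketch below is not taken from the paper: the function name `tokens_within_budget`, its parameters, and the cost model are assumptions made for illustration. It assumes only Keys are channel-pruned, Values keep their full width, and mask or index overhead is ignored.

```python
def tokens_within_budget(num_tokens, head_dim, budget_fraction, key_keep_ratio):
    """Count how many tokens fit in a KV cache budget for one attention head.

    Hypothetical cost model (an assumption, not the paper's accounting):
    each cached token stores one Key vector and one Value vector, only the
    Key is channel-pruned, and mask/index overhead is ignored.
    """
    full_cost_per_token = 2 * head_dim            # full Key + full Value
    budget = budget_fraction * num_tokens * full_cost_per_token
    kept_key_dim = int(head_dim * key_keep_ratio)  # Key channels that survive
    cost_per_token = kept_key_dim + head_dim       # pruned Key + full Value
    return min(num_tokens, int(budget // cost_per_token))


if __name__ == "__main__":
    # Token-only pruning: Keys keep every channel.
    print(tokens_within_budget(2000, 128, 0.5, 1.0))
    # Joint pruning: half of the Key channels are dropped.
    print(tokens_within_budget(2000, 128, 0.5, 0.5))
```

Under this toy model, a 50% budget on 2000 tokens with 128-dimensional heads holds 1000 tokens when Keys are left intact and 1333 tokens when half of the Key channels are pruned, which is the sense in which channel compression preserves more visual tokens at the same memory cost.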

📝 Abstract

Vision-Language Models suffer severe KV cache pressure at inference, as a single image often encodes into thousands of tokens. Most existing methods exploit token sparsity through token pruning, but permanently discarding visual content causes substantial degradation on fine-grained perception tasks. This motivates a complementary axis, feature sparsity: under a fixed KV cache budget, compressing the channel dimension preserves more visual tokens at the same memory cost. Prior Key channel pruning methods, however, face a structural trade-off: token-wise channel pruning is expressive but unstructured and slow, while the head-wise approach is hardware-friendly but less robust. We resolve this with RotateK, a rotation-based structured Key channel pruning framework. RotateK applies an online PCA-based rotation that aligns token-dependent channel importance into a shared low-dimensional subspace, enabling accurate pruning under lightweight head-wise masks; a fused Triton attention kernel operates directly on sparse-channel Keys for efficient decoding. Experiments on two representative VLM backbones show that RotateK consistently outperforms prior Key channel pruning in both accuracy and decoding latency, while joint token-channel pruning improves over token-only baselines at matched KV cache budgets.
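The rotation idea can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a batch eigendecomposition of the uncentered Key second-moment matrix in place of the paper's online PCA, handles a single attention head, and omits the softmax scaling and the fused Triton kernel. The function names `pca_rotation`, `prune_rotated_keys`, and `approx_scores` are hypothetical. The sketch relies on the fact that an orthogonal rotation R leaves attention logits unchanged, since (qR)·(kR) = q·k, so after rotation one head-wide set of leading channels can be kept for every token.

```python
import numpy as np


def pca_rotation(keys):
    """Return an orthogonal (head_dim, head_dim) rotation for one head.

    keys: array of shape (num_tokens, head_dim).
    Columns are eigenvectors of the uncentered second-moment matrix,
    sorted by decreasing eigenvalue (an assumption; batch, not online, PCA).
    """
    second_moment = keys.T @ keys / keys.shape[0]
    eigvals, eigvecs = np.linalg.eigh(second_moment)  # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order]


def prune_rotated_keys(keys, rotation, keep):
    """Rotate Keys, then keep the same first `keep` channels for all tokens."""
    return (keys @ rotation)[:, :keep]


def approx_scores(query, pruned_keys, rotation):
    """Attention logits computed from the pruned, rotated Keys only."""
    keep = pruned_keys.shape[1]
    rotated_query = query @ rotation
    return pruned_keys @ rotated_query[:keep]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic Keys that mostly live in an 8-dimensional subspace.
    keys = rng.standard_normal((256, 8)) @ rng.standard_normal((8, 64))
    keys += 0.01 * rng.standard_normal(keys.shape)
    query = rng.standard_normal(64)

    rotation = pca_rotation(keys)
    pruned = prune_rotated_keys(keys, rotation, keep=8)
    error = np.abs(approx_scores(query, pruned, rotation) - keys @ query).max()
    print("max logit error with 8 of 64 channels:", error)
```

Because the kept channel set is identical for every token in the head, the pruned Keys form a dense, narrower matrix, which is what makes a head-wise mask hardware-friendly. Without the rotation, the important channels would differ from token to token and a single head-wise mask would discard more useful signal.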

Problem

Research questions and friction points this paper is trying to address.

KV cache

channel pruning

vision-language models

feature sparsity

inference efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

structured pruning

rotation alignment

KV cache compression