🤖 AI Summary
Large vision-language models (LVLMs) face severe memory bottlenecks during long-sequence inference as the KV cache grows, and existing importance-based KV compression methods overlook modality-specific semantic redundancy, in particular the heterogeneous redundancy across attention heads, leading to insufficient semantic coverage.
Method: We first uncover and model head-level semantic redundancy distributions in multimodal KV caches, then propose a joint importance-diversity optimization framework for KV compression. Our approach employs attention-head feature analysis to design an adaptive hybrid strategy, compatible with mainstream methods (e.g., SnapKV, AdaKV) and extensible to LLMs.
Results: Under extreme compression budgets (64 tokens), our method achieves an average performance gain of 5.1% across benchmarks, with up to 9.0% improvement on GUI grounding tasks, while preserving both inference efficiency and semantic integrity.
📝 Abstract
Recent large vision-language models (LVLMs) demonstrate remarkable capabilities in processing extended multi-modal sequences, yet the resulting key-value (KV) cache expansion creates a critical memory bottleneck that fundamentally limits deployment scalability. While existing KV cache compression methods focus on retaining high-importance KV pairs to minimize storage, they often overlook the modality-specific semantic redundancy patterns that emerge distinctively in multi-modal KV caches. In this work, we first analyze how, beyond simple importance, the KV cache in LVLMs exhibits varying levels of redundancy across attention heads. We show that relying solely on importance can only cover a subset of the full KV cache information distribution, leading to potential loss of semantic coverage. To address this, we propose MixKV, a novel method that mixes importance with diversity for optimized KV cache compression in LVLMs. MixKV adapts to head-wise semantic redundancy, selectively balancing diversity and importance when compressing KV pairs. Extensive experiments demonstrate that MixKV consistently enhances existing methods across multiple LVLMs. Under extreme compression (budget=64), MixKV improves baseline methods by an average of 5.1% across five multi-modal understanding benchmarks and achieves remarkable gains of 8.0% and 9.0% for SnapKV and AdaKV on GUI grounding tasks, all while maintaining comparable inference efficiency. Furthermore, MixKV extends seamlessly to LLMs with comparable performance gains. Our code is available at https://github.com/xuyang-liu16/MixKV.
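To make the importance-diversity trade-off concrete, below is a minimal, hypothetical sketch of per-head KV selection. It is not the paper's implementation: the function name, the greedy selection loop, and the use of cosine similarity between keys as a redundancy proxy are all illustrative assumptions. The mixing weight `alpha` stands in for the paper's head-wise adaptation (a more redundant head would get a smaller `alpha`, i.e. more weight on diversity).

```python
import numpy as np

def mixed_kv_select(keys, importance, budget, alpha=0.5):
    """Illustrative sketch: pick `budget` KV positions for one attention head
    by mixing an importance score with a diversity bonus.

    keys:       (seq_len, head_dim) key vectors for one head
    importance: (seq_len,) importance scores (e.g. pooled attention weights,
                as in SnapKV-style methods)
    alpha:      weight on importance vs. diversity; in a head-adaptive scheme
                this would be derived from the head's estimated redundancy
    """
    seq_len = keys.shape[0]
    # L2-normalize keys so dot products are cosine similarities
    norm = keys / (np.linalg.norm(keys, axis=1, keepdims=True) + 1e-8)
    selected, remaining = [], list(range(seq_len))
    # Greedy: repeatedly take the best mixed score, where the diversity term
    # penalizes similarity to keys that were already selected.
    while len(selected) < min(budget, seq_len):
        best, best_score = None, -np.inf
        for i in remaining:
            redundancy = max((norm[i] @ norm[j] for j in selected), default=0.0)
            score = alpha * importance[i] + (1.0 - alpha) * (1.0 - redundancy)
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        remaining.remove(best)
    return sorted(selected)

# Example: with alpha=1.0 the diversity term vanishes and selection reduces
# to plain top-k by importance, mimicking a purely importance-based baseline.
rng = np.random.default_rng(0)
keys = rng.normal(size=(10, 4))
imp = np.arange(10, dtype=float)  # token 9 is most important
print(mixed_kv_select(keys, imp, budget=3, alpha=1.0))  # -> [7, 8, 9]
```

Setting `alpha` below 1.0 trades some raw importance for broader semantic coverage, which is the behavior the abstract attributes to redundancy-heavy heads.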