HybridKV: Hybrid KV Cache Compression for Efficient Multimodal Large Language Model Inference

📅 2026-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the significant GPU memory and latency overhead caused by visual inputs in multimodal large language model inference, where each visual input substantially expands the key-value (KV) cache. The authors propose a hybrid KV cache compression framework that, for the first time, categorizes attention heads based on their heterogeneous behaviors and integrates text-prior pruning for static heads with chunk-wise retrieval for dynamic heads. A top-down hierarchical budget allocation strategy assigns KV budgets across these head types, overcoming the limitations of conventional single-granularity compression within a unified framework. Evaluated on Qwen2.5-VL-7B, the method achieves up to 7.9× KV cache compression and a 1.52× decoding speedup with negligible performance degradation; in some cases, slight improvements are even observed.
📝 Abstract
Multimodal Large Language Models (MLLMs) have advanced unified reasoning over text, images, and videos, but their inference is hindered by the rapid growth of key-value (KV) caches. Each visual input expands into thousands of tokens, causing caches to scale linearly with context length and remain resident in GPU memory throughout decoding, which leads to prohibitive memory overhead and latency even on high-end GPUs. A common solution is to compress caches under a fixed allocated budget at different granularities: token-level uniformly discards less important tokens, layer-level varies retention across layers, and head-level redistributes budgets across heads. Yet these approaches stop at allocation and overlook the heterogeneous behaviors of attention heads that require distinct compression strategies. We propose HybridKV, a hybrid KV cache compression framework that integrates complementary strategies in three stages: heads are first classified into static or dynamic types using text-centric attention; then a top-down budget allocation scheme hierarchically assigns KV budgets; finally, static heads are compressed by text-prior pruning and dynamic heads by chunk-wise retrieval. Experiments on 11 multimodal benchmarks with Qwen2.5-VL-7B show that HybridKV reduces KV cache memory by up to $7.9\times$ and achieves $1.52\times$ faster decoding, with almost no performance drop or even higher relative to the full-cache MLLM.
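The three-stage pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under simplifying assumptions, not the paper's implementation: the exact head classifier, budget allocator, and scoring functions are not specified in the abstract, and all function names, thresholds, and the concentration-based classification criterion below are hypothetical.

```python
import numpy as np

def classify_heads(attn, threshold=0.8):
    """Stage 1 (hypothetical criterion): call a head 'static' if most of its
    text-centric attention mass concentrates on a small fraction of tokens.
    attn: [num_heads, num_tokens] attention averaged over text queries."""
    top_k = attn.shape[-1] // 4
    top_mass = np.sort(attn, axis=-1)[:, -top_k:].sum(axis=-1)
    return np.where(top_mass >= threshold, "static", "dynamic")

def compress_static(attn_row, budget):
    """Stage 3a, text-prior pruning: keep the `budget` cached tokens with the
    highest text-query attention scores; the rest are evicted."""
    keep = np.argsort(attn_row)[-budget:]
    return np.sort(keep)

def compress_dynamic(attn_row, budget, chunk=4):
    """Stage 3b, chunk-wise retrieval: score fixed-size contiguous chunks by
    total attention mass and retain whole chunks until the budget is filled."""
    ids = np.arange(len(attn_row))
    chunks = [ids[i:i + chunk] for i in range(0, len(ids), chunk)]
    order = np.argsort([attn_row[c].sum() for c in chunks])[::-1]
    kept, used = [], 0
    for ci in order:
        if used + len(chunks[ci]) > budget:
            break  # sketch: stop at the first chunk that would overflow
        kept.extend(chunks[ci])
        used += len(chunks[ci])
    return np.sort(np.array(kept, dtype=int))

# Example: one concentrated ("static") and one uniform ("dynamic") head
# over 16 cached tokens.
attn = np.vstack([np.eye(16)[-1], np.full(16, 1 / 16)])
print(classify_heads(attn))  # ['static' 'dynamic']
```

Stage 2 (top-down budget allocation) is reduced here to the per-head `budget` argument; in the paper it is a hierarchical scheme that distributes the global KV budget before the per-head compressors run.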
Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models
KV cache
memory overhead
inference efficiency
visual tokens
Innovation

Methods, ideas, or system contributions that make the work stand out.

KV cache compression
multimodal large language models
attention head heterogeneity
hybrid compression framework
efficient inference
Bowen Zeng
The State Key Laboratory of Blockchain and Data Security, Zhejiang University; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security
Feiyang Ren
The State Key Laboratory of Blockchain and Data Security, Zhejiang University; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security
Jun Zhang
Bosch Security Systems B.V.
Computer Vision, Machine Learning, Image Processing
Xiaoling Gu
Hangzhou Dianzi University, Hangzhou, China
Ke Chen
Associate Professor of Computer Science, Zhejiang University
database system
Lidan Shou
Professor of Computer Science, Zhejiang University
Database, Data & Knowledge Management, ML Systems
Huan Li
ZJU100 Young Professor
AI Data Preparation, Efficient AI, Spatiotemporal Data