KVCapsule: Efficient Sequential KV Cache Compression for Vision-Language Models with Asymmetric Redundancy

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the sharp growth in KV cache memory that vision-language models incur during autoregressive generation. The growth is driven by the large number of visual tokens, and existing text-oriented compression methods do not mitigate it adequately. The paper proposes KVCapsule, a plug-and-play KV cache compression framework tailored to vision-language models. Building on a structural analysis of how attention redundancy differs between visual and textual tokens, KVCapsule adds lightweight, asymmetric compression and reconstruction modules that leave the pretrained backbone and the attention mechanism unmodified. Experiments across multiple vision-language models and benchmarks report up to a 2.4× reduction in KV cache memory at a 60% compression ratio and up to 2× higher throughput, with negligible degradation in generation quality and task accuracy.

📝 Abstract

Vision-Language Models (VLMs) have emerged as a critical and fast-growing extension of Large Language Models (LLMs) that enable multimodal reasoning through both text and image inputs. Although VLMs enrich the capabilities of language models, they also inherit and amplify a key computational bottleneck: the memory overhead caused by the large key-value (KV) cache during autoregressive decoding. This challenge is particularly severe in VLMs, where images produce longer token sequences and denser feature representations than text. Moreover, the spatial and information-rich nature of vision tokens introduces structured attention patterns that make many LLM-oriented KV cache compression techniques ineffective when applied directly to VLMs. In this work, we conduct a detailed empirical analysis of the behavior of vision tokens, highlighting the critical differences from purely text-based models. Based on these insights, we propose KVCapsule, a novel KV cache compression framework for vision tokens. KVCapsule keeps the pretrained VLM backbone frozen, requires no modification to the attention computation modules, and can be integrated into existing VLMs through lightweight compression and reconstruction components. We evaluate KVCapsule on multiple VLMs and benchmark tasks, demonstrating up to a 2x improvement in tokens per second (TPS) and a 2.4x reduction in KV cache memory at a 60% compression ratio, with negligible degradation in accuracy or response quality. Our findings offer practical pathways to scale VLM inference under constrained memory budgets and inspire further research into structure-aware cache compression for multimodal models.
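
To make the memory argument concrete, the following is a minimal back-of-the-envelope sketch of KV cache size and of how compressing only the vision-token entries affects the whole cache. Everything in it is an assumption for illustration: the model dimensions, the token counts, the helper names kv_cache_bytes and overall_reduction, and the reading of "compression ratio" as the fraction of vision-token cache removed. None of these come from the paper, and this is not KVCapsule's method.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for num_tokens across all layers.

    The leading factor of 2 accounts for storing both K and V.
    bytes_per_elem=2 assumes 16-bit storage (hypothetical default).
    """
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_elem


def overall_reduction(vision_fraction, vision_compression_ratio):
    """Whole-cache reduction factor when only vision-token entries shrink.

    vision_fraction: share of cached tokens that are vision tokens (0..1).
    vision_compression_ratio: assumed fraction of vision-token cache removed.
    """
    remaining = 1.0 - vision_fraction * vision_compression_ratio
    return 1.0 / remaining


# Hypothetical model configuration and prompt mix (not from the paper).
LAYERS, KV_HEADS, HEAD_DIM = 32, 32, 128
VISION_TOKENS, TEXT_TOKENS = 2880, 120

total = kv_cache_bytes(VISION_TOKENS + TEXT_TOKENS, LAYERS, KV_HEADS, HEAD_DIM)
vision_fraction = VISION_TOKENS / (VISION_TOKENS + TEXT_TOKENS)
factor = overall_reduction(vision_fraction, 0.6)

print(f"full cache: {total / 2**20:.0f} MiB")
print(f"vision share of cache: {vision_fraction:.0%}")
print(f"overall reduction: {factor:.2f}x")
```

Under these assumptions, removing 60% of the entire cache would correspond to a 2.5x reduction, and any uncompressed portion (for example, text tokens) pulls the overall factor somewhat below that. The paper's own accounting of the reported 2.4x figure may differ.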

Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models

KV Cache Compression

Autoregressive Decoding

Memory Overhead

Multimodal Reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

KV Cache Compression

Vision-Language Models

Asymmetric Redundancy