Make Your LVLM KV Cache More Lightweight

📅 2026-05-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the GPU memory overhead of the key-value (KV) cache in large vision-language models (LVLMs) during inference, which stems from the large number of vision tokens processed in the prefill stage. The authors propose LightKV, which uses the text prompt to guide cross-modality message passing: informative messages are aggregated across vision tokens, and the tokens are progressively compressed during prefill, exploiting the redundancy among vision-token embeddings. Unlike prior approaches that compress using visual information alone, LightKV is prompt-aware. Retaining only 55% of the original vision tokens, it halves the vision-token KV cache size and reduces computation by up to 40%, while preserving general-purpose performance and significantly outperforming existing baselines across eight open-source LVLMs and eight public benchmark datasets.
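To make the idea of prompt-guided aggregation concrete, here is a minimal NumPy sketch. It is not the authors' algorithm: the function name `prompt_guided_merge`, the cosine-similarity relevance score, and the single-step top-k plus nearest-neighbor averaging are all assumptions swapped in for illustration, whereas the paper describes progressive compression via cross-modality message passing during prefill.

```python
import numpy as np


def prompt_guided_merge(vision, text, keep_ratio=0.55):
    """Hypothetical one-step sketch of text-guided vision-token merging.

    vision: (Nv, d) vision-token embeddings
    text:   (Nt, d) text-prompt embeddings
    Returns (merged, keep_idx): merged has shape (k, d).
    """
    def normalize(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

    v, t = normalize(vision), normalize(text)

    # Relevance of each vision token: best cosine similarity to any text token.
    relevance = (v @ t.T).max(axis=1)

    # Keep the k most prompt-relevant tokens, preserving their original order.
    k = max(1, int(round(keep_ratio * len(vision))))
    keep_idx = np.sort(np.argsort(-relevance)[:k])
    keep_mask = np.zeros(len(vision), dtype=bool)
    keep_mask[keep_idx] = True
    drop_idx = np.flatnonzero(~keep_mask)

    merged = vision[keep_idx].astype(np.float64)
    counts = np.ones(k)
    if len(drop_idx) > 0:
        # Each dropped token passes its embedding to its most similar kept token.
        nearest = (v[drop_idx] @ v[keep_idx].T).argmax(axis=1)
        np.add.at(merged, nearest, vision[drop_idx])
        np.add.at(counts, nearest, 1.0)

    # Each kept token becomes the mean of itself and the tokens merged into it.
    return merged / counts[:, None], keep_idx


rng = np.random.default_rng(0)
vision_tokens = rng.normal(size=(20, 8))  # toy sizes, not from the paper
text_tokens = rng.normal(size=(5, 8))
merged_tokens, kept = prompt_guided_merge(vision_tokens, text_tokens)
print(merged_tokens.shape)
```

Only the merged tokens would then contribute keys and values to the cache, which is where the memory saving comes from in a scheme of this kind.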
📝 Abstract
Key-Value (KV) cache has become a de facto component of modern Large Vision-Language Models (LVLMs) for inference. While it enhances decoding efficiency in Large Language Models (LLMs), its direct adoption in LVLMs introduces substantial GPU memory overhead due to the large number of vision tokens processed during the prefill stage. To tackle this problem, we propose LightKV, a novel approach that reduces KV cache size by exploiting the redundancy among vision-token embeddings. Guided by text prompts, LightKV employs cross-modality message passing to aggregate informative messages across vision tokens and progressively compress them during prefill. This prompt-aware guidance distinguishes our method from prior vision-only compression strategies. We evaluate LightKV on eight open-source LVLMs across eight public benchmark datasets, e.g., MME and SeedBench. Experimental results demonstrate that with only 55% of the original vision tokens, LightKV (a) halves the vision-token KV cache size, (b) reduces computation by up to 40%, and (c) preserves general-purpose performance while significantly outperforming existing baselines.
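As a rough illustration of why vision tokens dominate KV cache memory, the following sketch counts the bytes needed to store keys and values. The model shape (32 layers, 32 KV heads, head dimension 128, fp16) and the 576 vision tokens are hypothetical values chosen for the arithmetic, not figures from the paper. The function takes a per-layer token count because compression is described as progressive, so different layers may cache different numbers of tokens.

```python
def kv_cache_bytes(tokens_per_layer, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes used by cached keys and values, summed over layers.

    tokens_per_layer: list with the number of cached tokens at each layer.
    """
    # Factor 2: one key vector and one value vector per token, per head.
    per_token_per_layer = 2 * num_kv_heads * head_dim * bytes_per_elem
    return sum(tokens_per_layer) * per_token_per_layer


# Hypothetical configuration (assumed for illustration only).
NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM = 32, 32, 128
VISION_TOKENS = 576

full = kv_cache_bytes([VISION_TOKENS] * NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM)
half = kv_cache_bytes([VISION_TOKENS // 2] * NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM)

print(f"full vision-token cache: {full / 2**20:.0f} MiB")
print(f"half as many cached tokens: {half / 2**20:.0f} MiB")
```

Because the cost is linear in the number of cached tokens at each layer, any reduction in vision tokens translates directly into a proportional saving at that layer.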
Problem

Research questions and friction points this paper is trying to address.

KV cache
Large Vision-Language Models
GPU memory overhead
vision tokens
inference efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

LightKV
KV cache compression
vision-language models
cross-modality message passing
prompt-aware compression