Q Cache: Visual Attention is Valuable in Less than Half of Decode Layers for Multimodal Large Language Model

πŸ“… 2026-02-02
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the high computational overhead and excessive KV cache consumption in multimodal large language models, primarily caused by redundant visual tokens from the vision encoder. Existing token pruning methods often compromise cache integrity, adversely affecting long-text generation. The authors first observe that visual attention patterns across more than half of the decoder layers exhibit strong similarity. Leveraging this insight, they propose Lazy Attentionβ€”a mechanism that enables cross-layer sharing of similar attention maps and introduces a lightweight Q Cache for query reuse. The approach is compatible with existing inference frameworks, orthogonal to token pruning techniques, and supports FlashAttention. Experiments demonstrate over 35% reduction in KV cache usage and a 1.5Γ— throughput improvement across multiple benchmarks, with only ~1% performance degradation while still outperforming current state-of-the-art pruning methods in accuracy.

πŸ“ Abstract
Multimodal large language models (MLLMs) are plagued by exorbitant inference costs attributable to the profusion of visual tokens produced by the vision encoder. These redundant visual tokens engender a substantial computational load and a key-value (KV) cache footprint bottleneck. Existing approaches focus on token-wise optimization, leveraging diverse intricate token pruning techniques to eliminate non-crucial visual tokens. Nevertheless, these methods often unavoidably undermine the integrity of the KV cache, resulting in failures on long-text generation tasks. To this end, we conduct an in-depth investigation into the attention mechanism of the model from a new perspective, and discern that the attention maps in more than half of the decode layers are semantically similar. Based on this finding, we contend that the attention in certain layers can be streamlined by inheriting the attention from their preceding layers. Consequently, we propose Lazy Attention, an efficient attention mechanism that enables cross-layer sharing of similar attention patterns, thereby reducing redundant layer-wise attention computation. Within Lazy Attention, we develop a novel layer-shared cache, Q Cache, tailored for MLLMs, which facilitates the reuse of queries across adjacent layers. Notably, Q Cache is lightweight and fully compatible with existing inference frameworks, including FlashAttention and the KV cache. Additionally, our method is highly flexible: it is orthogonal to existing token-wise techniques and can be deployed independently or combined with token pruning approaches. Empirical evaluations on multiple benchmarks demonstrate that our method reduces KV cache usage by over 35% and achieves a 1.5× throughput improvement, while sacrificing only approximately 1% of performance across various MLLMs. Compared with SOTA token-wise methods, our technique achieves superior accuracy preservation.
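The cross-layer sharing idea described in the abstract can be sketched roughly as follows. This is a toy single-head NumPy illustration under assumed names (`lazy_decode`, `attn_cache`, `lazy_mask`), not the paper's implementation: the actual Q Cache reuses queries across adjacent layers and remains compatible with FlashAttention, whereas this sketch simply reuses the full attention map of the last non-lazy layer.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def lazy_decode(x, layers, lazy_mask):
    """Toy decoder stack with cross-layer attention sharing.

    x         : (seq_len, d) hidden states
    layers    : list of (Wq, Wk, Wv) projection matrices, one per layer
    lazy_mask : list of bools; True means the layer inherits attention
                from the most recent non-lazy layer instead of
                recomputing it (the "Lazy Attention" idea).
    """
    attn_cache = None  # stands in for the paper's layer-shared cache
    for (Wq, Wk, Wv), lazy in zip(layers, lazy_mask):
        if lazy and attn_cache is not None:
            attn = attn_cache  # reuse: skip the QK^T + softmax work
        else:
            q, k = x @ Wq, x @ Wk
            attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))
            attn_cache = attn  # refresh the shared cache
        v = x @ Wv
        x = x + attn @ v  # residual update
    return x
```

In this sketch, every lazy layer skips one QKᵀ matmul and one softmax, which is where the layer-wise savings come from; the reported KV cache reduction additionally relies on details beyond this illustration.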
Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models
KV Cache
Visual Tokens
Inference Cost
Attention Redundancy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lazy Attention
Q Cache
cross-layer attention sharing
multimodal LLMs
KV cache optimization
πŸ”Ž Similar Papers
No similar papers found.
Jiedong Zhuang (Zhejiang University, Alibaba Cloud Computing)
Lu Lu (Alibaba Cloud Computing)
Ming Dai (SouthEast University; MLLM, Visual Grounding, Image Retrieval)
Rui Hu (Zhejiang University)
Jian Chen (Alibaba Group; processor architecture, performance modeling, workload characterization)
Qiang Liu (Alibaba Cloud Computing)
Haoji Hu (Zhejiang University, China; Machine Learning, Computer Vision, Deep Learning)