🤖 AI Summary
This work addresses the high computational overhead and excessive KV cache consumption in multimodal large language models, primarily caused by redundant visual tokens from the vision encoder. Existing token pruning methods often compromise cache integrity, adversely affecting long-text generation. The authors first observe that visual attention patterns across more than half of the decoder layers exhibit strong similarity. Leveraging this insight, they propose Lazy Attention, a mechanism that enables cross-layer sharing of similar attention maps, and introduce a lightweight Q Cache for query reuse. The approach is compatible with existing inference frameworks, orthogonal to token pruning techniques, and supports FlashAttention. Experiments on multiple benchmarks demonstrate over 35% reduction in KV cache usage and a 1.5× throughput improvement, with only ~1% performance degradation, while still outperforming current state-of-the-art pruning methods in accuracy.
📝 Abstract
Multimodal large language models (MLLMs) suffer from high inference costs due to the profusion of visual tokens produced by the vision encoder. These redundant visual tokens create a substantial bottleneck in both computational load and key-value (KV) cache footprint. Existing approaches focus on token-wise optimization, leveraging diverse and intricate token pruning techniques to eliminate non-crucial visual tokens. Nevertheless, these methods often unavoidably undermine the integrity of the KV cache, resulting in failures on long-text generation tasks. To address this, we conduct an in-depth investigation into the model's attention mechanism from a new perspective and discern that the attention maps in more than half of the decoder layers are semantically similar. Based on this finding, we contend that the attention in certain layers can be streamlined by inheriting the attention from their preceding layers. Consequently, we propose Lazy Attention, an efficient attention mechanism that enables cross-layer sharing of similar attention patterns, reducing layer-wise redundant computation in attention. Within Lazy Attention, we develop a novel layer-shared cache, Q Cache, tailored for MLLMs, which facilitates the reuse of queries across adjacent layers. Notably, Q Cache is lightweight and fully compatible with existing inference frameworks, including FlashAttention and the KV cache. Additionally, our method is highly flexible: it is orthogonal to existing token-wise techniques and can be deployed independently or combined with token pruning approaches. Empirical evaluations on multiple benchmarks demonstrate that our method reduces KV cache usage by over 35% and achieves a 1.5× throughput improvement, while sacrificing only approximately 1% of performance across various MLLMs. Compared with SOTA token-wise methods, our technique achieves superior accuracy preservation.
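The core idea, as described, can be illustrated with a minimal sketch: a "lazy" layer skips its own QK^T and softmax and instead inherits the attention map computed by the preceding layer, applying it to its own values. This is not the authors' implementation; the function and variable names (`attention_layer`, `shared_attn`) are illustrative, and the toy single-head, unbatched shapes are chosen only for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_layer(q, k, v, shared_attn=None):
    """Single-head attention over a (seq, dim) sequence.

    If shared_attn is provided, the QK^T and softmax steps are skipped
    entirely and the preceding layer's attention map is reused
    (the cross-layer sharing idea behind Lazy Attention).
    """
    if shared_attn is None:
        d = q.shape[-1]
        shared_attn = softmax(q @ k.T / np.sqrt(d))
    return shared_attn @ v, shared_attn

rng = np.random.default_rng(0)
seq, dim = 4, 8
q = rng.normal(size=(seq, dim))
k = rng.normal(size=(seq, dim))
v1 = rng.normal(size=(seq, dim))
v2 = rng.normal(size=(seq, dim))

# Layer i: compute attention normally and cache the resulting map.
out1, attn = attention_layer(q, k, v1)

# Layer i+1 is "lazy": it inherits layer i's attention map, so it needs
# no query/key projections of its own -- which is also why its keys need
# not be kept in the KV cache, shrinking the cache footprint.
out2, _ = attention_layer(None, None, v2, shared_attn=attn)
```

In a real decoder this sharing would only be enabled for the layer pairs whose attention maps are empirically similar; how those layers are selected, and how the Q Cache interacts with FlashAttention, is specific to the paper's method and not captured by this toy example.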