🤖 AI Summary
This work addresses the inefficiency of existing KV cache eviction strategies in multimodal large language models, which fail to account for the heterogeneous attention distributions between textual and visual tokens, thereby compromising both performance and memory efficiency. To resolve this, the authors propose HAE, a hierarchical adaptive eviction framework that introduces a layered cache management mechanism tailored to the distinct attention characteristics of text and vision tokens. During prefilling, HAE employs dual attention pruning; during decoding, it dynamically evicts tokens based on attention variance and visual token sparsity. The framework further integrates index broadcasting and an OS-inspired recycle-bin mechanism to enable efficient, low-error cache reclamation. Evaluated on Phi-3.5-Vision-Instruct, HAE reduces KV cache memory by 41% with only a 0.3% drop in image understanding accuracy, while accelerating story generation inference by 1.5× without degrading output quality.
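The paper does not publish its implementation here, but the core idea of the summary — modality-aware eviction with a recycle bin — can be sketched in a few lines. The following is a minimal illustration, not HAE's actual algorithm: it assumes per-token accumulated attention scores and a modality mask, keeps the top-scoring tokens under separate text/visual budgets (hypothetical `text_budget` and `visual_budget` parameters), and moves evicted indices into a recoverable "recycle bin" rather than deleting them outright.

```python
import numpy as np

def modality_aware_evict(attn_scores, is_visual, text_budget, visual_budget):
    """Illustrative sketch (not the paper's exact method): rank KV cache
    entries by accumulated attention within each modality, keep the top
    tokens under per-modality budgets, and park the rest in a recycle
    bin so they can be restored if later needed."""
    idx = np.arange(len(attn_scores))
    text_idx = idx[~is_visual]
    vis_idx = idx[is_visual]
    # Sort each modality by descending attention score and keep its budget.
    keep_text = text_idx[np.argsort(attn_scores[text_idx])[::-1][:text_budget]]
    keep_vis = vis_idx[np.argsort(attn_scores[vis_idx])[::-1][:visual_budget]]
    keep = np.sort(np.concatenate([keep_text, keep_vis]))
    recycle_bin = np.setdiff1d(idx, keep)  # evicted but recoverable
    return keep, recycle_bin

# Example: 3 text tokens then 3 visual tokens, budgets of 2 and 1.
scores = np.array([0.9, 0.1, 0.5, 0.8, 0.2, 0.7])
mask = np.array([False, False, False, True, True, True])
keep, binned = modality_aware_evict(scores, mask, text_budget=2, visual_budget=1)
```

A smaller `visual_budget` reflects the visual-token sparsity the summary describes: visual tokens receive more diffuse attention, so fewer of them need to stay resident in the cache.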
📝 Abstract
The integration of visual information into Large Language Models (LLMs) has enabled Multimodal LLMs (MLLMs), but the quadratic computational cost of attention and the growing memory footprint of the KV cache in Transformer architectures remain bottlenecks. Existing KV cache eviction strategies fail to address the heterogeneous attention distributions between visual and text tokens, leading to suboptimal efficiency or degraded performance. In this paper, we propose Hierarchical Adaptive Eviction (HAE), a KV cache eviction framework that optimizes text-visual token interaction in MLLMs by applying Dual-Attention Pruning during prefilling (leveraging visual token sparsity and attention variance) and a Dynamic Decoding Eviction Strategy (inspired by OS recycle bins) during decoding. HAE minimizes KV cache usage across layers, reduces computational overhead via index broadcasting, and theoretically ensures superior information integrity and lower error bounds compared to greedy strategies, enhancing efficiency in both comprehension and generation tasks. Empirically, HAE reduces KV cache memory by 41% with minimal accuracy loss (a 0.3% drop) in image understanding tasks and accelerates story generation inference by 1.5× while maintaining output quality on the Phi-3.5-Vision-Instruct model.