Hierarchical Adaptive Eviction for KV Cache Management in Multimodal Language Models

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the inefficiency of existing KV cache eviction strategies in multimodal large language models, which fail to account for the heterogeneous attention distributions between textual and visual tokens, thereby compromising both performance and memory efficiency. To resolve this, the authors propose HAE, a hierarchical adaptive eviction framework that introduces a layered cache management mechanism tailored to the distinct attention characteristics of text and vision tokens. During prefilling, HAE employs dual attention pruning; during decoding, it dynamically evicts tokens based on attention variance and visual token sparsity. The framework further integrates index broadcasting and an OS-inspired recycle-bin mechanism to enable efficient, low-error cache reclamation. Evaluated on Phi-3.5-Vision-Instruct, HAE reduces KV cache memory by 41% with only a 0.3% drop in image understanding accuracy, while accelerating story generation inference by 1.5× without degrading output quality.
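The mechanics described above can be illustrated with a minimal sketch. This is not the authors' implementation: the scoring rule, the 0.5 down-weight for visual tokens, and the `evict_kv_cache` helper are all illustrative assumptions; it only shows the general shape of ranking cache entries by accumulated attention, applying a separate treatment to sparse visual tokens, and parking evicted entries in a bounded recycle bin instead of discarding them outright.

```python
from collections import deque

def evict_kv_cache(attn_scores, is_visual, budget, bin_capacity=16):
    """Hypothetical sketch of attention-score-based KV cache eviction
    with an OS-style recycle bin, loosely following the HAE description.

    attn_scores: per-token accumulated attention weights
    is_visual:   parallel list of bools marking visual tokens
    budget:      number of token entries to keep live in the cache
    """
    n = len(attn_scores)
    # Rank tokens by accumulated attention; visual tokens attend more
    # sparsely, so (purely as an assumption here) down-weight them so
    # they are evicted earlier than equally-scored text tokens.
    ranked = sorted(
        range(n),
        key=lambda i: attn_scores[i] * (0.5 if is_visual[i] else 1.0),
        reverse=True,
    )
    keep = sorted(ranked[:budget])
    # Recycle bin: hold the evicted entries in a bounded buffer so they
    # can be restored cheaply if later decoding steps need them again,
    # rather than being lost the moment they are pruned.
    recycle_bin = deque(sorted(ranked[budget:]), maxlen=bin_capacity)
    return keep, list(recycle_bin)

scores = [0.9, 0.1, 0.6, 0.05, 0.7, 0.2]
visual = [False, True, True, True, False, False]
keep, binned = evict_kv_cache(scores, visual, budget=3)
# keep   -> [0, 2, 4]  (highest effective attention survives)
# binned -> [1, 3, 5]  (evicted, but recoverable from the bin)
```

In a real MLLM the scores would come from the model's attention maps per layer and head, and the kept indices would gather the corresponding key/value tensors; the sketch keeps only the index bookkeeping.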

📝 Abstract
The integration of visual information into Large Language Models (LLMs) has enabled Multimodal LLMs (MLLMs), but the quadratic memory and computational costs of Transformer architectures remain a bottleneck. Existing KV cache eviction strategies fail to address the heterogeneous attention distributions between visual and text tokens, leading to suboptimal efficiency or degraded performance. In this paper, we propose Hierarchical Adaptive Eviction (HAE), a KV cache eviction framework that optimizes text-visual token interaction in MLLMs by implementing Dual-Attention Pruning during pre-filling (leveraging visual token sparsity and attention variance) and a Dynamic Decoding Eviction Strategy (inspired by OS Recycle Bins) during decoding. HAE minimizes KV cache usage across layers, reduces computational overhead via index broadcasting, and theoretically ensures superior information integrity and lower error bounds compared to greedy strategies, enhancing efficiency in both comprehension and generation tasks. Empirically, HAE reduces KV cache memory by 41% with minimal accuracy loss (a 0.3% drop) in image understanding tasks, and accelerates story generation inference by 1.5× while maintaining output quality on the Phi-3.5-Vision-Instruct model.
Problem

Research questions and friction points this paper is trying to address.

KV cache eviction
multimodal language models
attention distribution heterogeneity
visual-text token interaction
Transformer memory bottleneck
Innovation

Methods, ideas, or system contributions that make the work stand out.

KV cache eviction
multimodal LLMs
adaptive pruning
dynamic decoding
hierarchical eviction
Xindian Ma
College of Intelligence and Computing, Tianjin University, Tianjin, China
Yidi Lu
College of Intelligence and Computing, Tianjin University, Tianjin, China
Peng Zhang
Professor, Tianjin University
Information Retrieval, Machine Learning, Natural Language Processing
Jing Zhang
Tianjin University
Artificial Intelligence, Quantum Tensor Network