AI Summary
This work addresses the memory and computational bottlenecks caused by the rapid growth of the KV cache during inference in vision-language models. To mitigate this, the authors propose a parameter-efficient, multimodal-aware framework that converts off-the-shelf models to a multi-head latent attention architecture, compressing the KV cache and accelerating inference. Key innovations include a modality-adaptive partial-RoPE mechanism, a modality-decoupled low-rank approximation, and an efficient fine-tuning strategy that minimizes output activation error rather than parameter distance. With minimal supervised data, the method recovers the original model's performance across three mainstream vision-language models while substantially reducing the KV cache memory footprint and remaining naturally compatible with KV quantization techniques.
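As a rough illustration only (toy NumPy code, not the paper's implementation, and without the paper's modality-adaptive criterion for choosing which dimensions to keep), "partial RoPE" can be sketched as zeroing the rotation angles of nonessential dimension pairs so those dimensions become position-free and can later be folded into a low-rank latent projection:

```python
import numpy as np

def partial_rope(x, pos, keep):
    """Apply rotary position embedding only to the dimension pairs
    flagged True in `keep`; masked pairs get a zero rotation angle,
    i.e. they carry no positional signal (NoPE).

    x:    (seq, d) activations, d even
    pos:  (seq,) token positions
    keep: (d//2,) boolean mask over interleaved dimension pairs
    """
    d = x.shape[-1]
    # Standard RoPE frequency schedule; `keep` zeroes the masked pairs.
    freqs = 1.0 / (10000.0 ** (np.arange(d // 2) / (d // 2)))
    ang = np.outer(pos, freqs) * keep          # (seq, d//2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin         # 2D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
keep = np.array([True, False, True, False])    # rotate pairs 0 and 2 only
y = partial_rope(x, np.arange(4), keep)
```

Masked pairs pass through unchanged, while kept pairs are rotated as in standard RoPE; either way each pair's norm is preserved.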
Abstract
As vision-language models (VLMs) tackle increasingly complex and multimodal tasks, the rapid growth of Key-Value (KV) cache imposes significant memory and computational bottlenecks during inference. While Multi-Head Latent Attention (MLA) offers an effective means to compress the KV cache and accelerate inference, adapting existing VLMs to the MLA architecture without costly pretraining remains largely unexplored. In this work, we present MHA2MLA-VLM, a parameter-efficient and multimodal-aware framework for converting off-the-shelf VLMs to MLA. Our approach features two core techniques: (1) a modality-adaptive partial-RoPE strategy that supports both traditional and multimodal settings by selectively masking nonessential dimensions, and (2) a modality-decoupled low-rank approximation method that independently compresses the visual and textual KV spaces. Furthermore, we introduce parameter-efficient fine-tuning to minimize adaptation cost and demonstrate that minimizing output activation error, rather than parameter distance, substantially reduces performance loss. Extensive experiments on three representative VLMs show that MHA2MLA-VLM restores original model performance with minimal supervised data, significantly reduces KV cache footprint, and integrates seamlessly with KV quantization.