Task-Aware KV Compression For Cost-Effective Long Video Understanding

📅 2025-06-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational overhead of multimodal large language models (MLLMs) in long-video understanding (LVU) and the severe information loss incurred by existing key-value (KV) compression methods at high compression ratios, this paper proposes a task-aware bi-level KV compression and selective re-loading mechanism. The method requires no additional training and is plug-and-play compatible with mainstream MLLMs. It jointly optimizes inference efficiency and critical-information preservation by generating low- and high-compression-ratio KV caches in parallel and dynamically re-loading compressed tokens based on importance scoring. Evaluated on benchmarks including VideoMME and MLVU, the approach significantly outperforms existing KV-compression techniques, achieving substantial reductions in GPU memory consumption and FLOPs while simultaneously improving overall video-understanding performance.

📝 Abstract
Long-video understanding (LVU) remains a severe challenge for existing multimodal large language models (MLLMs), primarily due to the prohibitive computational cost. Recent approaches have explored KV compression to mitigate this issue, but they often suffer from significant information loss at high compression ratios. In this paper, we introduce Video-X^2L, which flexibly preserves critical video information for each LVU task. Video-X^2L involves two key operations. The first one is called bi-level KV compression. During the MLLM's pre-filling stage, Video-X^2L generates two types of compressed KVs: low-compression KVs (L-KVs) to capture fine-grained video details and high-compression KVs (H-KVs) to offer compact video representations. The second one is called selective KV re-loading. During the MLLM's decoding stage, Video-X^2L selectively re-loads L-KVs for the most critical video chunks while using H-KVs for other less important ones. This allows the MLLM to fully utilize task-specific information while maintaining overall compactness. Video-X^2L is simple yet effective: it is free from additional training and directly compatible with existing KV-compressible MLLMs. We evaluate Video-X^2L on a variety of popular LVU benchmarks, including VideoMME, MLVU, LongVideoBench, and VNBench. Our experimental results show that Video-X^2L outperforms existing KV-compression methods by a substantial margin while substantially reducing the computation cost.
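The first operation, bi-level KV compression, can be sketched in a simplified form. In the snippet below, mean-pooling over consecutive token vectors stands in for whatever compression operator the base MLLM actually supports; the function names, ratios, and the list-of-vectors KV representation are illustrative assumptions, not details from the paper.

```python
def compress(tokens, ratio):
    """Mean-pool groups of `ratio` consecutive token vectors.

    An illustrative stand-in for the model's actual KV compression
    operator; `tokens` is a list of equal-length float vectors.
    """
    n = (len(tokens) // ratio) * ratio  # drop a ragged tail, if any
    return [
        [sum(col) / ratio for col in zip(*tokens[i:i + ratio])]
        for i in range(0, n, ratio)
    ]

def bi_level_compress(chunk_kv, low_ratio=2, high_ratio=8):
    """Produce fine-grained L-KVs and compact H-KVs for one video chunk."""
    return compress(chunk_kv, low_ratio), compress(chunk_kv, high_ratio)
```

Because both caches are produced in the same pre-filling pass, the decoding stage can later choose between them per chunk without re-running the encoder.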
Problem

Research questions and friction points this paper is trying to address.

Reduce computational cost in long-video understanding tasks
Minimize information loss at high compression ratios
Enhance task-specific video information preservation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bi-level KV compression for video details
Selective KV re-loading for critical chunks
Training-free compatibility with existing MLLMs
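The selective re-loading step listed above can be sketched as a budgeted upgrade: every chunk defaults to its compact H-KVs, and the highest-scoring chunks are upgraded to fine-grained L-KVs while a token budget allows. All names here (`select_kvs`, `importance`, `budget`) are illustrative assumptions; the paper's actual importance-scoring mechanism is not reproduced.

```python
def select_kvs(l_kvs, h_kvs, importance, budget):
    """Pick L-KVs (fine-grained) for the top-scoring chunks and H-KVs
    (compact) for the rest, staying within a total token budget."""
    # Rank chunks from most to least important (scores are assumed given,
    # e.g. derived from query-to-chunk attention in the real system).
    order = sorted(range(len(importance)),
                   key=lambda i: importance[i], reverse=True)
    chosen = [h_kvs[i] for i in range(len(h_kvs))]  # default: compact KVs
    used = sum(len(kv) for kv in chosen)
    for i in order:
        extra = len(l_kvs[i]) - len(h_kvs[i])  # cost of upgrading chunk i
        if used + extra <= budget:
            chosen[i] = l_kvs[i]  # re-load fine-grained KVs for this chunk
            used += extra
    return chosen
```

For example, with three chunks of 8 fine-grained and 2 compact tokens each and a budget of 14, only the single most important chunk gets upgraded to its L-KVs.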
Minghao Qin
Beijing Academy of Artificial Intelligence
Yan Shu
University of Trento, Harbin Institute of Technology
Vision and Language · Multi-modal Learning · Video Understanding · OCR · Remote Sensing
Peitian Zhang
Beijing Academy of Artificial Intelligence, Renmin University of China
Kun Lun
Beijing Academy of Artificial Intelligence, Institute of Automation, CAS, Beijing, China
Huaying Yuan
Beijing Academy of Artificial Intelligence, Renmin University of China
Junjie Zhou
Beijing Academy of Artificial Intelligence, Beijing University of Posts and Telecommunications
Shitao Xiao
Beijing University of Posts and Telecommunications
Bo Zhao
Beijing Academy of Artificial Intelligence, Shanghai Jiao Tong University
Zheng Liu
Beijing Academy of Artificial Intelligence, Hong Kong Polytechnic University