Video-XL-2: Towards Very Long-Video Understanding Through Task-Aware KV Sparsification

📅 2025-06-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing long-video understanding models suffer from prohibitive memory and computational overhead, struggling to balance performance and efficiency. To address this, we propose a task-aware KV sparsification framework that integrates chunk-based pre-filling with a bi-level KV decoding mechanism: intra-chunk full attention preserves fine-grained temporal modeling, while inter-chunk sparse attention dynamically selects task-relevant key-value pairs based on semantic relevance, enabling efficient KV cache compression. This substantially reduces the computational cost of long-sequence processing. The resulting model achieves state-of-the-art performance among open-source lightweight multimodal large language models (MLLMs) across multiple long-video understanding benchmarks. On a single A100 GPU, it enables real-time inference on videos exceeding 10,000 frames, processing thousands of frames in just seconds, marking the first approach to jointly achieve high accuracy and high efficiency at the ten-thousand-frame scale.
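Below is a minimal sketch of the chunk-based pre-filling idea, assuming a PyTorch-style attention stack. The chunk size, the mask construction, and the omission of the cross-chunk sparse path are illustrative assumptions rather than the paper's exact implementation; the point is simply that each token attends densely only within its own chunk, which keeps pre-filling cost close to linear in sequence length.

```python
import torch

def chunked_prefill_mask(num_tokens: int, chunk_size: int) -> torch.Tensor:
    """Boolean mask: query i may attend key j only inside the same chunk."""
    idx = torch.arange(num_tokens)
    chunk_id = idx // chunk_size
    same_chunk = chunk_id.unsqueeze(1) == chunk_id.unsqueeze(0)  # (n, n)
    causal = idx.unsqueeze(1) >= idx.unsqueeze(0)                # autoregressive
    return same_chunk & causal

def prefill_chunk_attention(q, k, v, chunk_size):
    """Scaled dot-product attention restricted to full intra-chunk attention.

    The paper additionally applies sparse attention across chunks during
    pre-filling; that path is omitted here for brevity.
    """
    n, d = q.shape[-2], q.shape[-1]
    scores = (q @ k.transpose(-2, -1)) / d ** 0.5
    mask = chunked_prefill_mask(n, chunk_size).to(q.device)
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# Example: 8,192 visual tokens split into chunks of 1,024 (toy shapes)
q = k = v = torch.randn(1, 8192, 64)
out = prefill_chunk_attention(q, k, v, chunk_size=1024)
```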

📝 Abstract
Multi-modal large language models (MLLMs) have made significant progress in video understanding over the past few years. However, processing long video inputs remains a major challenge due to high memory and computational costs. This makes it difficult for current models to achieve both strong performance and high efficiency in long video understanding. To address this challenge, we propose Video-XL-2, a novel MLLM that delivers superior cost-effectiveness for long-video understanding based on task-aware KV sparsification. The proposed framework operates with two key steps: chunk-based pre-filling and bi-level key-value decoding. Chunk-based pre-filling divides the visual token sequence into chunks, applying full attention within each chunk and sparse attention across chunks. This significantly reduces computational and memory overhead. During decoding, bi-level key-value decoding selectively reloads either dense or sparse key-values for each chunk based on its relevance to the task. This approach further improves memory efficiency and enhances the model's ability to capture fine-grained information. Video-XL-2 achieves state-of-the-art performance on various long video understanding benchmarks, outperforming existing open-source lightweight models. It also demonstrates exceptional efficiency, capable of processing over 10,000 frames on a single NVIDIA A100 (80GB) GPU and thousands of frames in just a few seconds.
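As a rough illustration of the bi-level key-value decoding step described above, the sketch below scores each chunk against a task (question) embedding and reloads dense KVs only for the most relevant chunks, keeping a compressed version elsewhere. The cosine-similarity scoring, the top-k selection, and all tensor layouts are assumptions for illustration; the paper's actual relevance criterion and compression scheme may differ.

```python
import torch

def bilevel_kv_select(query_emb, chunk_feats, dense_kvs, sparse_kvs, top_k=4):
    """Pick dense KVs for the most task-relevant chunks, sparse KVs otherwise.

    query_emb:   (d,) task/question embedding
    chunk_feats: (num_chunks, d) one summary vector per chunk
    dense_kvs:   list of (keys, values) at full resolution per chunk
    sparse_kvs:  list of (keys, values) at compressed resolution per chunk
    """
    # Cosine similarity between the task embedding and each chunk summary
    scores = torch.nn.functional.cosine_similarity(
        query_emb.unsqueeze(0), chunk_feats, dim=-1
    )
    relevant = set(scores.topk(min(top_k, len(dense_kvs))).indices.tolist())
    selected = [
        dense_kvs[i] if i in relevant else sparse_kvs[i]
        for i in range(len(dense_kvs))
    ]
    # Concatenate per-chunk KVs back into a single cache for decoding
    keys = torch.cat([k for k, _ in selected], dim=-2)
    values = torch.cat([v for _, v in selected], dim=-2)
    return keys, values

# Usage with toy shapes: 6 chunks, 2 of them reloaded densely
d, n_chunks = 32, 6
query_emb = torch.randn(d)
chunk_feats = torch.randn(n_chunks, d)
dense_kvs = [(torch.randn(128, d), torch.randn(128, d)) for _ in range(n_chunks)]
sparse_kvs = [(torch.randn(16, d), torch.randn(16, d)) for _ in range(n_chunks)]
keys, values = bilevel_kv_select(query_emb, chunk_feats, dense_kvs, sparse_kvs, top_k=2)
```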
Problem

Research questions and friction points this paper is trying to address.

Addressing high memory and computational costs in long-video understanding
Improving efficiency and performance in multi-modal large language models
Enhancing fine-grained information capture in long-video processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Task-aware KV sparsification for efficiency
Chunk-based pre-filling with sparse attention (see the cost sketch after this list)
Bi-level key-value decoding driven by task relevance
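To make the efficiency contribution concrete, here is a hedged back-of-envelope count of attention score computations, full attention versus intra-chunk-only attention. The token budget (16 visual tokens per frame) and chunk size are assumptions for illustration, not figures reported here, and the cross-chunk sparse cost and remainder chunk are ignored.

```python
# Back-of-envelope: full vs. chunked pre-filling attention (assumed numbers)
n = 10_000 * 16          # total visual tokens: ~10,000 frames x 16 tokens/frame
c = 2_048                # chunk size in tokens (assumed)

full_attn = n * n                  # pairwise scores under full attention
chunked = (n // c) * c * c         # intra-chunk full attention only
print(f"full: {full_attn:.2e}, chunked: {chunked:.2e}, "
      f"ratio: {full_attn / chunked:.0f}x")  # roughly n/c, here ~78x fewer scores
```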
Minghao Qin
Beijing Academy of Artificial Intelligence
Xiangrui Liu
Beijing Academy of Artificial Intelligence, Shanghai Jiao Tong University
Zhengyang Liang
Singapore Management University
Multimodal, Computer Vision
Yan Shu
University of Trento; Harbin Institute of Technology
Vision and Language, Multi-modal Learning, Video Understanding, OCR, Remote Sensing
Huaying Yuan
Beijing Academy of Artificial Intelligence, Renmin University of China
Junjie Zhou
Beijing Academy of Artificial Intelligence, Beijing University of Posts and Telecommunications
Shitao Xiao
BUPT
Bo Zhao
Beijing Academy of Artificial Intelligence, Shanghai Jiao Tong University
Zheng Liu
Beijing Academy of Artificial Intelligence, Hong Kong Polytechnic University