AdaReTaKe: Adaptive Redundancy Reduction to Perceive Longer for Video-language Understanding

📅 2025-03-16
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
To address the context-length limitation of multimodal large language models (MLLMs) in processing long videos, this paper proposes a training-free, adaptive video compression method. Unlike conventional uniform compression, our approach is the first to achieve theoretically grounded, heterogeneous redundancy reduction across both temporal steps and network layers. It jointly quantifies hierarchical temporal redundancy using information entropy and gradient sensitivity, dynamically optimizing compression ratios per layer and per frame while seamlessly integrating with existing MLLM architectures. Experiments demonstrate consistent improvements: average performance gains of 2.3% and 2.8% on four major benchmarks—including VideoMME—across 7B and 72B models, respectively; and up to 5.9% and 6.0% on the longest-video tasks of LVBench. Moreover, the method extends maximum supported input frames from 256 to 2048, significantly enhancing scalability for long-video understanding.

📝 Abstract
Multimodal Large Language Models (MLLMs) have revolutionized video understanding, yet are still limited by context length when processing long videos. Recent methods compress videos by leveraging visual redundancy uniformly, yielding promising results. Nevertheless, our quantitative analysis shows that redundancy varies significantly across time and model layers, necessitating a more flexible compression strategy. We propose AdaReTaKe, a training-free method that flexibly reduces visual redundancy by allocating compression ratios among time and layers with theoretical guarantees. Integrated into state-of-the-art MLLMs, AdaReTaKe improves processing capacity from 256 to 2048 frames while preserving critical information. Experiments on VideoMME, MLVU, LongVideoBench, and LVBench datasets demonstrate that AdaReTaKe outperforms existing methods by 2.3% and 2.8% for 7B and 72B models, respectively, with even greater improvements of 5.9% and 6.0% on the longest LVBench. Our code is available at https://github.com/SCZwangxiao/video-FlexReduc.git.
Problem

Research questions and friction points this paper is trying to address.

Addresses context length limitations in video understanding by MLLMs.
Proposes adaptive redundancy reduction that accounts for varying redundancy across video frames and model layers.
Extends processing capacity from 256 to 2048 frames without additional training.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive redundancy reduction across time and layers
Training-free method with theoretical compression guarantees
Enhances MLLMs to process up to 2048 frames
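The core allocation idea can be sketched as follows. This is a simplified illustration, not the paper's implementation (which is in the linked repository): per-frame redundancy is estimated here with a histogram-based Shannon entropy (one of the signals the summary mentions; the paper additionally uses gradient sensitivity), and a global token-keep budget is distributed so that low-entropy (more redundant) frames are compressed more aggressively. All function and variable names are hypothetical.

```python
import numpy as np

def frame_entropy(frame, bins=32):
    # Shannon entropy of the frame's value histogram.
    # Near-constant (redundant) frames concentrate in few bins -> low entropy;
    # content-rich frames spread across bins -> high entropy.
    hist, _ = np.histogram(frame, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]  # drop empty bins so log() is well-defined
    return float(-(p * np.log(p)).sum())

def allocate_keep_ratios(frames, budget):
    # Distribute a global token-keep budget (0 < budget <= 1) across frames,
    # proportionally to each frame's entropy, so the mean keep ratio
    # equals the budget (before clipping to [0, 1]).
    ent = np.array([frame_entropy(f) for f in frames])
    weights = ent / (ent.sum() + 1e-12)
    ratios = weights * budget * len(frames)
    return np.clip(ratios, 0.0, 1.0)

# Toy example: one static (all-zero) frame plus three noisy frames.
rng = np.random.default_rng(0)
frames = [np.zeros((8, 8))] + [rng.standard_normal((8, 8)) for _ in range(3)]
ratios = allocate_keep_ratios(frames, budget=0.25)
print(ratios)  # static frame receives the smallest keep ratio
```

In AdaReTaKe this kind of allocation is applied not only across frames but also across network layers, with theoretical guarantees on the compression; the sketch above covers only the per-frame intuition.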
Authors
Xiao Wang — Harbin Institute of Technology, Shenzhen
Qingyi Si — Huawei Technologies Co., Ltd.
Jianlong Wu — Professor, Harbin Institute of Technology (Shenzhen); Computer Vision, Multimodal Learning
Shiyu Zhu — Shandong University
Li Cao — Huawei Technologies Co., Ltd.
Liqiang Nie — Harbin Institute of Technology, Shenzhen