AdaCM$^2$: On Understanding Extremely Long-Term Video with Adaptive Cross-Modality Memory Reduction

📅 2024-11-19
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing long-video understanding methods predominantly rely on unimodal visual feature compression, neglecting the dynamic cross-modal interaction between visual content and textual queries. This leads to weak cross-modal alignment, information loss, and poor performance on complex video question answering. To address this, the paper proposes AdaCM$^2$, the first framework to apply adaptive cross-modality memory reduction to video-text alignment in an auto-regressive manner over video streams. By retaining only the visual tokens most relevant to the textual query, AdaCM$^2$ moves beyond fixed-size compression paradigms and strengthens long-range dependency modeling and fine-grained cross-modal alignment. Evaluated on video question answering, captioning, and classification benchmarks, AdaCM$^2$ achieves state-of-the-art performance, improving accuracy by 4.5% across multiple tasks on the LVU dataset while reducing GPU memory consumption by up to 65%.

📝 Abstract
The advancements in large language models (LLMs) have propelled improvements in video understanding tasks by combining LLMs with visual models. However, most existing LLM-based models (e.g., VideoLLaMA, VideoChat) are constrained to processing short-duration videos. Recent attempts understand long-term videos by extracting and compressing visual features into a fixed memory size. Nevertheless, those methods leverage only the visual modality to merge video tokens and overlook the correlation between visual features and textual queries, leading to difficulties in effectively handling complex question-answering tasks. To address the challenges of long videos and complex prompts, we propose AdaCM$^2$, which, for the first time, introduces an adaptive cross-modality memory reduction approach to video-text alignment in an auto-regressive manner on video streams. Our extensive experiments on various video understanding tasks, such as video captioning, video question answering, and video classification, demonstrate that AdaCM$^2$ achieves state-of-the-art performance across multiple datasets while significantly reducing memory usage. Notably, it achieves a 4.5% improvement across multiple tasks in the LVU dataset with a GPU memory consumption reduction of up to 65%.
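The core idea described in the abstract, keeping only the visual tokens most relevant to the textual query instead of compressing them unimodally, can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation; the function name, the scaled dot-product scoring, and the `keep_ratio` criterion are all assumptions.

```python
import numpy as np

def prune_visual_tokens(visual_tokens, query_tokens, keep_ratio=0.35):
    """Keep the visual tokens with the highest cross-modal relevance to the query.

    visual_tokens: (N, d) array of frame-token embeddings
    query_tokens:  (M, d) array of text-query embeddings
    keep_ratio:    fraction of visual tokens retained in memory
    """
    d = visual_tokens.shape[1]
    # Scaled dot-product scores between every visual token and every query token.
    scores = visual_tokens @ query_tokens.T / np.sqrt(d)      # (N, M)
    # Softmax over query tokens, then rate each visual token by its peak relevance.
    scores = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn = scores / scores.sum(axis=1, keepdims=True)
    relevance = attn.max(axis=1)                              # (N,)
    # Retain the top-k tokens, preserving their temporal order.
    k = max(1, int(keep_ratio * len(visual_tokens)))
    keep = np.sort(np.argsort(relevance)[-k:])
    return visual_tokens[keep]

rng = np.random.default_rng(0)
memory = prune_visual_tokens(rng.normal(size=(100, 32)),
                             rng.normal(size=(8, 32)))
print(memory.shape)  # (35, 32)
```

Because the scores depend on the query, a different question over the same video would retain a different subset of tokens, which is the cross-modal property the unimodal baselines lack.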
Problem

Research questions and friction points this paper is trying to address.

LLM-based video models (e.g., VideoLLaMA, VideoChat) are constrained to short-duration videos.
Existing long-video methods compress visual features unimodally, overlooking the correlation between visual content and textual queries and struggling on complex question answering.
Processing long videos incurs heavy GPU memory consumption.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive cross-modality memory reduction approach
Auto-regressive video-text alignment on streams
Significant memory usage reduction and performance improvement
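Applied auto-regressively over a stream, such query-aware pruning caps the memory at a fixed budget at every step. The following is a minimal sketch of that streaming loop; the budget, scoring rule, and loop structure are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def stream_memory(frame_token_batches, query_tokens, budget=64):
    """Auto-regressively fold incoming frame tokens into a bounded memory."""
    d = query_tokens.shape[1]
    memory = np.empty((0, d))
    for frame_tokens in frame_token_batches:
        # New frame tokens join the running cross-modal memory...
        memory = np.vstack([memory, frame_tokens])
        if len(memory) > budget:
            # ...and only the most query-relevant tokens survive, in order.
            scores = (memory @ query_tokens.T / np.sqrt(d)).max(axis=1)
            keep = np.sort(np.argsort(scores)[-budget:])
            memory = memory[keep]
    return memory

rng = np.random.default_rng(1)
frames = [rng.normal(size=(32, 16)) for _ in range(10)]   # 320 tokens total
mem = stream_memory(frames, rng.normal(size=(4, 16)))
print(mem.shape)  # (64, 16)
```

The memory footprint stays O(budget) regardless of video length, which is how a fixed-budget scheme like this can cut GPU memory while still answering query-specific questions.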
Yuanbin Man
Department of CSE, UT Arlington
Ying Huang
Department of CSE, UT Arlington
Chengming Zhang
University of Houston
Deep learning · NLP · HPC
Bingzhe Li
Assistant Professor of Computer Science, University of Texas at Dallas
Intelligent storage systems · Systems for AI/ML · DNA storage
Wei Niu
School of Computing, University of Georgia
Miao Yin
Department of CSE, UT Arlington