ReWind: Understanding Long Videos with Instructed Learnable Memory

📅 2024-11-23
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the high computational cost, memory constraints, and weak temporal coherence of long-video understanding, this paper proposes an instruction-driven learnable memory architecture. The architecture dynamically compresses and updates visual representations via a "read-perceive-write" cycle, integrates content-aware adaptive high-resolution frame sampling, and employs learnable queries with cross-attention to support LLM-based reasoning. Key contributions include: (1) the first instruction-guided dynamic memory module; (2) a memory update mechanism with linear complexity in the number of tokens; and (3) a frame selection paradigm that jointly optimizes temporal fidelity and efficiency. Experimental results demonstrate significant improvements: +12% accuracy and +13% VQA score on MovieChat-1K, and +8% mIoU on Charades-STA for temporal grounding.
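The "read-perceive-write" update could be sketched as memory slots cross-attending over incoming tokens. This is a minimal, hypothetical NumPy illustration: the slot count, dimensions, single-head attention, and blended write rule are assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    # Queries: memory slots (M, d); keys/values: frame + instruction tokens (N, d).
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)   # (M, N)
    return softmax(scores) @ keys_values            # (M, d)

def read_perceive_write(memory, frame_tokens, instruction_tokens, alpha=0.5):
    # "Read": slots attend over the incoming chunk, conditioned on the instruction.
    context = np.concatenate([frame_tokens, instruction_tokens], axis=0)
    read_out = cross_attention(memory, context)
    # "Write": blend the retrieved content back into the fixed-size memory,
    # so the per-step cost stays linear in the number of incoming tokens.
    return (1 - alpha) * memory + alpha * read_out

rng = np.random.default_rng(0)
memory = rng.normal(size=(64, 128))    # 64 learnable memory slots, dim 128
frames = rng.normal(size=(100, 128))   # tokens from one video chunk
instr = rng.normal(size=(12, 128))     # instruction tokens
memory = read_perceive_write(memory, frames, instr)
print(memory.shape)  # (64, 128)
```

Because the memory stays a fixed size regardless of video length, each chunk only costs one pass of cross-attention against its own tokens, which is where the linear scaling claim comes from.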

๐Ÿ“ Abstract
Vision-Language Models (VLMs) are crucial for applications requiring integrated understanding of textual and visual information. However, existing VLMs struggle with long videos due to computational inefficiency, memory limitations, and difficulties in maintaining coherent understanding across extended sequences. To address these challenges, we introduce ReWind, a novel memory-based VLM designed for efficient long video understanding while preserving temporal fidelity. ReWind operates in a two-stage framework. In the first stage, ReWind maintains a dynamic learnable memory module with a novel read-perceive-write cycle that stores and updates instruction-relevant visual information as the video unfolds. This module utilizes learnable queries and cross-attention between memory contents and the input stream, ensuring low memory requirements by scaling linearly with the number of tokens. In the second stage, we propose an adaptive frame selection mechanism guided by the memory content to identify instruction-relevant key moments. It enriches the memory representations with detailed spatial information by selecting a few high-resolution frames, which are then combined with the memory contents and fed into a Large Language Model (LLM) to generate the final answer. We empirically demonstrate ReWind's superior performance in visual question answering (VQA) and temporal grounding tasks, surpassing previous methods on long video benchmarks. Notably, ReWind achieves a +13% score gain and a +12% accuracy improvement on the MovieChat-1K VQA dataset and an +8% mIoU increase on Charades-STA for temporal grounding.
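The second stage (memory-guided key-frame selection) could look like scoring each frame against the memory contents and keeping the top-k high-resolution frames. The scoring rule below (best cosine similarity to any memory slot) and all shapes are illustrative assumptions, not the paper's exact mechanism.

```python
import numpy as np

def select_key_frames(frame_feats, memory, k=4):
    # Normalize so the dot product is cosine similarity.
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    m = memory / np.linalg.norm(memory, axis=1, keepdims=True)
    # Each frame's relevance: its best match against any memory slot.
    scores = (f @ m.T).max(axis=1)         # (num_frames,)
    # Indices of the k most relevant frames, returned in temporal order.
    return np.sort(np.argsort(scores)[-k:])

rng = np.random.default_rng(1)
frame_feats = rng.normal(size=(300, 128))  # one low-res feature per frame
memory = rng.normal(size=(64, 128))        # instruction-conditioned memory
keys = select_key_frames(frame_feats, memory, k=4)
print(keys.shape)  # (4,)
```

Only these few selected frames would then be re-encoded at high resolution, which keeps the detailed spatial pass cheap relative to processing the whole video at full resolution.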
Problem

Research questions and friction points this paper is trying to address.

Efficient understanding of long videos with memory limitations
Maintaining coherent visual-textual understanding across extended sequences
Improving performance in VQA and temporal grounding tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic learnable memory module with read-perceive-write cycle
Adaptive frame selection for key moments
Combines memory and frames for LLM input
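The last bullet, assembling the LLM input from memory and selected frames, might amount to a simple token-sequence concatenation. The ordering and shapes here are assumptions for illustration; the paper's actual interface to the LLM may differ.

```python
import numpy as np

def build_llm_input(memory, key_frame_tokens, instruction_tokens):
    # Assumed order: instruction first, then the compressed memory summary,
    # then the detailed tokens from the selected high-resolution frames.
    return np.concatenate([instruction_tokens, memory, key_frame_tokens], axis=0)

rng = np.random.default_rng(2)
memory = rng.normal(size=(64, 128))
frames = rng.normal(size=(4 * 16, 128))  # 4 key frames, 16 tokens each
instr = rng.normal(size=(12, 128))
seq = build_llm_input(memory, frames, instr)
print(seq.shape)  # (140, 128)
```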
🔎 Similar Papers