Scaling the Long Video Understanding of Multimodal Large Language Models via Visual Memory Mechanism

📅 2026-03-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal large language models are constrained by input length limitations, hindering their ability to effectively comprehend long videos. This work proposes FlexMem, a training-free visual memory mechanism that emulates human-like on-demand recall behavior. FlexMem enables efficient writing and migration of visual key-value (KV) caches through a dual-path compression strategy and supports flexible memory retrieval policies tailored to diverse tasks. To the best of our knowledge, this is the first approach to incorporate human-inspired memory mechanisms into long video understanding, overcoming the constraint of processing all frames simultaneously. Implemented on a single RTX 3090 GPU, FlexMem handles videos exceeding 1,000 frames and achieves state-of-the-art performance across five long-video benchmarks and one streaming task, surpassing existing efficient methods and, in some cases, matching or even outperforming GPT-4o and Gemini-1.5 Pro.
📝 Abstract
Long video understanding is a key challenge that hinders the advancement of \emph{Multimodal Large Language Models} (MLLMs). In this paper, we study this problem from the perspective of visual memory mechanisms and propose a novel, training-free approach termed \emph{Flexible Memory} (\textbf{FlexMem}). In principle, FlexMem aims to mimic human behavior when watching videos, \emph{i.e.}, continually watching the video content and recalling the most relevant memory fragments to answer a question. In this way, FlexMem can help MLLMs understand videos of unbounded length, unlike previous methods that process all video information at once and are bounded by an input upper limit. Concretely, FlexMem first treats the visual KV caches as memory sources and realizes effective memory transfer and writing via a dual-pathway compression design. FlexMem then explores different memory reading strategies for diverse video understanding tasks, including the popular streaming setting. To validate FlexMem, we apply it to two popular video-MLLMs and conduct extensive experiments on five long-video tasks and one streaming-video task. The experimental results show that, on \textbf{a single 3090 GPU}, FlexMem achieves clear improvements over existing efficient video understanding methods and processes more than \textbf{1k frames}, which also helps the base MLLMs achieve comparable or even better performance than SOTA MLLMs, \emph{e.g.}, GPT-4o and Gemini-1.5 Pro, on some benchmarks.
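The write-then-recall loop the abstract describes (compress each watched video chunk into a memory bank, then retrieve only the most relevant fragments for a query) can be sketched as below. This is a minimal illustration, not the paper's implementation: the names `MemoryBank`, `write`, and `read` are assumptions, plain vectors stand in for per-frame visual KV caches, and mean-pooling stands in for the dual-pathway compression.

```python
# Hedged sketch of an on-demand visual memory loop in the spirit of FlexMem.
# All names and the mean-pooling "compression" are illustrative assumptions,
# not the paper's actual design; vectors stand in for per-frame KV caches.
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class MemoryBank:
    """Stores compressed per-chunk summaries and recalls only the most
    relevant chunks for a query, instead of attending to every frame."""

    def __init__(self, top_k=2):
        self.entries = []  # list of (summary_vector, payload)
        self.top_k = top_k

    def write(self, frame_vectors, payload):
        # Toy "compression": mean-pool a chunk of frame vectors into
        # one summary vector before it enters the memory bank.
        dim = len(frame_vectors[0])
        summary = [sum(v[i] for v in frame_vectors) / len(frame_vectors)
                   for i in range(dim)]
        self.entries.append((summary, payload))

    def read(self, query_vector):
        # Recall: rank stored chunks by similarity to the query and
        # return only the top-k payloads.
        ranked = sorted(self.entries,
                        key=lambda e: cosine(e[0], query_vector),
                        reverse=True)
        return [payload for _, payload in ranked[:self.top_k]]

bank = MemoryBank(top_k=1)
bank.write([[1.0, 0.0], [0.9, 0.1]], "chunk about a red car")
bank.write([[0.0, 1.0], [0.1, 0.9]], "chunk about a beach")
print(bank.read([0.0, 1.0]))  # recalls the beach-like chunk
```

Because writing is incremental and reading touches only the top-k entries, the total video length is decoupled from the per-query context, which is the property that lets the real method scale past the model's input limit.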
Problem

Research questions and friction points this paper is trying to address.

Long video understanding
Multimodal Large Language Models
Visual memory mechanism
Input length limitation
Video comprehension
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Memory Mechanism
Training-Free
Long Video Understanding
KV Cache Compression
Streaming Video
Tao Chen
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China.
Kun Zhang
Renmin University of China
simulation optimization, nested simulation, machine learning, financial engineering
Qiong Wu
Xiamen University
Computer Vision, Person Re-Identification, Vision-Language
Xiao Chen
Tsinghua University
AI
Chao Chang
National University of Defense Technology, 230000, P.R. China.
Xiaoshuai Sun
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China.
Yiyi Zhou
Xiamen University
deep learning, language and vision
Rongrong Ji
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China.