Episodic Memory Representation for Long-form Video Understanding

📅 2025-08-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Video-LLMs face a fundamental bottleneck in long-video understanding, primarily due to limited context windows and existing keyframe selection methods that neglect spatiotemporal dynamics and narrative coherence. To address this, we propose Video-EM, a training-free framework that models keyframes as temporally ordered episodic memory events, explicitly preserving scene transitions and contextual continuity. By integrating chain-of-thought reasoning from large language models, Video-EM enables context-aware narrative reconstruction and question answering. Its core innovation lies in an episodic memory representation mechanism that replaces static image matching with event sequences, augmented by temporal guidance for memory filtering to enhance information density. Evaluated on Video-MME, EgoSchema, HourVideo, and LVBench, Video-EM achieves 4-9% higher accuracy over its respective baselines while using significantly fewer keyframes.
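
To make the episodic memory representation concrete, here is a minimal Python sketch of how retrieved keyframes might be grouped into temporally ordered events and thinned with a temporal redundancy filter. It is not the authors' code: the `Keyframe`/`EpisodicEvent` structures, the gap threshold used as a proxy for scene transitions, and the cosine-similarity filter are illustrative assumptions.

```python
# Illustrative sketch only (not the paper's implementation): group retrieved
# keyframes into temporally ordered "episodic events" and drop near-duplicate
# frames so each event stays information-dense.
from dataclasses import dataclass, field

import numpy as np


@dataclass
class Keyframe:
    timestamp: float        # seconds into the video
    feature: np.ndarray     # e.g. a CLIP-style visual embedding


@dataclass
class EpisodicEvent:
    frames: list = field(default_factory=list)


def build_episodic_events(keyframes, gap_threshold=10.0):
    """Split temporally sorted keyframes into events at large time gaps,
    a simple stand-in for detecting scene transitions."""
    keyframes = sorted(keyframes, key=lambda f: f.timestamp)
    events, current = [], EpisodicEvent()
    for frame in keyframes:
        if current.frames and frame.timestamp - current.frames[-1].timestamp > gap_threshold:
            events.append(current)
            current = EpisodicEvent()
        current.frames.append(frame)
    if current.frames:
        events.append(current)
    return events


def filter_redundant(event, sim_threshold=0.92):
    """Temporal-guidance filtering: walk an event in time order and skip
    frames nearly identical to the last kept frame."""
    kept = []
    for frame in event.frames:
        if kept:
            a, b = kept[-1].feature, frame.feature
            sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
            if sim > sim_threshold:
                continue
        kept.append(frame)
    return EpisodicEvent(frames=kept)
```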

📝 Abstract
Video Large Language Models (Video-LLMs) excel at general video understanding but struggle with long-form videos due to context window limits. Consequently, recent approaches focus on keyframe retrieval, condensing lengthy videos into a small set of informative frames. Despite their practicality, these methods simplify the problem to static text-image matching, overlooking spatio-temporal relationships crucial for capturing scene transitions and contextual continuity, and may yield redundant keyframes with limited information, diluting salient cues essential for accurate video question answering. To address these limitations, we introduce Video-EM, a training-free framework inspired by the principles of human episodic memory, designed to facilitate robust and contextually grounded reasoning. Rather than treating keyframes as isolated visual entities, Video-EM explicitly models them as temporally ordered episodic events, capturing both spatial relationships and temporal dynamics necessary for accurately reconstructing the underlying narrative. Furthermore, the framework leverages chain-of-thought (CoT) thinking with LLMs to iteratively identify a minimal yet highly informative subset of episodic memories, enabling efficient and accurate question answering by Video-LLMs. Extensive evaluations on the Video-MME, EgoSchema, HourVideo, and LVBench benchmarks confirm the superiority of Video-EM, which achieves highly competitive results with performance gains of 4-9 percent over respective baselines while utilizing fewer frames.
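
One plausible way to realize the iterative chain-of-thought selection described above is a loop in which a text-only LLM reads short captions of the episodic events, reasons about what is still missing, and stops once the gathered evidence suffices to answer the question. The sketch below follows that reading; the prompt wording and the `caption_fn` and `llm` callables are assumptions for illustration, not details from the paper.

```python
# Illustrative sketch of an iterative CoT selection loop (not the paper's code).
def select_episodic_memories(question, events, caption_fn, llm, max_rounds=5):
    """
    question   : video QA question (str)
    events     : temporally ordered episodic events (e.g. from build_episodic_events)
    caption_fn : callable(event) -> short text description of the event
    llm        : callable(prompt: str) -> str, any chat-completion backend
    """
    selected, remaining = [], list(events)
    for _ in range(max_rounds):
        if not remaining:
            break
        candidates = "\n".join(f"[{i}] {caption_fn(ev)}" for i, ev in enumerate(remaining))
        evidence = "\n".join(caption_fn(ev) for ev in selected) or "(none yet)"
        prompt = (
            f"Question: {question}\n"
            f"Evidence collected so far:\n{evidence}\n"
            f"Candidate events:\n{candidates}\n"
            "Think step by step. If the evidence already answers the question, "
            "reply DONE. Otherwise reply with the index of the single most "
            "useful candidate event."
        )
        reply = llm(prompt).strip()
        if reply.upper().startswith("DONE"):
            break
        token = reply.split()[0].strip("[].") if reply else ""
        if not token.isdigit() or int(token) >= len(remaining):
            break  # fall back gracefully on an unparseable reply
        selected.append(remaining.pop(int(token)))
    return selected
```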
Problem

Research questions and friction points this paper is trying to address.

Video-LLMs struggle with long-form video understanding due to context limits
Current keyframe methods ignore spatio-temporal relationships and scene transitions
Existing approaches produce redundant keyframes, diluting salient video question answering cues
Innovation

Methods, ideas, or system contributions that make the work stand out.

Models keyframes as temporally ordered episodic events
Leverages chain-of-thought (CoT) reasoning with LLMs
Identifies a minimal yet informative episodic memory subset, which a Video-LLM then answers from (see the sketch below)
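
Once the minimal subset is fixed, the selected keyframes would be passed, in temporal order, to a Video-LLM together with the question. The snippet below is a hedged sketch of that final step; `video_llm` is a placeholder for any multimodal chat model that accepts a list of images plus a text prompt.

```python
# Illustrative sketch of the final answering step (backbone-agnostic).
def answer_with_video_llm(question, selected_frames, video_llm):
    """
    selected_frames : list of (timestamp_seconds, image) pairs drawn from the
                      minimal episodic-memory subset
    video_llm       : callable(images=..., prompt=...) -> str  (any multimodal model)
    """
    selected_frames = sorted(selected_frames, key=lambda x: x[0])  # keep narrative order
    images = [img for _, img in selected_frames]
    prompt = (
        "These images are keyframes from a long video, shown in temporal order. "
        f"Using only this visual evidence, answer the question: {question}"
    )
    return video_llm(images=images, prompt=prompt)
```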
🔎 Similar Papers
No similar papers found.
Authors

Yun Wang
City University of Hong Kong

Long Zhang
University of Science and Technology of China

Jingren Liu
PhD student, Tianjin University
Continual Learning, Long-form Video Understanding, Unified Models

Jiaqi Yan
Nanjing University

Zhanjie Zhang
Zhejiang University
Computer Vision

Jiahao Zheng
City University of Hong Kong

Xun Yang
University of Science and Technology of China

Dapeng Wu
Chongqing University of Posts and Telecommunications
Wireless Network, Social Computing

Xiangyu Chen
The Institute of Artificial Intelligence (TeleAI), China Telecom

Xuelong Li
The Institute of Artificial Intelligence (TeleAI), China Telecom