VideoLucy: Deep Memory Backtracking for Long Video Understanding

📅 2025-10-14
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Existing LLM-based agents struggle with long-video understanding due to insufficient temporal context modeling and critical information loss from sparse frame sampling. To address these challenges, we propose VideoLucy, a novel framework featuring a hierarchical memory architecture that enables coarse-to-fine temporal reasoning over extended video sequences. It incorporates an agent-driven iterative backtracking mechanism that emulates human recall to dynamically capture inter-frame dynamics and salient details. Furthermore, we introduce a question-guided memory retrieval and aggregation strategy to enhance task-relevant contextual grounding. We also release EgoMem, a new benchmark designed for evaluating ultra-long first-person videos. Extensive experiments across multiple long-video understanding benchmarks demonstrate that VideoLucy significantly outperforms state-of-the-art methods, including proprietary models such as GPT-4o, validating the efficacy, efficiency, and scalability of open-source architectures for long-horizon video understanding.

📝 Abstract
Recent studies have shown that agent-based systems leveraging large language models (LLMs) for key information retrieval and integration have emerged as a promising approach for long video understanding. However, these systems face two major challenges. First, they typically perform modeling and reasoning on individual frames, struggling to capture the temporal context of consecutive frames. Second, to reduce the cost of dense frame-level captioning, they adopt sparse frame sampling, which risks discarding crucial information. To overcome these limitations, we propose VideoLucy, a deep memory backtracking framework for long video understanding. Inspired by the human recollection process from coarse to fine, VideoLucy employs a hierarchical memory structure with progressive granularity. This structure explicitly defines the detail level and temporal scope of memory at different hierarchical depths. Through an agent-based iterative backtracking mechanism, VideoLucy systematically mines video-wide, question-relevant deep memories until sufficient information is gathered to provide a confident answer. This design enables effective temporal understanding of consecutive frames while preserving critical details. In addition, we introduce EgoMem, a new benchmark for long video understanding. EgoMem is designed to comprehensively evaluate a model's ability to understand complex events that unfold over time and capture fine-grained details in extremely long videos. Extensive experiments demonstrate the superiority of VideoLucy. Built on open-source models, VideoLucy significantly outperforms state-of-the-art methods on multiple long video understanding benchmarks, achieving performance even surpassing the latest proprietary models such as GPT-4o. Our code and dataset will be made publicly available at https://videolucy.github.io
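
The hierarchical memory the abstract describes can be pictured as a stack of levels, each pairing a temporal scope with a caption detail level. Below is a minimal Python sketch of one way such a structure might be represented; the level names, the concrete spans, and the lazy-caption design are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryEntry:
    """One memory unit: a time span whose caption is filled in lazily,
    only when the backtracking agent actually visits it."""
    start_s: float
    end_s: float
    caption: str | None = None

@dataclass
class MemoryLevel:
    """One depth of the hierarchy. Coarser levels cover long spans with
    brief summaries; deeper levels cover short spans in fine detail."""
    name: str
    span_s: float  # temporal scope of a single entry at this depth
    entries: list[MemoryEntry] = field(default_factory=list)

def build_hierarchy(duration_s: float) -> list[MemoryLevel]:
    # Three assumed levels with progressively finer granularity; the
    # paper defines a detail level and temporal scope per depth, but
    # the spans used here are made up for illustration.
    levels = [
        MemoryLevel("coarse", span_s=60.0),  # event-level summaries
        MemoryLevel("medium", span_s=10.0),  # action-level captions
        MemoryLevel("fine",   span_s=1.0),   # near frame-level detail
    ]
    for level in levels:
        t = 0.0
        while t < duration_s:
            level.entries.append(
                MemoryEntry(t, min(t + level.span_s, duration_s)))
            t += level.span_s
    return levels
```

Keeping captions lazy is what would let such a system sidestep dense frame-level captioning: fine-grained entries are only described when the agent backtracks into them.
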
Problem

Research questions and friction points this paper is trying to address.

Capturing temporal context in consecutive video frames
Preventing critical information loss from sparse frame sampling
Understanding complex events in extremely long videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical memory structure with progressive granularity
Agent-based iterative backtracking mechanism for deep memory (see the sketch after this list)
Systematic mining of video-wide question-relevant information
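
Taken together, these three innovations suggest a backtracking loop like the sketch below, which reuses the MemoryLevel hierarchy from the earlier snippet. The callables caption_fn, relevance_fn, and llm_fn stand in for an MLLM captioner, a question-guided retrieval scorer, and the reasoning LLM; these names and the round-based control flow are assumptions for illustration, not the authors' code.

```python
def answer_question(question: str, levels: list[MemoryLevel],
                    caption_fn, relevance_fn, llm_fn,
                    max_rounds: int = 5) -> str:
    """Iteratively backtrack from coarse to fine memory until the LLM
    reports it has enough evidence to answer confidently."""
    evidence: list[str] = []
    reply = "unable to answer"
    for depth in range(max_rounds):
        level = levels[min(depth, len(levels) - 1)]
        # Only question-relevant spans at the current depth get
        # captioned, so the whole video is never densely described.
        for entry in relevance_fn(question, level.entries):
            if entry.caption is None:
                entry.caption = caption_fn(
                    entry.start_s, entry.end_s, level.name)
            evidence.append(entry.caption)
        confident, reply = llm_fn(question, evidence)
        if confident:
            break  # sufficient information gathered; stop mining
        # otherwise descend: recall the same events at finer granularity
    return reply
```

The loop mirrors the coarse-to-fine recollection described in the abstract: each unconfident round pushes retrieval one level deeper, mining more detail only where the question demands it.
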
👥 Authors

Jialong Zuo
Zhejiang University
Speech Synthesis · Voice Conversion

Yongtai Deng
National Key Laboratory of Multispectral Information Intelligent Processing Technology, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology

Lingdong Kong
National University of Singapore
Computer Vision · Deep Learning

Jingkang Yang
PhD, MMLab@NTU
Visual Perception · Visual Reasoning · Multimodality · Open World

Rui Jin
National Key Laboratory of Multispectral Information Intelligent Processing Technology, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology

Yiwei Zhang
National Key Laboratory of Multispectral Information Intelligent Processing Technology, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology

Nong Sang
Huazhong University of Science and Technology
Computer Vision and Pattern Recognition

Liang Pan
Shanghai AI Lab

Ziwei Liu
Associate Professor, Nanyang Technological University
Computer Vision · Machine Learning · Computer Graphics

Changxin Gao
National Key Laboratory of Multispectral Information Intelligent Processing Technology, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology