Towards Multimodal Lifelong Understanding: A Dataset and Agentic Baseline

📅 2026-03-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video understanding datasets predominantly rely on densely edited, scripted clips that fail to capture the complexity of real-life scenarios, and current models struggle with long-term multimodal sequences due to a Working Memory Bottleneck and Global Localization Collapse. To bridge this gap, the authors introduce MM-Lifelong, a 181.1-hour dataset of authentic daily-life videos spanning temporal scales from days to months, and propose the Recursive Multimodal Agent (ReMA), an architecture that enables long-term multimodal understanding through dynamic memory management and recursive belief-state modeling. ReMA substantially outperforms existing approaches in mitigating both failure modes. The dataset ships with carefully designed splits that support both supervised learning and out-of-distribution generalization, establishing a benchmark and agentic baseline for long-term multimodal understanding in real-world settings.

📝 Abstract
While datasets for video understanding have scaled to hour-long durations, they typically consist of densely concatenated clips that differ from natural, unscripted daily life. To bridge this gap, we introduce MM-Lifelong, a dataset designed for Multimodal Lifelong Understanding. Comprising 181.1 hours of footage, it is structured across Day, Week, and Month scales to capture varying temporal densities. Extensive evaluations reveal two critical failure modes in current paradigms: end-to-end MLLMs suffer from a Working Memory Bottleneck due to context saturation, while representative agentic baselines experience Global Localization Collapse when navigating sparse, month-long timelines. To address this, we propose the Recursive Multimodal Agent (ReMA), which employs dynamic memory management to iteratively update a recursive belief state, significantly outperforming existing methods. Finally, we establish dataset splits designed to isolate temporal and domain biases, providing a rigorous foundation for future research in supervised learning and out-of-distribution generalization.
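The abstract's core mechanism, iteratively updating a recursive belief state under dynamic memory management, can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: `BeliefState`, `summarize`, and the eviction policy are stand-ins for whatever model calls and memory heuristics ReMA actually uses.

```python
from dataclasses import dataclass, field

@dataclass
class BeliefState:
    """Compressed account of everything observed so far (hypothetical)."""
    summary: str = ""
    evidence: list = field(default_factory=list)

def summarize(belief: BeliefState, chunk: str) -> str:
    # Stand-in for an MLLM call that fuses the prior belief with new evidence.
    return (belief.summary + " | " + chunk).strip(" |")

def recursive_update(video_chunks, memory_budget=3) -> BeliefState:
    """Walk a long timeline while keeping only a bounded working memory."""
    belief = BeliefState()
    for chunk in video_chunks:
        # Recursive belief update: the new summary is a function of the
        # old summary plus the incoming chunk, so long-horizon information
        # survives even after raw evidence is evicted.
        belief.summary = summarize(belief, chunk)
        belief.evidence.append(chunk)
        # Dynamic memory management: evict the oldest raw evidence once the
        # budget is exceeded, avoiding the context-saturation bottleneck.
        if len(belief.evidence) > memory_budget:
            belief.evidence.pop(0)
    return belief

state = recursive_update([f"day-{i}" for i in range(10)], memory_budget=3)
print(len(state.evidence))  # bounded working memory: 3
```

The point of the sketch is the invariant: raw evidence stays bounded regardless of timeline length, while the recursive summary retains a trace of every chunk ever seen.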
Problem

Research questions and friction points this paper is trying to address.

Multimodal Lifelong Understanding
video understanding
long-term temporal modeling
working memory bottleneck
global localization collapse
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Lifelong Understanding
Recursive Multimodal Agent
Working Memory Bottleneck
Global Localization Collapse
Dynamic Memory Management