Towards Multimodal Lifelong Understanding: A Dataset and Agentic Baseline

📅 2026-03-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video understanding datasets predominantly rely on densely edited, scripted clips that fail to capture the complexity of real-life scenarios, and current models struggle with long-term multimodal sequences due to a Working Memory Bottleneck and Global Localization Collapse. To bridge this gap, the authors introduce MM-Lifelong, a 181.1-hour dataset of authentic daily-life videos spanning temporal scales from days to months, and propose the Recursive Multimodal Agent (ReMA), an architecture that enables long-term multimodal understanding through dynamic memory management and recursive belief-state modeling. ReMA substantially outperforms existing approaches in mitigating both failure modes. The dataset ships with carefully designed splits that support both supervised learning and out-of-distribution generalization, establishing a benchmark and agentic baseline for long-term multimodal understanding in real-world settings.

📝 Abstract
While datasets for video understanding have scaled to hour-long durations, they typically consist of densely concatenated clips that differ from natural, unscripted daily life. To bridge this gap, we introduce MM-Lifelong, a dataset designed for Multimodal Lifelong Understanding. Comprising 181.1 hours of footage, it is structured across Day, Week, and Month scales to capture varying temporal densities. Extensive evaluations reveal two critical failure modes in current paradigms: end-to-end MLLMs suffer from a Working Memory Bottleneck due to context saturation, while representative agentic baselines experience Global Localization Collapse when navigating sparse, month-long timelines. To address this, we propose the Recursive Multimodal Agent (ReMA), which employs dynamic memory management to iteratively update a recursive belief state, significantly outperforming existing methods. Finally, we establish dataset splits designed to isolate temporal and domain biases, providing a rigorous foundation for future research in supervised learning and out-of-distribution generalization.
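The abstract's core mechanism, iteratively updating a recursive belief state under dynamic memory management, can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: `BeliefState`, `summarize`, and the eviction policy are stand-ins for whatever model calls and memory heuristics ReMA actually uses.

```python
from dataclasses import dataclass, field

@dataclass
class BeliefState:
    """Compressed account of everything observed so far (hypothetical)."""
    summary: str = ""
    evidence: list = field(default_factory=list)

def summarize(belief: BeliefState, chunk: str) -> str:
    # Stand-in for an MLLM call that fuses the prior belief with new evidence.
    return (belief.summary + " | " + chunk).strip(" |")

def recursive_update(video_chunks, memory_budget=3) -> BeliefState:
    """Walk a long timeline while keeping only a bounded working memory."""
    belief = BeliefState()
    for chunk in video_chunks:
        # Recursive belief update: the new summary is a function of the
        # old summary plus the incoming chunk, so long-horizon information
        # survives even after raw evidence is evicted.
        belief.summary = summarize(belief, chunk)
        belief.evidence.append(chunk)
        # Dynamic memory management: evict the oldest raw evidence once the
        # budget is exceeded, avoiding the context-saturation bottleneck.
        if len(belief.evidence) > memory_budget:
            belief.evidence.pop(0)
    return belief

state = recursive_update([f"day-{i}" for i in range(10)], memory_budget=3)
print(len(state.evidence))  # bounded working memory: 3
```

The point of the sketch is the invariant: raw evidence stays bounded regardless of timeline length, while the recursive summary retains a trace of every chunk ever seen.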
Problem

Research questions and friction points this paper is trying to address.

Multimodal Lifelong Understanding
video understanding
long-term temporal modeling
working memory bottleneck
global localization collapse
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Lifelong Understanding
Recursive Multimodal Agent
Working Memory Bottleneck
Global Localization Collapse
Dynamic Memory Management