EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding

📅 2026-05-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing week-long egocentric video benchmarks primarily target perception and recognition and do not evaluate memory-driven reasoning across days. This work introduces EgoMemReason, a benchmark designed for memory-driven reasoning in such videos, built around a structured question-answering framework that covers three memory types: entity, event, and behavior. Its 500 questions each link to an average of 5.1 video segments of evidence and require an average of 25.9 hours of memory backtracking, emphasizing cross-day integration and long-range temporal reasoning. The benchmark supports evaluation of both multimodal large language models (MLLMs) and agentic frameworks. Across 17 evaluated methods, even the best-performing model reaches only 39.6% overall accuracy, and performance degrades as the evidence spans longer temporal horizons, underscoring long-horizon memory reasoning as an open challenge.

📝 Abstract

Next-generation visual assistants, such as smart glasses, embodied agents, and always-on life-logging systems, must reason over an entire day or more of continuous visual experience. In ultra-long video settings, relevant information is sparsely distributed across hours or days, making memory a fundamental challenge: models must accumulate information over time, recall prior states, track temporal order, and abstract recurring patterns. However, existing week-long video benchmarks are primarily designed for perception and recognition, such as moment localization or global summarization, rather than reasoning that requires integrating evidence across multiple days. To address this gap, we introduce EgoMemReason, a comprehensive benchmark that systematically evaluates week-long egocentric video understanding through memory-driven reasoning. EgoMemReason evaluates three complementary memory types: entity memory, tracking how object states evolve across days; event memory, recalling and ordering activities separated by hours or days; and behavior memory, abstracting recurring patterns from sparse, repeated observations over the whole week. EgoMemReason comprises 500 questions across three memory types and six core challenges, with an average of 5.1 video segments of evidence per question and 25.9 hours of memory backtracking. We evaluate 17 methods, spanning MLLMs and agentic frameworks, on EgoMemReason and find that even the best model achieves only 39.6% overall accuracy. Further analysis shows that the three memory types fail for distinct reasons and that performance degrades as evidence spans longer temporal horizons, indicating that long-horizon memory remains far from solved. We believe EgoMemReason establishes a strong foundation for evaluating and advancing long-context, memory-aware multimodal systems.
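
The statistics quoted in the abstract (overall accuracy, average evidence segments per question, average memory backtracking) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `QuestionResult` schema, its field names, the definition of backtracking as the gap between the query time and the earliest evidence segment, and the 24-hour threshold separating "short" from "long" spans. None of this is taken from the paper's released code or data format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class QuestionResult:
    """One benchmark question plus a model's outcome (hypothetical schema)."""
    memory_type: str             # "entity", "event", or "behavior"
    evidence_hours: list[float]  # start time of each evidence segment, in hours
    query_hour: float            # time at which the question is posed, in hours
    correct: bool                # whether the model answered correctly

    @property
    def backtracking_hours(self) -> float:
        # Assumed definition: gap between the query and the earliest evidence.
        return self.query_hour - min(self.evidence_hours)


def accuracy(results):
    """Fraction of correct answers; 0.0 for an empty group."""
    return sum(r.correct for r in results) / len(results) if results else 0.0


def summarize(results, long_span_hours=24.0):
    """Aggregate statistics in the style the abstract reports."""
    if not results:
        raise ValueError("results must contain at least one question")
    by_type = defaultdict(list)
    by_span = defaultdict(list)
    for r in results:
        by_type[r.memory_type].append(r)
        # Bucket by how far back the model must recall (threshold is illustrative).
        span = "long" if r.backtracking_hours >= long_span_hours else "short"
        by_span[span].append(r)
    n = len(results)
    return {
        "overall_accuracy": accuracy(results),
        "accuracy_by_type": {t: accuracy(rs) for t, rs in by_type.items()},
        "accuracy_by_span": {s: accuracy(rs) for s, rs in by_span.items()},
        "avg_evidence_segments": sum(len(r.evidence_hours) for r in results) / n,
        "avg_backtracking_hours": sum(r.backtracking_hours for r in results) / n,
    }
```

Calling `summarize` on a list of `QuestionResult` objects returns a dictionary whose `accuracy_by_type` and `accuracy_by_span` entries correspond to the two kinds of breakdown the abstract describes: failures per memory type and degradation over longer temporal horizons.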

Problem

Research questions and friction points this paper is trying to address.

long-horizon reasoning

egocentric video understanding

memory-driven reasoning

temporal memory

multimodal benchmark

Innovation

Methods, ideas, or system contributions that make the work stand out.

memory-driven reasoning

egocentric video understanding

long-horizon memory