EgoExoMem: Cross-View Memory Reasoning over Synchronized Egocentric and Exocentric Videos

📅 2026-05-18
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the limits of egocentric-only memory for spatial-temporal reasoning that requires cross-view understanding. It introduces EgoExoMem, described as the first benchmark for cross-view memory reasoning over synchronized ego-exo videos, with 2.6K high-quality multiple-choice questions across eight temporal, spatial, and cross-view QA types. To support dual-view retrieval, the authors propose E²-Select, a training-free frame selection method that combines relevance-based budget allocation with per-view k-DPP sampling to handle view asymmetry and cross-view temporal consistency. Experiments show that ego and exo views provide complementary memory cues and that existing multimodal large language models (MLLMs) remain far from solving the benchmark, with the best model reaching only 55.3%. E²-Select reaches 58.2%, the best result among the frame-selection and RAG-based memory baselines it is compared against. Further analysis reveals systematic view-preference conflicts between question framing and answer grounding.
šŸ“ Abstract
Egocentric memory is widely used in embodied intelligence, but it may be insufficient for comprehensive spatial-temporal reasoning. Inspired by human recall from both field and observer perspectives, we introduce EgoExoMem, the first benchmark for cross-view memory reasoning over synchronized egocentric and exocentric videos. EgoExoMem contains $2.6K$ high-quality MCQs across eight temporal, spatial, and cross-view QA types. To support dual-view retrieval, we propose E$^2$-Select, a training-free frame selection method for synchronized ego-exo videos. It combines relevance-based budget allocation with per-view k-DPP sampling to handle view asymmetry and cross-view temporal consistency. Experiments show that ego and exo views provide complementary memory cues, while existing MLLMs remain far from solving the benchmark: the best model reaches only $55.3\%$. E$^2$-Select achieves state-of-the-art performance of $58.2\%$ over frame-selection and RAG-based memory baselines. Further analysis reveals systematic view-preference conflicts between question framing and answer grounding, underscoring the novelty and challenge of cross-view memory reasoning.
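The abstract does not spell out how the relevance-based budget allocation works, so the following is only a minimal sketch of one plausible reading: a fixed frame budget is split between the two views in proportion to a softmax over each view's mean question-frame relevance. The function name `allocate_budget`, the `temperature` and `min_per_view` parameters, and the softmax rule itself are assumptions made for illustration, not the authors' formulation.

```python
import math


def allocate_budget(ego_scores, exo_scores, total_budget,
                    temperature=1.0, min_per_view=1):
    """Split a frame budget between ego and exo views (illustrative only).

    ego_scores / exo_scores: per-frame relevance scores for one question.
    Returns (ego_budget, exo_budget), which always sum to total_budget.
    The softmax-over-means rule is an assumption, not the paper's method.
    """
    if total_budget < 2 * min_per_view:
        raise ValueError("total_budget is too small for min_per_view")

    ego_mean = sum(ego_scores) / len(ego_scores)
    exo_mean = sum(exo_scores) / len(exo_scores)

    # Two-way softmax; subtracting the max keeps exp() numerically stable.
    top = max(ego_mean, exo_mean)
    w_ego = math.exp((ego_mean - top) / temperature)
    w_exo = math.exp((exo_mean - top) / temperature)
    ego_share = w_ego / (w_ego + w_exo)

    # Round, then clamp so that neither view is starved completely.
    ego_budget = round(total_budget * ego_share)
    ego_budget = max(min_per_view, min(total_budget - min_per_view, ego_budget))
    return ego_budget, total_budget - ego_budget


if __name__ == "__main__":
    # The ego view looks more relevant here, so it receives most of the budget.
    print(allocate_budget([0.9, 0.8], [0.1, 0.2], total_budget=8, temperature=0.1))
```

Keeping a floor of `min_per_view` frames is one simple way to respect view asymmetry without discarding the weaker view entirely, which matters if the answer is grounded in a different view than the one the question is framed in.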
Problem

Research questions and friction points this paper is trying to address.

egocentric
exocentric
memory reasoning
cross-view
spatial-temporal reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-view memory reasoning
synchronized egocentric-exocentric videos
training-free frame selection
k-DPP sampling
view asymmetry
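
To make the k-DPP keyword above more concrete, here is a small sketch of relevance-weighted diverse frame selection within a single view. It builds a quality-diversity kernel L[i][j] = q_i * sim(i, j) * q_j and greedily picks the frame that most increases the determinant of the selected submatrix. This is a greedy MAP approximation, swapped in for simplicity; it is not exact k-DPP sampling and is not claimed to be the procedure used by E²-Select. The names `greedy_dpp_select`, `cosine`, and `det` are hypothetical helpers, and feature vectors are assumed to be nonzero.

```python
import math


def det(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    n = len(matrix)
    a = [row[:] for row in matrix]
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0  # numerically singular
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return result


def cosine(u, v):
    """Cosine similarity of two nonzero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def greedy_dpp_select(features, relevance, k):
    """Pick up to k frame indices that are relevant and mutually diverse."""
    n = len(features)
    kernel = [[relevance[i] * cosine(features[i], features[j]) * relevance[j]
               for j in range(n)] for i in range(n)]
    selected = []
    for _ in range(min(k, n)):
        best, best_det = None, 0.0
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            sub = [[kernel[r][c] for c in idx] for r in idx]
            d = det(sub)
            if d > best_det:
                best, best_det = i, d
        if best is None:  # every remaining frame is redundant
            break
        selected.append(best)
    return sorted(selected)  # restore temporal order


if __name__ == "__main__":
    feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # frames 0 and 1 are duplicates
    print(greedy_dpp_select(feats, relevance=[1.0, 0.9, 0.5], k=2))
```

In a dual-view setting, a routine like this would be run once per view with that view's share of the frame budget. The determinant penalizes near-duplicate frames, so a highly relevant but redundant frame can lose to a less relevant but novel one.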