Spatial Cognition from Egocentric Video: Out of Sight, Not Out of Mind

📅 2024-04-07
🏛️ arXiv.org
📈 Citations: 4
Influential: 0
🤖 AI Summary
This paper addresses the spatial cognition challenge of maintaining long-term 3D localization of objects that temporarily leave the field of view in first-person video. To this end, the authors propose the Lift, Match, and Keep (LMK) framework, a systematic approach to modeling "out of sight but not forgotten" spatiotemporal memory. LMK lifts 2D visual observations into metric 3D space via monocular depth estimation, performs cross-frame matching based on appearance, 3D position, and interaction cues, and enforces trajectory persistence for active objects even during prolonged occlusion or out-of-frame periods. Evaluated on 100 long videos from EPIC-KITCHENS, LMK achieves 57% 3D localization accuracy after 120 seconds, substantially surpassing prior methods (33% for a recent egocentric 3D tracker and 17% for a general 2D tracker). This work enables robust, memory-augmented 3D object tracking in dynamic, egocentric environments.
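The "Lift" step described above amounts to standard pinhole back-projection: combine a detection's pixel location with an estimated metric depth and a camera pose to place the object in a fixed world frame. The following is a minimal sketch of that geometry, not the paper's implementation; the intrinsics matrix `K` and pose `T_world_cam` are assumed inputs (e.g. from calibration and SLAM/visual odometry), and the function name is illustrative.

```python
import numpy as np

def lift_to_3d(u, v, depth, K, T_world_cam):
    """Back-project a pixel (u, v) with estimated depth into world coordinates.

    u, v        : pixel coordinates of the object detection centre
    depth       : metric depth at (u, v), e.g. from a monocular depth estimator
    K           : 3x3 camera intrinsics matrix
    T_world_cam : 4x4 camera-to-world pose (e.g. from SLAM / visual odometry)
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # Pinhole back-projection into the camera frame (homogeneous point).
    p_cam = np.array([(u - cx) * depth / fx,
                      (v - cy) * depth / fy,
                      depth,
                      1.0])
    # Transform into a fixed world frame so the 3D location is stable
    # under camera motion -- the property the "Keep" step relies on.
    return (T_world_cam @ p_cam)[:3]
```

Expressing locations in a world frame rather than the moving camera frame is what makes it meaningful to retain an object's position after it leaves the field of view.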

📝 Abstract
As humans move around, performing their daily tasks, they are able to recall where they have positioned objects in their environment, even if these objects are currently out of their sight. In this paper, we aim to mimic this spatial cognition ability. We thus formulate the task of Out of Sight, Not Out of Mind - 3D tracking active objects using observations captured through an egocentric camera. We introduce a simple but effective approach to address this challenging problem, called Lift, Match, and Keep (LMK). LMK lifts partial 2D observations to 3D world coordinates, matches them over time using visual appearance, 3D location and interactions to form object tracks, and keeps these object tracks even when they go out-of-view of the camera. We benchmark LMK on 100 long videos from EPIC-KITCHENS. Our results demonstrate that spatial cognition is critical for correctly locating objects over short and long time scales. E.g., for one long egocentric video, we estimate the 3D location of 50 active objects. After 120 seconds, 57% of the objects are correctly localised by LMK, compared to just 33% by a recent 3D method for egocentric videos and 17% by a general 2D tracking method.
Problem

Research questions and friction points this paper is trying to address.

Mimic human spatial cognition from egocentric video
Track 3D locations of objects when out of sight
Maintain object trajectories during occlusion or out-of-view periods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lift partial 2D observations to 3D coordinates
Match objects using appearance, location and interactions
Maintain object tracks even when out-of-view
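The "Match" and "Keep" ideas listed above can be sketched as a toy tracker: each detection is associated to an existing track by a cost that mixes appearance and 3D-position distance, and tracks that go unmatched are retained at their last known world position rather than deleted. This is a hedged illustration only; the class names, the greedy association, the weights `w_app`/`w_pos`, and the threshold are all assumptions, and the paper additionally uses interaction cues not modeled here.

```python
import numpy as np

class Track:
    """One object track: last appearance feature, last 3D world position."""
    _next_id = 0

    def __init__(self, feat, pos):
        self.tid = Track._next_id
        Track._next_id += 1
        self.feat, self.pos = feat, pos
        self.visible = True

def match_cost(track, feat, pos, w_app=0.5, w_pos=0.5):
    # Appearance term: cosine distance between feature embeddings.
    app = 1.0 - feat @ track.feat / (np.linalg.norm(feat) * np.linalg.norm(track.feat))
    # Geometry term: Euclidean distance between 3D world positions (metres).
    geo = np.linalg.norm(pos - track.pos)
    return w_app * app + w_pos * geo

def update(tracks, detections, thresh=1.0):
    """One frame of greedy track-detection association.

    Unmatched tracks are kept alive at their last known 3D position
    (the "Keep" step) instead of being deleted, so objects that leave
    the field of view can still be localised later.
    """
    matched = set()
    for feat, pos in detections:
        candidates = [(match_cost(t, feat, pos), t)
                      for t in tracks if t.tid not in matched]
        if candidates:
            cost, best = min(candidates, key=lambda c: c[0])
            if cost < thresh:
                best.feat, best.pos, best.visible = feat, pos, True
                matched.add(best.tid)
                continue
        new = Track(feat, pos)          # unmatched detection starts a new track
        tracks.append(new)
        matched.add(new.tid)
    for t in tracks:
        if t.tid not in matched:
            t.visible = False           # out of sight, but position is retained
    return tracks
```

The key design choice, per the paper's framing, is that going out of view flips a visibility flag instead of terminating the track, so the last world-frame position remains queryable over long horizons.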