Online Episodic Memory Visual Query Localization with Egocentric Streaming Object Memory

📅 2024-11-25
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing episodic memory retrieval methods rely on offline access to the full video history, which is incompatible with the power and storage constraints of wearable devices. To address this, the paper introduces the Online Episodic Memory Visual Queries Localization (OEM-VQL) task for egocentric video streams: models observe each frame only once and must rely on past computations to answer user queries (e.g., "Where did I last see my smartphone?"). The authors propose ESOM (Egocentric Streaming Object Memory), a lightweight framework combining object discovery, online object tracking, and a queryable spatio-temporal object memory. On OEM-VQL, ESOM outperforms offline methods (81.92% vs. 55.89% success rate) when oracle object discovery and tracking are assumed. The analysis also yields a principled benchmark, based on the OEM-VQL downstream task, for assessing object detection and tracking in egocentric vision.

📝 Abstract
Episodic memory retrieval aims to enable wearable devices with the ability to recollect from past video observations objects or events that have been observed (e.g., "where did I last see my smartphone?"). Despite the clear relevance of the task for a wide range of assistive systems, current task formulations are based on the "offline" assumption that the full video history can be accessed when the user makes a query, which is unrealistic in real settings, where wearable devices are limited in power and storage capacity. We introduce the novel task of Online Episodic Memory Visual Queries Localization (OEM-VQL), in which models are required to work in an online fashion, observing video frames only once and relying on past computations to answer user queries. To tackle this challenging task, we propose ESOM - Egocentric Streaming Object Memory, a novel framework based on an object discovery module to detect potentially interesting objects, a visual object tracker to track their position through the video in an online fashion, and a memory module to store spatio-temporal object coordinates and image representations, which can be queried efficiently at any moment. Comparisons with different baselines and offline methods show that OEM-VQL is challenging and ESOM is a viable approach to tackle the task, with results outperforming offline methods (81.92 vs 55.89 success rate %) when oracular object discovery and tracking are considered. Our analysis also sheds light on the limited performance of object detection and tracking in egocentric vision, providing a principled benchmark based on the OEM-VQL downstream task to assess progress in these areas.
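The abstract describes a memory module that stores spatio-temporal object coordinates and image representations, updated online as frames are seen once and queried efficiently at query time. The sketch below is a minimal, hypothetical Python illustration of such a streaming object memory (not the paper's implementation): detections are matched to existing tracks by cosine similarity of their embeddings, each track keeps only its last-seen frame index and bounding box, and a visual query returns the last-seen location of the most similar stored object without revisiting past frames. The class name, threshold, and matching rule are all assumptions for illustration.

```python
import numpy as np

class StreamingObjectMemory:
    """Hypothetical sketch of a streaming object memory: one visual
    embedding per object track, plus the (frame index, bounding box)
    where that object was last observed."""

    def __init__(self):
        self.embeddings = []  # one L2-normalized feature vector per track
        self.last_seen = []   # (frame_idx, (x, y, w, h)) per track

    def update(self, frame_idx, box, embedding, match_thresh=0.8):
        """Online update: match the detection to an existing track by
        cosine similarity; otherwise start a new track. Each frame is
        processed exactly once, so only summaries are retained."""
        e = np.asarray(embedding, dtype=float)
        e = e / np.linalg.norm(e)
        for i, m in enumerate(self.embeddings):
            if float(m @ e) >= match_thresh:
                self.last_seen[i] = (frame_idx, box)  # refresh position
                # running average of the track's appearance, re-normalized
                self.embeddings[i] = (m + e) / np.linalg.norm(m + e)
                return i
        self.embeddings.append(e)
        self.last_seen.append((frame_idx, box))
        return len(self.embeddings) - 1

    def query(self, embedding):
        """Answer a visual query from memory alone: return the last-seen
        (frame, box) of the most similar stored object and its score."""
        e = np.asarray(embedding, dtype=float)
        e = e / np.linalg.norm(e)
        sims = [float(m @ e) for m in self.embeddings]
        best = int(np.argmax(sims))
        return self.last_seen[best], sims[best]
```

Because memory grows with the number of discovered tracks rather than the number of frames, storage stays compact over long egocentric streams, which is the core constraint the online setting imposes.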
Problem

Research questions and friction points this paper is trying to address.

Enable real-time object localization in wearable camera videos
Overcome limitations of offline episodic memory retrieval systems
Improve efficiency with compact memory and online processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Online video processing with single frame observation
Compact memory for object localization retrieval
Integration of object discovery and tracking modules