🤖 AI Summary
This work addresses the challenges posed by long-horizon first-person videos, where viewpoint drift and the absence of persistent geometric context severely hinder visual navigation and spatial reasoning. To enhance the spatial reasoning capabilities of off-the-shelf vision-language models without altering their architecture or inference pipeline, the authors propose an input-level inductive bias that explicitly fuses depth maps with RGB frames as a spatial signal. To support evaluation of navigation-oriented spatial queries, they introduce Sanpo-D, a fine-grained re-annotation of the Google Sanpo dataset. Experimental results show that this approach improves performance on safety-critical tasks such as pedestrian and obstruction detection, while also revealing a trade-off between general-purpose accuracy and spatial specialization in vision-language models.
📝 Abstract
Long-horizon egocentric video presents significant challenges for visual navigation due to viewpoint drift and the absence of persistent geometric context. Although recent vision-language models (VLMs) perform well on image and short-video reasoning, their spatial reasoning capability over long egocentric sequences remains limited. In this work, we study how explicit spatial signals influence VLM-based video understanding without modifying model architectures or inference procedures. We introduce Sanpo-D, a fine-grained re-annotation of the Google Sanpo dataset, and benchmark multiple VLMs on navigation-oriented spatial queries. To examine input-level inductive bias, we further fuse depth maps with RGB frames and evaluate the impact of these fused inputs on spatial reasoning. Our results reveal a trade-off between general-purpose accuracy and spatial specialization, showing that depth-aware and spatially grounded representations can improve performance on safety-critical tasks such as pedestrian and obstruction detection.
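
Neither the summary nor the abstract specifies how the depth maps are fused with the RGB frames. Below is a minimal sketch of one plausible input-level fusion, assuming a per-pixel alpha-blend of each frame with a colorized depth map; the function name `fuse_rgb_depth`, the red-to-blue colormap, and the blend weight `alpha` are illustrative choices, not details from the paper.

```python
import numpy as np

def fuse_rgb_depth(rgb: np.ndarray, depth: np.ndarray, alpha: float = 0.6) -> np.ndarray:
    """Overlay a colorized depth map on an RGB frame.

    rgb:   (H, W, 3) uint8 frame.
    depth: (H, W) per-pixel depth on any monotone scale (e.g. meters).
    alpha: weight kept for the original RGB appearance.
    """
    # Normalize depth to [0, 1]; guard against a constant depth map.
    d = depth.astype(np.float32)
    d = (d - d.min()) / max(float(d.max() - d.min()), 1e-6)

    # Simple red-to-blue colormap: near surfaces red, far surfaces blue.
    near = 1.0 - d
    depth_rgb = np.stack([near, np.zeros_like(d), d], axis=-1) * 255.0

    # Alpha-blend so the model still sees the scene's appearance,
    # now overlaid with an explicit geometric cue.
    fused = alpha * rgb.astype(np.float32) + (1.0 - alpha) * depth_rgb
    return fused.clip(0, 255).astype(np.uint8)

# Example: fuse a synthetic frame with a synthetic depth map.
if __name__ == "__main__":
    frame = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
    depth = np.linspace(0.5, 20.0, 480 * 640, dtype=np.float32).reshape(480, 640)
    fused = fuse_rgb_depth(frame, depth)
    print(fused.shape, fused.dtype)  # (480, 640, 3) uint8
```

The fused frames would simply replace the raw RGB frames fed to the VLM, which is consistent with the abstract's stated constraint of leaving model architectures and inference procedures unmodified.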