🤖 AI Summary
This work addresses the challenges posed by long-horizon first-person videos, where viewpoint drift and the absence of persistent geometric context severely hinder visual navigation and spatial reasoning. To enhance the spatial reasoning capabilities of off-the-shelf vision-language models without altering their architecture or inference pipeline, the authors propose an input-level inductive bias that explicitly fuses depth maps with RGB frames as a spatial signal. To support evaluation of navigation-oriented spatial queries, they introduce Sanpo-D, a fine-grained re-annotation of the Google Sanpo dataset. Experimental results show that this approach improves performance on safety-critical tasks such as pedestrian and obstruction detection, while also revealing a trade-off between general-purpose accuracy and spatial specialization in vision-language models.
📝 Abstract
Long-horizon egocentric video presents significant challenges for visual navigation due to viewpoint drift and the absence of persistent geometric context. Although recent vision-language models (VLMs) perform well on image and short-video reasoning, their spatial reasoning capability over long egocentric sequences remains limited. In this work, we study how explicit spatial signals influence VLM-based video understanding without modifying model architectures or inference procedures. We introduce Sanpo-D, a fine-grained re-annotation of the Google Sanpo dataset, and benchmark multiple VLMs on navigation-oriented spatial queries. To examine input-level inductive bias, we further fuse depth maps with RGB frames and evaluate the impact of these fused inputs on spatial reasoning. Our results reveal a trade-off between general-purpose accuracy and spatial specialization, showing that depth-aware and spatially grounded representations can improve performance on safety-critical tasks such as pedestrian and obstruction detection.
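
Neither the summary nor the abstract specifies how the depth maps are fused with the RGB frames. Below is a minimal sketch of one plausible input-level fusion, assuming a per-pixel alpha-blend of each frame with a colorized depth map; the function name `fuse_rgb_depth`, the red-to-blue colormap, and the blend weight `alpha` are illustrative choices, not details from the paper.

```python
import numpy as np

def fuse_rgb_depth(rgb: np.ndarray, depth: np.ndarray, alpha: float = 0.6) -> np.ndarray:
    """Overlay a colorized depth map on an RGB frame.

    rgb:   (H, W, 3) uint8 frame.
    depth: (H, W) per-pixel depth on any monotone scale (e.g. meters).
    alpha: weight kept for the original RGB appearance.
    """
    # Normalize depth to [0, 1]; guard against a constant depth map.
    d = depth.astype(np.float32)
    d = (d - d.min()) / max(float(d.max() - d.min()), 1e-6)

    # Simple red-to-blue colormap: near surfaces red, far surfaces blue.
    near = 1.0 - d
    depth_rgb = np.stack([near, np.zeros_like(d), d], axis=-1) * 255.0

    # Alpha-blend so the model still sees the scene's appearance,
    # now overlaid with an explicit geometric cue.
    fused = alpha * rgb.astype(np.float32) + (1.0 - alpha) * depth_rgb
    return fused.clip(0, 255).astype(np.uint8)

# Example: fuse a synthetic frame with a synthetic depth map.
if __name__ == "__main__":
    frame = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
    depth = np.linspace(0.5, 20.0, 480 * 640, dtype=np.float32).reshape(480, 640)
    fused = fuse_rgb_depth(frame, depth)
    print(fused.shape, fused.dtype)  # (480, 640, 3) uint8
```

The fused frames would simply replace the raw RGB frames fed to the VLM, which is consistent with the abstract's stated constraint of leaving model architectures and inference procedures unmodified.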