🤖 AI Summary
This work addresses zero-shot spatial question answering over egocentric video: reasoning about 3D object positions, scene affordances, and directional relationships without fine-tuning or 3D sensor input. The authors propose SpatioRoute, a dynamic prompt routing method that sends each question to a semantically tailored prompt template and relies on video input alone. It has two modes: SpatioRoute-R, a rule-based router that maps question types to specialized templates, and SpatioRoute-L, which uses a large language model to generate a task-specific prompt from the question and situational context. The study also finds that chain-of-thought (CoT) prompting degrades performance on Qwen series models in this setting. Evaluated on the SQA3D benchmark, SpatioRoute improves overall accuracy by up to 5% over fixed-prompt baselines, which the authors report as a new state of the art for zero-shot, video-only spatial visual question answering.
📝 Abstract
Spatial question answering over egocentric video is a challenging task that requires Vision-Language Models (VLMs) to reason about 3D object positions, scene affordances, and directional relationships, particularly in the zero-shot setting where no task-specific fine-tuning is available. We introduce SpatioRoute, a dynamic prompt generation approach that routes each incoming question to a semantically tailored prompt template, without any additional training, fine-tuning, or 3D sensor input. SpatioRoute operates in two complementary modes: SpatioRoute-R, a rule-based router that deterministically maps question types (e.g., What, Is, How, Can, Which) to specialized prompt templates; and SpatioRoute-L, an LLM-driven approach that generates task-specific prompts from the question and situational context alone, with no video input at routing time. We evaluate SpatioRoute on the SQA3D benchmark across VLMs spanning multiple model families. SpatioRoute achieves consistent overall accuracy gains of up to 5% over fixed-prompt baselines, establishing a new state of the art for zero-shot, video-only spatial VQA without requiring 3D point-cloud inputs. As an additional finding, we observe that Chain-of-Thought (CoT) prompting, implemented via the Think it Twice architecture, consistently degrades performance on Qwen series models in this setting, indicating that question-aware routing is more effective than uniform reasoning instructions for spatial video understanding.
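To make the rule-based mode concrete, the following is a minimal sketch of how a deterministic router in the spirit of SpatioRoute-R might look. It assumes the question type can be read from the first word of the question. The template wording, the fallback template, and the function names `route_question` and `build_prompt` are hypothetical illustrations and are not taken from the paper.

```python
# Hypothetical prompt templates keyed by question type. The abstract names the
# types (What, Is, How, Can, Which) but not the template text, so the wording
# here is illustrative only.
TEMPLATES = {
    "what": "Situation: {situation}\nIdentify the object or attribute asked about.\nQuestion: {question}\nAnswer with a single word or short phrase.",
    "is": "Situation: {situation}\nDecide whether the statement holds in the scene.\nQuestion: {question}\nAnswer yes or no.",
    "how": "Situation: {situation}\nCount or measure what is asked about.\nQuestion: {question}\nAnswer with a number or short phrase.",
    "can": "Situation: {situation}\nJudge whether the action is possible from the described position.\nQuestion: {question}\nAnswer yes or no.",
    "which": "Situation: {situation}\nPick the direction or object that matches.\nQuestion: {question}\nAnswer with a single word or short phrase.",
}

# Assumed fallback for questions that match no rule.
DEFAULT_TEMPLATE = "Situation: {situation}\nQuestion: {question}\nAnswer briefly."


def route_question(question: str) -> str:
    """Return the question type from the first word, or "default" if no rule matches."""
    words = question.strip().lower().split()
    first = words[0] if words else ""
    return first if first in TEMPLATES else "default"


def build_prompt(situation: str, question: str) -> str:
    """Fill the template selected by the router with the situation and question."""
    qtype = route_question(question)
    template = TEMPLATES.get(qtype, DEFAULT_TEMPLATE)
    return template.format(situation=situation, question=question)


if __name__ == "__main__":
    print(build_prompt(
        "I am standing by the sofa, facing the window.",
        "Can I reach the lamp without standing up?",
    ))
```

Because routing in this sketch depends only on the question text, it needs no video input and no model call, which matches the deterministic behavior the abstract attributes to the rule-based mode. The resulting prompt would then be passed to the VLM together with the video frames.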