🤖 AI Summary
Existing spatial vision-language models (VLMs) lack basic camera motion understanding, which limits their three-dimensional spatial understanding even when they score well on direct spatial question answering. To expose this gap, the work proposes the Spatial Narrative Score (SNS), an evaluation framework in which a model must generate an explicit spatial narrative covering both scene semantics and camera motion; a frozen proxy large language model then reasons over that narrative. Under SNS, state-of-the-art spatial VLMs degrade significantly. The authors also introduce CaMo, a camera-motion-grounded VLM that performs consistently on both SNS evaluation and direct spatial question answering. The results underscore the role of explicit spatial narratives in evaluating transferable 3D spatial understanding.
📝 Abstract
Vision-Language Models (VLMs) achieve strong performance on spatial question answering benchmarks, yet it remains unclear whether such gains reflect genuine spatial intelligence. We show that existing spatial VLMs lack basic camera motion understanding, a key component of spatial cognition. We propose the Spatial Narrative Score (SNS), an evaluation framework that requires VLMs to generate explicit spatial narratives capturing both scene semantics and camera motion, followed by reasoning with a frozen proxy LLM. Under SNS, state-of-the-art spatial VLMs exhibit significant performance degradation despite high accuracy on direct question answering. To address this gap, we introduce CaMo, a camera-motion-grounded VLM that achieves consistent performance across SNS evaluation and direct spatial question answering. Our results highlight the importance of explicit spatial narrative externalization for evaluating VLMs with transferable 3D spatial understanding. Our code, data, and model are available at https://github.com/hsiangwei0903/CaMo
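The two-stage evaluation described in the abstract (narrative generation by the VLM, then reasoning by a frozen proxy LLM) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Sample` type, the `narrate` and `proxy_answer` callables, and scoring by case-insensitive exact-match accuracy are all assumptions made for this sketch, since the abstract does not specify the metric or interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Sample:
    video: str      # hypothetical: path or ID of a video clip
    question: str   # spatial question about the clip
    answer: str     # ground-truth answer


def narrative_score(
    samples: Sequence[Sample],
    narrate: Callable[[str], str],            # assumed VLM wrapper: video -> narrative
    proxy_answer: Callable[[str, str], str],  # assumed frozen LLM: (narrative, question) -> answer
) -> float:
    """Fraction of questions the proxy answers correctly from the narrative alone."""
    if not samples:
        raise ValueError("samples must be non-empty")
    correct = 0
    for s in samples:
        # Stage 1: the VLM under test externalizes what it saw as text.
        narrative = narrate(s.video)
        # Stage 2: the proxy never sees the video, only the narrative.
        prediction = proxy_answer(narrative, s.question)
        # Assumed metric: case-insensitive exact match.
        correct += prediction.strip().lower() == s.answer.strip().lower()
    return correct / len(samples)
```

Because the proxy only sees text, any spatial fact missing from the narrative (for example, the direction of camera motion) cannot be recovered at the reasoning stage, which is what would make such a score sensitive to camera motion understanding.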