🤖 AI Summary
This study investigates whether purely video-based generative models can acquire human-like visuospatial intelligence solely from visual inputs. To this end, we propose Video4Spatial, a video diffusion framework that performs end-to-end spatial understanding and planning without geometric priors such as depth maps or camera poses, relying exclusively on the spatiotemporal context of videos. Our key contribution is the first empirical demonstration that pure video diffusion models can solve complex tasks requiring 3D geometric reasoning, including scene navigation and object localization, while supporting long-context modeling and cross-domain generalization. Through curated video dataset construction, scene-context-conditional generation, and a spatial-consistency constraint mechanism, the model significantly improves instruction-following fidelity and spatial coherence in semantic localization and path planning. It further exhibits strong zero-shot generalization to unseen environments.
📝 Abstract
We investigate whether video generative models can exhibit visuospatial intelligence, a capability central to human cognition, using only visual data. To this end, we present Video4Spatial, a framework showing that video diffusion models conditioned solely on video-based scene context can perform complex spatial tasks. We validate on two tasks: scene navigation, which requires following camera-pose instructions while remaining consistent with the scene's 3D geometry, and object grounding, which requires semantic localization, instruction following, and planning. Both tasks use video-only inputs, without auxiliary modalities such as depth or camera poses. With simple yet effective design choices in the framework and data curation, Video4Spatial demonstrates strong spatial understanding from video context: it plans navigation and grounds target objects end-to-end, follows camera-pose instructions while maintaining spatial consistency, and generalizes to long contexts and out-of-domain environments. Taken together, these results advance video generative models toward general visuospatial reasoning.