Video4Spatial: Towards Visuospatial Intelligence with Context-Guided Video Generation

📅 2025-12-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether purely video-based generative models can acquire human-like visuospatial intelligence solely from visual inputs. To this end, we propose Video4Spatial, a video diffusion framework that performs end-to-end spatial understanding and planning without geometric priors such as depth maps or camera poses, relying exclusively on the spatiotemporal context in videos. Our key contribution is the first empirical demonstration that pure video diffusion models can solve complex tasks requiring 3D geometric reasoning, including scene navigation and object localization, while supporting long-context modeling and cross-domain generalization. Through curated video dataset construction, scene-context-conditional generation, and a spatial-consistency constraint, the model significantly improves instruction-following fidelity and spatial coherence in semantic localization and path planning. It further exhibits strong zero-shot generalization to unseen environments.
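The paper itself ships no code here, but its central idea, conditioning a video diffusion backbone on clean scene-context frames so that 3D layout is inferred from video alone, can be sketched. The snippet below is a minimal, hypothetical PyTorch sketch: `ContextConditionedDenoiser`, its token shapes, and the joint-attention design are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch (not the authors' code): a transformer denoiser that
# jointly attends over clean context-video latents and noisy target latents,
# so spatial layout comes from video context alone -- no depth maps or poses.
import torch
import torch.nn as nn

class ContextConditionedDenoiser(nn.Module):
    def __init__(self, dim=512, depth=8, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=depth)
        self.time_emb = nn.Sequential(nn.Linear(1, dim), nn.SiLU(),
                                      nn.Linear(dim, dim))
        self.out = nn.Linear(dim, dim)

    def forward(self, noisy_target, context, t):
        # noisy_target: (B, T_tgt, dim) latents of the clip being generated
        # context:      (B, T_ctx, dim) clean latents of the scene-context video
        # t:            (B,) diffusion time in [0, 1]
        temb = self.time_emb(t[:, None])[:, None]        # (B, 1, dim)
        tokens = torch.cat([context, noisy_target + temb], dim=1)
        tokens = self.backbone(tokens)                   # joint spatiotemporal attention
        # Predict a velocity (flow-matching target) for the target tokens only.
        return self.out(tokens[:, context.shape[1]:])
```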

📝 Abstract
We investigate whether video generative models can exhibit visuospatial intelligence, a capability central to human cognition, using only visual data. To this end, we present Video4Spatial, a framework showing that video diffusion models conditioned solely on video-based scene context can perform complex spatial tasks. We validate on two tasks: scene navigation, which requires following camera-pose instructions while remaining consistent with the 3D geometry of the scene, and object grounding, which requires semantic localization, instruction following, and planning. Both tasks use video-only inputs, without auxiliary modalities such as depth or poses. With simple yet effective design choices in the framework and data curation, Video4Spatial demonstrates strong spatial understanding from video context: it plans navigation and grounds target objects end-to-end, follows camera-pose instructions while maintaining spatial consistency, and generalizes to long contexts and out-of-domain environments. Taken together, these results advance video generative models toward general visuospatial reasoning.
Problem

Research questions and friction points this paper is trying to address.

Enabling video models to perform complex spatial tasks
Validating scene navigation and object grounding using video-only inputs
Advancing video generative models toward general visuospatial reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

A video diffusion model conditioned solely on video-based scene context
Spatial tasks performed from video inputs alone, with no auxiliary modalities such as depth or poses
End-to-end planning for navigation and object grounding with spatial consistency (a minimal inference sketch follows this list)
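To make the end-to-end claim concrete, here is how such a model might be driven at inference time: a minimal flow-matching-style Euler sampler over the hypothetical denoiser sketched above. The velocity-prediction convention, step count, and omitted video-VAE decoding are assumptions for illustration, not the paper's actual procedure.

```python
@torch.no_grad()
def sample(model, context, t_tgt=16, dim=512, steps=50):
    """Generate target-clip latents conditioned only on the context video."""
    b = context.shape[0]
    x = torch.randn(b, t_tgt, dim, device=context.device)  # start from pure noise
    dt = 1.0 / steps
    for i in reversed(range(1, steps + 1)):
        t = torch.full((b,), i * dt, device=context.device)
        v = model(x, context, t)   # predicted velocity toward the data manifold
        x = x - v * dt             # Euler step from time t to t - dt
    return x  # in practice, decode these latents into frames with a video VAE
```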
👥 Authors
Zeqi Xiao
Netflix, Nanyang Technological University
Yiwei Zhao
Netflix
Lingxiao Li
Netflix
Yushi Lan
VGG@Oxford, UK
Computer Vision, Computer Graphics
Yu Ning
Netflix Eyeline Studios
Rahul Garg
Netflix
Roshni Cooper
Netflix
M. H. Taghavi
Netflix
Xingang Pan
Assistant Professor, MMLab@NTU, Nanyang Technological University
Computer Vision, Deep Learning, Computer Graphics