🤖 AI Summary
This study investigates whether purely video-based generative models can acquire human-like visuospatial intelligence solely from visual inputs. To this end, we propose Video4Spatial, a video diffusion framework that performs end-to-end spatial understanding and planning without geometric priors such as depth maps or camera poses, relying exclusively on the spatiotemporal context of videos. Our key contribution is the first empirical demonstration that pure video diffusion models can solve complex tasks requiring 3D geometric reasoning, including scene navigation and object localization, while supporting long-context modeling and cross-domain generalization. Through curated video dataset construction, scene-context-conditional generation, and a spatial-consistency constraint mechanism, the model significantly improves instruction-following fidelity and spatial coherence in semantic localization and path planning. It further exhibits strong zero-shot generalization to unseen environments.
📝 Abstract
We investigate whether video generative models can exhibit visuospatial intelligence, a capability central to human cognition, using only visual data. To this end, we present Video4Spatial, a framework showing that video diffusion models conditioned solely on video-based scene context can perform complex spatial tasks. We validate on two tasks: scene navigation, which requires following camera-pose instructions while remaining consistent with the scene's 3D geometry, and object grounding, which requires semantic localization, instruction following, and planning. Both tasks use video-only inputs, without auxiliary modalities such as depth or camera poses. With simple yet effective design choices in the framework and data curation, Video4Spatial demonstrates strong spatial understanding from video context: it plans navigation and grounds target objects end-to-end, follows camera-pose instructions while maintaining spatial consistency, and generalizes to long contexts and out-of-domain environments. Taken together, these results advance video generative models toward general visuospatial reasoning.