Which Pretraining Paradigm Better Serves Spatial Intelligence? An Empirical Comparison of Vision-Language and Video Generation Models

📅 2026-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically evaluates the representations of vision-language models (VLMs) and video generation models (VGMs) on spatial intelligence tasks. Using frozen-feature probing, it compares the two model families along three axes: semantic tagging, instance grouping, and 3D geometry prediction. The results show that VLMs and VGMs are complementary in spatial understanding: VLMs are stronger at semantic tagging and instance grouping, whereas VGMs provide more accessible signals for dense geometry and camera motion. A naive fusion of the two representations performs well on both semantics and geometry.
📝 Abstract
Spatial intelligence requires visual representations that capture both semantic objects and geometric structure in the physical world. Two major pre-training schemes are now widely used as foundation backbones for this purpose: Vision-Language Models (VLMs), which use language supervision to align visual observations with semantic concepts, and Video Generation Models (VGMs), which learn from temporally evolving visual worlds. However, it remains unclear which pre-training scheme provides a better representation substrate for spatial intelligence. In this paper, we present the first systematic frozen-feature probing study of VLMs and VGMs across three representative axes of spatial intelligence: semantic tagging, instance grouping, and 3D geometry prediction. By using a lightweight probe, our framework enables a controlled comparison of what information is already encoded in the frozen representations of the two model families. Experimental results reveal a clear complementarity: VLMs are stronger at semantic tagging and instance grouping, while VGMs provide more accessible signals for dense geometry and camera motion. Moreover, a naive fusion of the two already yields a representation that excels at both geometry and semantics, suggesting a promising direction for building stronger spatial-intelligence backbones by effectively integrating features from both model families. Our code is available at https://github.com/om-ai-lab/Probing-VLM-VGM.
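To make the idea of frozen-feature probing and naive fusion concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. The abstract does not specify the probe architecture, the backbones, or the fusion operator, so everything here is hypothetical: a closed-form ridge-regression probe stands in for the "lightweight probe", channel-wise concatenation stands in for "naive fusion", and the features are synthetic toy arrays rather than outputs of any real VLM or VGM. The toy numbers say nothing about the paper's results.

```python
import numpy as np


def add_bias(feats):
    """Append a constant column so the probe can learn an offset."""
    return np.hstack([feats, np.ones((feats.shape[0], 1))])


def fit_linear_probe(feats, targets, l2=1e-3):
    """Fit a ridge-regression probe on frozen features.

    Only the probe weights are learned; the features are treated as
    fixed inputs, which is the defining property of frozen-feature probing.
    """
    x = add_bias(feats)
    gram = x.T @ x + l2 * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ targets)


def probe_mse(weights, feats, targets):
    """Mean squared error of the probe's predictions."""
    preds = add_bias(feats) @ weights
    return float(np.mean((preds - targets) ** 2))


def fuse(feats_a, feats_b):
    """Assumed 'naive fusion': concatenate features along the channel axis."""
    return np.concatenate([feats_a, feats_b], axis=1)


def toy_comparison(n=200, seed=0):
    """Probe three synthetic representations against the same target.

    Hypothetical setup: one feature set encodes only a 'semantic' latent,
    the other only a 'geometric' latent, and the target depends on both.
    For brevity the error is measured on the data the probe was fit on
    (no held-out split).
    """
    rng = np.random.default_rng(seed)
    sem = rng.normal(size=(n, 4))  # stand-in semantic latent
    geo = rng.normal(size=(n, 4))  # stand-in geometric latent
    feats_vlm_like = sem @ rng.normal(size=(4, 16))
    feats_vgm_like = geo @ rng.normal(size=(4, 16))
    target = sem[:, :1] + geo[:, :1]

    candidates = {
        "vlm_like": feats_vlm_like,
        "vgm_like": feats_vgm_like,
        "fused": fuse(feats_vlm_like, feats_vgm_like),
    }
    return {
        name: probe_mse(fit_linear_probe(feats, target), feats, target)
        for name, feats in candidates.items()
    }


if __name__ == "__main__":
    for name, err in toy_comparison().items():
        print(f"{name}: probe MSE = {err:.4f}")
```

The design point the sketch illustrates is that every representation is scored by the same small probe, so differences in error reflect what each frozen representation already encodes rather than how much capacity the probe has.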
Problem

Research questions and friction points this paper is trying to address.

spatial intelligence
Vision-Language Models
Video Generation Models
pretraining paradigm
visual representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

spatial intelligence
vision-language models
video generation models
frozen-feature probing
representation fusion