World2VLM: Distilling World Model Imagination into VLMs for Dynamic Spatial Reasoning

📅 2026-04-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the difficulty vision-language models (VLMs) have in reasoning about dynamic spatial transformations under egocentric motion. The authors propose World2VLM, a training framework that uses a generative world model as a teacher and distills its spatial imagination into the VLM during training, so that the VLM internalizes both forward (action-to-outcome) and inverse (outcome-to-action) spatial relationships. Given an initial observation and a parameterized camera trajectory, a view-consistent world model synthesizes geometrically aligned future views, from which structured supervision signals are constructed; the VLM is then post-trained with a two-stage recipe. Experiments show consistent improvements over the base model on SAT-Real, SAT-Synthesized, VSI-Bench, and MindCube. World2VLM also outperforms methods that couple a VLM with a world model at inference time, while avoiding their computational overhead.

📝 Abstract

Vision-language models (VLMs) have shown strong performance on static visual understanding, yet they still struggle with dynamic spatial reasoning that requires imagining how scenes evolve under egocentric motion. Recent efforts address this limitation either by scaling spatial supervision with synthetic data or by coupling VLMs with world models at inference time. However, the former often lacks explicit modeling of motion-conditioned state transitions, while the latter incurs substantial computational overhead. In this work, we propose World2VLM, a training framework that distills spatial imagination from a generative world model into a vision-language model. Given an initial observation and a parameterized camera trajectory, we use a view-consistent world model to synthesize geometrically aligned future views and derive structured supervision for both forward (action-to-outcome) and inverse (outcome-to-action) spatial reasoning. We post-train the VLM with a two-stage recipe on a compact dataset generated by this pipeline and evaluate it on multiple spatial reasoning benchmarks. World2VLM delivers consistent improvements over the base model across diverse benchmarks, including SAT-Real, SAT-Synthesized, VSI-Bench, and MindCube. It also outperforms test-time world-model-coupled methods while eliminating the need for expensive inference-time generation. Our results suggest that world models can serve not only as inference-time tools but also as effective training-time teachers, enabling VLMs to internalize spatial imagination in a scalable and efficient manner.
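
To make the forward/inverse supervision idea concrete, here is a minimal Python sketch of how paired training samples might be derived from a camera trajectory. It is an illustration under stated assumptions, not the authors' pipeline: the abstract does not specify the data format, so `CameraAction`, `Sample`, `build_samples`, the question wording, and the `render` callable (a stand-in for the view-consistent world model) are all hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class CameraAction:
    """One step of a parameterized camera trajectory (hypothetical schema)."""
    kind: str         # e.g. "move_forward", "turn_left", "turn_right"
    magnitude: float  # metres for translations, degrees for rotations

    def describe(self) -> str:
        unit = "m" if self.kind.startswith("move") else "deg"
        return f"{self.kind} by {self.magnitude:g} {unit}"


@dataclass(frozen=True)
class Sample:
    task: str           # "forward" (action-to-outcome) or "inverse" (outcome-to-action)
    context: List[str]  # view identifiers shown to the VLM
    question: str
    choices: List[str]
    answer: str


def build_samples(
    start_view: str,
    trajectory: List[CameraAction],
    action_pool: List[CameraAction],
    render: Callable[[str, CameraAction], str],  # assumed world-model interface
    seed: int = 0,
) -> List[Sample]:
    """Emit one forward and one inverse sample per trajectory step."""
    rng = random.Random(seed)
    samples: List[Sample] = []
    view = start_view
    for action in trajectory:
        next_view = render(view, action)
        distractors = [a for a in action_pool if a != action]

        # Forward: given the view and the action, pick the resulting view.
        view_choices = [next_view] + [render(view, a) for a in distractors]
        rng.shuffle(view_choices)
        samples.append(Sample(
            task="forward",
            context=[view],
            question=f"The camera will {action.describe()}. Which view results?",
            choices=view_choices,
            answer=next_view,
        ))

        # Inverse: given the views before and after, pick the action.
        action_choices = [action.describe()] + [a.describe() for a in distractors]
        rng.shuffle(action_choices)
        samples.append(Sample(
            task="inverse",
            context=[view, next_view],
            question="Which camera motion turns the first view into the second?",
            choices=action_choices,
            answer=action.describe(),
        ))

        view = next_view  # continue the trajectory from the synthesized view
    return samples
```

In this sketch the distractor views come from rendering other actions from the same starting view, so wrong answers differ from the correct one only in the motion applied. That is a design choice made for the illustration, not something the abstract states.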

Problem

Research questions and friction points this paper is trying to address.

dynamic spatial reasoning

vision-language models

world models

egocentric motion

spatial imagination

Innovation

Methods, ideas, or system contributions that make the work stand out.

World2VLM

world model distillation

dynamic spatial reasoning