Can World Models Benefit VLMs for World Dynamics?

📅 2025-10-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates whether world models can replace conventional visual encoders to enhance vision-language models' (VLMs) understanding of dynamic scenes. To this end, we re-purpose a video diffusion model as a generative encoder, treating its single-step denoising latents as lightweight, spatiotemporally consistent visual representations; we call the resulting class of models World-Language Models (WorldLMs) and name the best-performing variant the Dynamic Vision Aligner (DyVA), which endows single-image VLMs with multi-frame reasoning capability. To our knowledge, this is the first systematic study validating world models as general-purpose visual encoders. We further curate a multi-task benchmark for dynamic visual reasoning, on which DyVA surpasses or matches leading open-source and proprietary baselines, achieving state-of-the-art or comparable performance and confirming that world-model priors improve VLMs' spatial and temporal understanding.
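For concreteness, the sketch below illustrates the single-step denoising idea described in the summary: noise a clip's latents once, run the denoiser one step, and keep the intermediate latents as visual tokens. The `video_vae` and `denoiser` objects, the fixed timestep, and the noise level are assumed stand-ins for the paper's (unspecified) video diffusion backbone and configuration, not the actual WorldLM implementation.

```python
import torch

@torch.no_grad()
def extract_world_latents(frames, video_vae, denoiser, t=500, alpha_bar=0.7):
    """Single-step denoising latents as visual embeddings (illustrative sketch).

    `video_vae`, `denoiser`, the timestep `t`, and the noise level `alpha_bar`
    are assumed stand-ins for an unspecified video diffusion backbone.
    frames: (B, T, 3, H, W) video clip scaled to [-1, 1].
    """
    B, T = frames.shape[:2]
    # 1. Encode every frame into the diffusion latent space.
    z0 = video_vae.encode(frames.flatten(0, 1))                # (B*T, C, h, w)
    # 2. Apply the standard forward-diffusion noising at one fixed timestep.
    noise = torch.randn_like(z0)
    zt = (alpha_bar ** 0.5) * z0 + ((1.0 - alpha_bar) ** 0.5) * noise
    # 3. Run the denoiser for a single step and keep its latent output,
    #    instead of iterating all the way to a generated video.
    timesteps = torch.full((B * T,), t, device=frames.device, dtype=torch.long)
    latents = denoiser(zt, timestep=timesteps)                 # (B*T, C, h, w)
    # 4. Flatten spatial positions into a per-clip token sequence.
    BT, C, h, w = latents.shape
    tokens = latents.flatten(2).transpose(1, 2)                # (B*T, h*w, C)
    return tokens.reshape(B, T * h * w, C)                     # (B, T*h*w, C)
```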

📝 Abstract
Trained on internet-scale video data, generative world models are increasingly recognized as powerful world simulators that can generate consistent and plausible dynamics over structure, motion, and physics. This raises a natural question: with the advent of strong video foundation models, might they supplant conventional vision-encoder paradigms for general-purpose multimodal understanding? While recent studies have begun to explore the potential of world models on common vision tasks, these explorations typically lack a systematic investigation of generic, multimodal tasks. In this work, we investigate what capabilities emerge when world model priors are transferred into Vision-Language Models: we re-purpose a video diffusion model as a generative encoder, perform a single denoising step, and treat the resulting latents as a set of visual embeddings. We empirically study this class of models, which we refer to as World-Language Models (WorldLMs), and find that generative encoders capture latents that are useful for downstream understanding and distinct from those of conventional encoders. Naming our best-performing variant Dynamic Vision Aligner (DyVA), we further discover that this method significantly enhances spatial reasoning and enables single-image models to perform multi-frame reasoning. Through a curated suite of visual reasoning tasks, we find that DyVA surpasses both open-source and proprietary baselines, achieving state-of-the-art or comparable performance. We attribute these gains to the motion consistency WorldLMs internalize from video pre-training. Finally, we systematically explore a wide range of model designs to highlight promising directions for future work. We hope our study paves the way for a new family of VLMs that leverage world-model priors and are on a promising path towards generalist vision learners.
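The aligner that connects these latents to the language model is not described in this card; a minimal sketch of how such a module could map world-model latents into the LLM's token space, assuming the common two-layer MLP connector design, is given below. The class name, dimensions, and architecture are assumptions for illustration, not the paper's actual DyVA design.

```python
import torch
import torch.nn as nn

class DynamicVisionAligner(nn.Module):
    """Minimal sketch of a DyVA-style aligner (assumed two-layer MLP connector;
    the paper's actual architecture is not described in this summary)."""

    def __init__(self, latent_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(latent_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, world_latents: torch.Tensor) -> torch.Tensor:
        # world_latents: (B, N, latent_dim) tokens from the single-step
        # denoising pass over one or more frames; the projected tokens are
        # concatenated with text embeddings and fed to the base VLM's LLM.
        return self.proj(world_latents)


# Hypothetical usage: latents from a 4-frame clip aligned to a 4096-dim LLM.
aligner = DynamicVisionAligner(latent_dim=1024, llm_dim=4096)
visual_tokens = aligner(torch.randn(1, 4 * 256, 1024))  # (1, 1024, 4096)
```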
Problem

Research questions and friction points this paper is trying to address.

Investigating the integration of world model priors into Vision-Language Models
Exploring generative encoders for general-purpose multimodal understanding tasks
Enhancing spatial and temporal reasoning via motion consistency inherited from video pre-training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Video diffusion model repurposed as generative encoder
Single denoising step produces visual embeddings
World model priors enhance spatial reasoning capabilities