Grounded World Model for Semantically Generalizable Planning

📅 2026-04-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of conventional visuomotor model predictive control (MPC): planning scores predicted outcomes against a goal image, which is hard to obtain in advance in new environments and offers less interactivity than natural language. The authors propose the Grounded World Model (GWM), a world model learned in a vision-language-aligned latent space rather than in the latent space of a vision-only encoder such as DINO or JEPA. Each action proposal is scored by the similarity between the embedding of its predicted outcome and the embedding of the task instruction, so no goal image is needed and MPC can plan directly from natural language instructions. On the WISER benchmark, whose test set comprises 288 tasks with unseen visual signals and referring expressions, GWM-MPC achieves an 87% success rate, while traditional vision-language-action (VLA) models average 22% despite reaching 90% on the training set.
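The contrast between goal-image scoring and instruction-based scoring can be written compactly. The notation below is an illustrative assumption, not taken from the paper: $\mathcal{A}$ is the set of action proposals, $\hat{z}_a$ the predicted latent outcome of proposal $a$, $\phi$ a vision encoder, $o_g$ a goal image, $\psi$ a text encoder in the aligned space, $\ell$ the task instruction, $d$ a distance, and $\mathrm{sim}$ a similarity such as cosine similarity.

```latex
% Goal-image scoring (conventional visuomotor MPC): needs a goal image o_g
a^{*} = \arg\min_{a \in \mathcal{A}} \; d\bigl(\hat{z}_a,\ \phi(o_g)\bigr)

% Instruction scoring (GWM-style): needs only the task instruction \ell
a^{*} = \arg\max_{a \in \mathcal{A}} \; \mathrm{sim}\bigl(\hat{z}_a,\ \psi(\ell)\bigr)
```

The second form only makes sense if $\hat{z}_a$ and $\psi(\ell)$ live in the same vision-language-aligned space, which is why the world model is learned there.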

📝 Abstract

In Model Predictive Control (MPC), world models predict the future outcomes of various action proposals, which are then scored to guide the selection of the optimal action. For visuomotor MPC, the score function is a distance metric between a predicted image and a goal image, measured in the latent space of a pretrained vision encoder such as DINO or JEPA. However, it is challenging to obtain the goal image before task execution, particularly in new environments. Additionally, conveying the goal through an image offers limited interactivity compared with natural language. In this work, we propose to learn a Grounded World Model (GWM) in a vision-language-aligned latent space. As a result, each proposed action is scored based on how close its future outcome is to the task instruction, as reflected by the similarity of their embeddings. This approach transforms visuomotor MPC into a VLA that surpasses VLM-based VLAs in semantic generalization. On the proposed WISER benchmark, GWM-MPC achieves an 87% success rate on the test set, which comprises 288 tasks that feature unseen visual signals and referring expressions yet remain solvable with motions demonstrated during training. In contrast, traditional VLAs achieve an average success rate of 22%, even though they overfit the training set with a 90% success rate.
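The scoring step described in the abstract can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the authors' implementation: `toy_world_model` is a hypothetical stand-in for a learned world model (it simply adds the action to the latent state), the two-dimensional vectors stand in for embeddings in a vision-language-aligned space, and cosine similarity is assumed as the similarity measure.

```python
import numpy as np


def cosine_similarity(rows, vector):
    """Cosine similarity between each row of `rows` and `vector`."""
    rows_unit = rows / np.linalg.norm(rows, axis=-1, keepdims=True)
    vector_unit = vector / np.linalg.norm(vector)
    return rows_unit @ vector_unit


def toy_world_model(state, actions):
    """Hypothetical world model: predicted latent = state + action.

    A real model would be a learned network; this is only a placeholder
    so the planning loop below is runnable.
    """
    return state[None, :] + actions


def select_action(state, instruction_embedding, actions):
    """Score every action proposal against the instruction embedding
    and return the index of the best one along with all scores."""
    predicted = toy_world_model(state, actions)
    scores = cosine_similarity(predicted, instruction_embedding)
    return int(np.argmax(scores)), scores


# Toy data (assumed): current latent state, instruction embedding,
# and three candidate actions.
state = np.array([1.0, 0.0])
instruction_embedding = np.array([0.0, 1.0])
actions = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0]])

best, scores = select_action(state, instruction_embedding, actions)
print(best)  # 2: the third proposal lands exactly on the instruction direction
```

No goal image appears anywhere in the loop; the instruction embedding alone defines what a good outcome looks like. In a full MPC setting this selection would be repeated at every control step, typically over multi-step action sequences rather than single actions.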

Problem

Research questions and friction points this paper is trying to address.

Model Predictive Control

Visuomotor Planning

Semantic Generalization

Vision-Language Alignment

Goal Specification

Innovation

Methods, ideas, or system contributions that make the work stand out.

Grounded World Model

Model Predictive Control

Vision-Language Alignment