Planning with Reasoning using Vision Language World Model

📅 2025-09-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing world models have limited capacity for semantic and temporal abstraction, which hinders agents’ understanding of high-level goals and long-horizon reasoning. To address this, we propose the Vision Language World Model (VLWM), a foundation model trained for language-based world modeling on natural videos. Its training targets are extracted by iterative LLM Self-Refine conditioned on a Tree of Captions, a compressed representation of future observations. The model supports two planning modes: “System 1” (reactive plan decoding from the learned action policy) and “System 2” (reflective planning that searches for the plan minimizing a cost assigned by a self-supervised critic). VLWM achieves state-of-the-art performance on Visual Planning for Assistance (VPA) and outperforms strong VLM baselines on RoboVQA and WorldPrediction. In the PlannerArena human evaluation, System 2 improves the Elo score by 27% over System 1.

📝 Abstract

Effective planning requires strong world models, but high-level world models that can understand and reason about actions with semantic and temporal abstraction remain largely underdeveloped. We introduce the Vision Language World Model (VLWM), a foundation model trained for language-based world modeling on natural videos. Given visual observations, the VLWM first infers the overall goal achievements and then predicts a trajectory composed of interleaved actions and world state changes. Those targets are extracted by iterative LLM Self-Refine conditioned on compressed future observations represented by a Tree of Captions. The VLWM learns both an action policy and a dynamics model, which respectively facilitate reactive system-1 plan decoding and reflective system-2 planning via cost minimization. The cost evaluates the semantic distance between the hypothetical future states given by VLWM roll-outs and the expected goal state, and is measured by a critic model that we trained in a self-supervised manner. The VLWM achieves state-of-the-art Visual Planning for Assistance (VPA) performance on both benchmark evaluations and our proposed PlannerArena human evaluations, where system-2 improves the Elo score by 27% over system-1. The VLWM models also outperform strong VLM baselines on the RoboVQA and WorldPrediction benchmarks.
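
The system-2 idea described in the abstract, choosing among candidate roll-outs the one whose predicted final state is closest to the goal, can be illustrated with a minimal Python sketch. Everything named here is hypothetical and not taken from the paper: the `Rollout` container, the `select_plan` helper, and especially `toy_cost`, which is a word-overlap (Jaccard) stand-in for the learned critic and is not the paper's cost function. In the actual system the candidate roll-outs would come from the VLWM itself.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Rollout:
    """Hypothetical container for one candidate plan and its predicted outcome."""
    actions: List[str]   # candidate action sequence
    final_state: str     # text description of the predicted world state


def toy_cost(state: str, goal: str) -> float:
    """Stand-in for the critic (assumption): 1 - Jaccard overlap of word sets.

    Lower cost means the predicted state is closer to the goal state.
    """
    s, g = set(state.lower().split()), set(goal.lower().split())
    if not s and not g:
        return 0.0
    return 1.0 - len(s & g) / len(s | g)


def select_plan(
    rollouts: Sequence[Rollout],
    goal: str,
    cost_fn: Callable[[str, str], float] = toy_cost,
) -> Rollout:
    """Return the roll-out whose predicted final state has the lowest cost."""
    if not rollouts:
        raise ValueError("need at least one candidate roll-out")
    return min(rollouts, key=lambda r: cost_fn(r.final_state, goal))


# Two hand-written candidates standing in for model roll-outs.
candidates = [
    Rollout(["crack eggs", "whisk eggs"], "eggs are whisked in a bowl"),
    Rollout(
        ["crack eggs", "whisk eggs", "pour into pan", "cook"],
        "a cooked omelette is on the pan",
    ),
]
goal = "a cooked omelette is on the plate"
best = select_plan(candidates, goal)
print(best.actions)
```

Under this toy cost, the second candidate is selected because its predicted state shares most of its words with the goal. Swapping `cost_fn` for a learned scorer would leave the selection logic unchanged, which is the point of separating the critic from the search.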

Problem

Research questions and friction points this paper is trying to address.

Developing high-level world models for semantic and temporal reasoning

Integrating vision and language for action and state prediction

Enhancing visual planning performance through self-supervised cost minimization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision Language World Model for semantic reasoning

Self-supervised critic model for cost evaluation

Interleaved action-state trajectory prediction via LLM refinement
