Learning Vision-Language-Action World Models for Autonomous Driving

📅 2026-04-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current vision-language-action (VLA) models often lack explicit modeling of temporal dynamics and global world consistency, which limits their foresight and safety in autonomous driving. This work proposes VLA-World, a VLA world model that combines the generative capacity of world models with a reflective reasoning mechanism: it generates the next-frame image conditioned on an action-derived trajectory, then reasons over that imagined frame to refine the predicted trajectory. The method uses a three-stage training strategy (pretraining, supervised fine-tuning, and reinforcement learning) and introduces the nuScenes-GR-20K dataset, derived from nuScenes. Experiments show that VLA-World outperforms existing VLA and world-model baselines on both planning and future-generation benchmarks.

📝 Abstract

Vision-Language-Action (VLA) models have recently achieved notable progress in end-to-end autonomous driving by integrating perception, reasoning, and control within a unified multimodal framework. However, they often lack explicit modeling of temporal dynamics and global world consistency, which limits their foresight and safety. In contrast, world models can simulate plausible future scenes but generally struggle to reason about or evaluate the imagined future they generate. In this work, we present VLA-World, a simple yet effective VLA world model that unifies predictive imagination with reflective reasoning to improve driving foresight. VLA-World first uses an action-derived feasible trajectory to guide the generation of the next-frame image, capturing rich spatial and temporal cues that describe how the surrounding environment evolves. The model then reasons over this self-generated imagined future frame to refine the predicted trajectory, achieving higher performance and better interpretability. To support this pipeline, we curate nuScenes-GR-20K, a generative reasoning dataset derived from nuScenes, and employ a three-stage training strategy that includes pretraining, supervised fine-tuning, and reinforcement learning. Extensive experiments demonstrate that VLA-World consistently surpasses state-of-the-art VLA and world-model baselines on both planning and future-generation benchmarks. Project page: https://vlaworld.github.io
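
The control flow described in the abstract (draft a trajectory, imagine the next frame under it, then reason over the imagined frame to refine the trajectory) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `propose`, `imagine`, and `reflect`, the `PlanResult` container, and the toy stand-ins below are all hypothetical, and a real system would presumably use learned models where the toy functions use list operations.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Waypoint = Tuple[float, float]   # hypothetical (x, y) position in the ego frame
Frame = List[List[float]]        # toy stand-in for an image: rows of intensities


@dataclass
class PlanResult:
    draft: List[Waypoint]        # trajectory proposed before imagination
    imagined_frame: Frame        # self-generated next frame
    refined: List[Waypoint]      # trajectory after reflecting on the imagined frame


def imagine_then_reflect(
    current_frame: Frame,
    propose: Callable[[Frame], List[Waypoint]],
    imagine: Callable[[Frame, List[Waypoint]], Frame],
    reflect: Callable[[Frame, Frame, List[Waypoint]], List[Waypoint]],
) -> PlanResult:
    """Run one imagine-then-reflect step with caller-supplied components."""
    draft = propose(current_frame)                      # 1. draft trajectory
    imagined = imagine(current_frame, draft)            # 2. next frame under the draft
    refined = reflect(current_frame, imagined, draft)   # 3. refine using the imagined frame
    return PlanResult(draft, imagined, refined)


# Toy components (assumptions for illustration only, not learned models).

def toy_propose(frame: Frame) -> List[Waypoint]:
    # Drive straight ahead with three evenly spaced waypoints.
    return [(float(i), 0.0) for i in range(1, 4)]


def toy_imagine(frame: Frame, trajectory: List[Waypoint]) -> Frame:
    # Rotate rows upward by one to mimic forward motion; ignores the trajectory.
    return frame[1:] + [frame[0]]


def toy_reflect(frame: Frame, imagined: Frame, trajectory: List[Waypoint]) -> List[Waypoint]:
    # Halve the waypoints if the nearest imagined row contains an obstacle (value >= 1.0).
    blocked = any(value >= 1.0 for value in imagined[0])
    scale = 0.5 if blocked else 1.0
    return [(x * scale, y * scale) for x, y in trajectory]


if __name__ == "__main__":
    frame = [[0.0, 0.0], [0.0, 1.0]]   # obstacle one row ahead of the ego vehicle
    result = imagine_then_reflect(frame, toy_propose, toy_imagine, toy_reflect)
    print(result.refined)
```

Passing the three components as callables keeps the sketch agnostic about how each stage is realized; the abstract does not say whether they are separate modules or a single model, so nothing here should be read as describing the actual architecture.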

Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action

world models

autonomous driving

temporal dynamics

future reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action

World Model

Predictive Imagination