DriveWorld-VLA: Unified Latent-Space World Modeling with Vision-Language-Action for Autonomous Driving

📅 2026-02-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a core limitation of existing end-to-end autonomous driving methods: lacking a unified latent state representation, they struggle to couple scene-dynamics prediction with action planning. To overcome this, the authors propose DriveWorld-VLA, a framework that, for the first time, deeply integrates vision-language-action (VLA) modeling with world models at the representation level. By constructing a joint, latent-state-centric representation, the approach enables action-conditioned, feature-level controllable imagination and supports fully end-to-end training. The method substantially reduces reliance on densely annotated data while achieving state-of-the-art performance: 91.3 PDMS on NAVSIM v1, 86.8 EPDMS on NAVSIM v2, and a low 0.16 average collision rate over 3 seconds on the nuScenes dataset.
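The summary's claim that a shared latent state enables end-to-end training with less dense annotation can be read as a joint objective: the world model's transition loss and the planner's loss are computed on the same latent features, so self-supervised future prediction supplements the sparser action supervision. The sketch below illustrates that pattern under assumed names and weighting (`encoder`, `world_model`, `policy`, `w`); it is not the paper's actual loss formulation.

```python
# Hypothetical joint objective over a shared latent state; an
# illustrative sketch, NOT DriveWorld-VLA's published training code.
import torch
import torch.nn.functional as F

def joint_loss(encoder, world_model, policy, obs_t, obs_t1, action_gt, w=0.5):
    """obs_t, obs_t1: consecutive observations; action_gt: expert action."""
    z_t = encoder(obs_t)                    # shared latent scene state
    z_t1_target = encoder(obs_t1).detach()  # self-supervised target, no labels

    # World-model branch: predict the next latent from (state, action).
    z_t1_pred = world_model(z_t, action_gt)
    loss_world = F.mse_loss(z_t1_pred, z_t1_target)

    # Planning branch: decode an action from the very same latent state.
    action_pred = policy(z_t)
    loss_plan = F.l1_loss(action_pred, action_gt)

    # Single end-to-end objective over the shared representation.
    return loss_plan + w * loss_world
```

Because the world-model term targets the encoder's own features, it needs no dense annotation, which is one plausible reading of how latent-space modeling cuts the labeling burden the summary mentions.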

📝 Abstract
End-to-end (E2E) autonomous driving has recently attracted increasing interest in unifying Vision-Language-Action (VLA) models with world models to enhance decision-making and forward-looking imagination. However, existing methods fail to unify future scene evolution and action planning within a single architecture because latent states are not adequately shared, limiting the impact of visual imagination on action decisions. To address this limitation, we propose DriveWorld-VLA, a novel framework that unifies world modeling and planning within a latent space by tightly integrating VLA and world models at the representation level, enabling the VLA planner to benefit directly from holistic scene-evolution modeling and reducing reliance on densely annotated supervision. Additionally, DriveWorld-VLA uses the latent states of the world model as the core decision-making states of the VLA planner, allowing the planner to assess how candidate actions affect future scene evolution. By conducting world modeling entirely in the latent space, DriveWorld-VLA supports controllable, action-conditioned imagination at the feature level, avoiding expensive pixel-level rollouts. Extensive open-loop and closed-loop evaluations demonstrate the effectiveness of DriveWorld-VLA, which achieves state-of-the-art performance: 91.3 PDMS on NAVSIMv1, 86.8 EPDMS on NAVSIMv2, and a 0.16 3-second average collision rate on nuScenes. Code and models will be released at https://github.com/liulin815/DriveWorld-VLA.git.
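The abstract's central mechanism is action-conditioned imagination carried out entirely in feature space: the world model rolls each candidate action sequence forward over latent states, and the planner scores the imagined futures rather than decoded pixels. Since the paper's architecture is not detailed here, the following PyTorch sketch only illustrates that pattern; the module names (`LatentWorldModel`, `VLAPlanner`), dimensions, and scoring head are assumptions, not the authors' implementation.

```python
# Minimal sketch of latent-space, action-conditioned imagination.
# Hypothetical modules and shapes; NOT the DriveWorld-VLA reference code.
import torch
import torch.nn as nn

class LatentWorldModel(nn.Module):
    """Predicts the next latent scene state conditioned on an action."""
    def __init__(self, state_dim=512, action_dim=2):
        super().__init__()
        self.transition = nn.Sequential(
            nn.Linear(state_dim + action_dim, state_dim),
            nn.GELU(),
            nn.Linear(state_dim, state_dim),
        )

    def forward(self, state, action):
        # Feature-level rollout: no pixel decoding is ever required.
        return self.transition(torch.cat([state, action], dim=-1))

class VLAPlanner(nn.Module):
    """Scores an imagined future using the shared latent state."""
    def __init__(self, state_dim=512):
        super().__init__()
        self.value_head = nn.Linear(state_dim, 1)

    def forward(self, imagined_state):
        return self.value_head(imagined_state).squeeze(-1)

def plan(world_model, planner, state, candidate_actions, horizon=3):
    """Imagine each candidate action sequence in latent space,
    score the final imagined state, and pick the best candidate."""
    scores = []
    for actions in candidate_actions:  # actions: (horizon, action_dim)
        s = state
        for t in range(horizon):
            s = world_model(s, actions[t])
        scores.append(planner(s))
    return torch.stack(scores).argmax()

# Usage (hypothetical shapes): 5 candidates, 3-step horizon, 2-D actions.
# wm, pl = LatentWorldModel(), VLAPlanner()
# best = plan(wm, pl, torch.randn(512), torch.randn(5, 3, 2))
```

Because the rollout never leaves feature space, scoring many candidate trajectories stays cheap relative to pixel-level generation, which is the efficiency argument the abstract makes against pixel rollouts.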
Problem

Research questions and friction points this paper is trying to address.

autonomous driving
world modeling
Vision-Language-Action
latent space
action planning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action
World Modeling
Latent Space
Autonomous Driving
Action-Conditioned Imagination
👥 Authors
Feiyang Jia
School of Computer Science and Technology, Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, Beijing Jiaotong University
Lin Liu
School of Computer Science and Technology, Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, Beijing Jiaotong University
Ziying Song
Beijing Jiaotong University
Caiyan Jia
School of Computer Science and Technology, Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, Beijing Jiaotong University
Hangjun Ye
Xiaomi EV
Xiaoshuai Hao
Beijing Academy of Artificial Intelligence (BAAI)
Long Chen
Xiaomi EV