DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving

📅 2025-10-14
🤖 AI Summary
VLA models for autonomous driving suffer from sparse, low-dimensional action supervision, which leaves much of their high-capacity representational power underused. To address this, we propose DriveVLA-W0, a world-model-augmented training paradigm that adds dense self-supervision: an autoregressive model predicts discrete visual tokens, while a diffusion model forecasts continuous latent features, and both variants learn to generate future images. A lightweight action expert module keeps inference efficient. Crucially, DriveVLA-W0 is the first framework to jointly train a vision-language-action architecture end-to-end with a world model. Evaluated on NAVSIM v1/v2 and a large-scale proprietary dataset, it significantly outperforms BEV- and VLA-based baselines, demonstrating advantages in both generalization capability and data efficiency.

📝 Abstract
Scaling Vision-Language-Action (VLA) models on large-scale data offers a promising path to achieving a more generalized driving intelligence. However, VLA models are limited by a "supervision deficit": the vast model capacity is supervised by sparse, low-dimensional actions, leaving much of their representational power underutilized. To remedy this, we propose DriveVLA-W0, a training paradigm that employs world modeling to predict future images. This task generates a dense, self-supervised signal that compels the model to learn the underlying dynamics of the driving environment. We showcase the paradigm's versatility by instantiating it for two dominant VLA archetypes: an autoregressive world model for VLAs that use discrete visual tokens, and a diffusion world model for those operating on continuous visual features. Building on the rich representations learned from world modeling, we introduce a lightweight action expert to address the inference latency for real-time deployment. Extensive experiments on the NAVSIM v1/v2 benchmark and a 680x larger in-house dataset demonstrate that DriveVLA-W0 significantly outperforms BEV and VLA baselines. Crucially, it amplifies the data scaling law, showing that performance gains accelerate as the training dataset size increases.
Problem

Research questions and friction points this paper is trying to address.

Addresses the supervision deficit in Vision-Language-Action models
Uses world modeling to predict future images as a dense self-supervised signal
Amplifies the data scaling law while keeping inference efficient for real-time deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

World modeling predicts future images, providing dense supervision
Autoregressive and diffusion world models cover discrete visual tokens and continuous latent features, respectively
Lightweight action expert reduces inference latency for real-time deployment
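The core idea above (closing the "supervision deficit" by pairing sparse action regression with dense next-frame token prediction) can be sketched as a toy joint objective. This is a minimal illustration, not the paper's implementation: the dimensions, the loss weight `lam`, and the use of plain lists in place of real tensors are all hypothetical choices made only to show how few scalars the action head supervises compared with the world-model head.

```python
import math
import random

random.seed(0)

# Toy dimensions (hypothetical, chosen only to illustrate the idea).
ACTION_DIM = 16          # e.g. 8 future waypoints of (x, y): sparse supervision
NUM_VISUAL_TOKENS = 576  # e.g. a 24x24 grid of discrete image tokens: dense supervision
VOCAB_SIZE = 1024        # visual tokenizer codebook size (assumed)

def action_loss(pred, target):
    """Sparse supervision: L2 regression on a low-dimensional trajectory."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def world_model_loss(logits, targets):
    """Dense supervision: mean cross-entropy over next-frame visual tokens."""
    total = 0.0
    for row, tgt in zip(logits, targets):
        m = max(row)  # subtract max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[tgt]  # -log softmax(row)[tgt]
    return total / len(targets)

# Fake model outputs and ground truth.
pred_a = [random.gauss(0, 1) for _ in range(ACTION_DIM)]
gt_a = [random.gauss(0, 1) for _ in range(ACTION_DIM)]
logits = [[random.gauss(0, 1) for _ in range(VOCAB_SIZE)]
          for _ in range(NUM_VISUAL_TOKENS)]
gt_tokens = [random.randrange(VOCAB_SIZE) for _ in range(NUM_VISUAL_TOKENS)]

lam = 1.0  # weight balancing the two objectives (hypothetical)
total = action_loss(pred_a, gt_a) + lam * world_model_loss(logits, gt_tokens)
print(f"supervised action dims: {ACTION_DIM}, supervised token slots: {NUM_VISUAL_TOKENS}")
print(f"joint loss: {total:.3f}")
```

The point of the sketch is the ratio of supervised outputs per frame: the world-model branch constrains hundreds of token slots while the action branch constrains only a handful of scalars, which is why the dense branch can exercise capacity the action loss alone leaves idle. The diffusion variant described in the abstract would replace the cross-entropy term with a denoising loss on continuous latents.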
👥 Authors
Yingyan Li (Institute of Automation, Chinese Academy of Sciences; computer vision)
Shuyao Shang (NLPR, Institute of Automation, Chinese Academy of Sciences (CASIA))
Weisong Liu (University of Massachusetts Lowell)
Haochen Wang (NLPR, Institute of Automation, Chinese Academy of Sciences (CASIA))
Yuqi Wang (NLPR, Institute of Automation, Chinese Academy of Sciences (CASIA))
Yuntao Chen (Miromind; agentic AI, multimodal models, computer vision)
Xiaoman Wang (Yinwang Intelligent Technology Co. Ltd.)
Yasong An (Yinwang Intelligent Technology Co. Ltd.)
Chufeng Tang (Yinwang Intelligent Technology Co. Ltd.)
Lu Hou (Yinwang Intelligent Technology Co. Ltd.)
Lue Fan (NLPR, Institute of Automation, Chinese Academy of Sciences (CASIA))
Zhaoxiang Zhang (Institute of Automation, Chinese Academy of Sciences; computer vision, pattern recognition, biologically-inspired learning)