🤖 AI Summary
VLA models for autonomous driving suffer from sparse, low-dimensional action supervision, which leaves much of their high-capacity representations underutilized. To address this, we propose DriveVLA-W0—a world-model-augmented training paradigm that adds dense self-supervision: an autoregressive model predicts discrete visual tokens, while a diffusion model forecasts continuous latent features, jointly enabling future image generation. A lightweight action expert module is further introduced for efficient inference. Crucially, DriveVLA-W0 is the first framework to enable end-to-end joint training of vision-language-action architectures with a world model. Evaluated on NAVSIM v1/v2 and a large-scale proprietary dataset, it significantly outperforms BEV- and VLA-based baselines, demonstrating dual advantages in generalization capability and data efficiency.
📝 Abstract
Scaling Vision-Language-Action (VLA) models on large-scale data offers a promising path to achieving a more generalized driving intelligence. However, VLA models are limited by a "supervision deficit": the vast model capacity is supervised by sparse, low-dimensional actions, leaving much of their representational power underutilized. To remedy this, we propose **DriveVLA-W0**, a training paradigm that employs world modeling to predict future images. This task generates a dense, self-supervised signal that compels the model to learn the underlying dynamics of the driving environment. We showcase the paradigm's versatility by instantiating it for two dominant VLA archetypes: an autoregressive world model for VLAs that use discrete visual tokens, and a diffusion world model for those operating on continuous visual features. Building on the rich representations learned from world modeling, we introduce a lightweight action expert to address the inference latency for real-time deployment. Extensive experiments on the NAVSIM v1/v2 benchmark and a 680x larger in-house dataset demonstrate that DriveVLA-W0 significantly outperforms BEV and VLA baselines. Crucially, it amplifies the data scaling law, showing that performance gains accelerate as the training dataset size increases.
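The core idea — pairing sparse action supervision with a dense world-modeling signal — can be sketched as a joint objective. The snippet below is a minimal, hypothetical illustration (the function name, shapes, and loss weight are assumptions, not the paper's actual API): a low-dimensional trajectory regression loss is combined with a next-frame visual-token cross-entropy, corresponding to the autoregressive instantiation; the diffusion variant would replace the token term with a denoising loss on continuous latents.

```python
import numpy as np

def joint_loss(action_pred, action_gt, token_logits, token_gt, w_world=1.0):
    """Illustrative joint objective: sparse action loss + dense world-model loss.

    action_pred/action_gt: (D,) low-dimensional trajectory vectors.
    token_logits: (N, V) logits over V discrete visual tokens for N positions.
    token_gt: (N,) ground-truth next-frame token indices.
    w_world: weight balancing the dense world-modeling signal (assumed).
    """
    # Sparse supervision: L2 regression on the planned trajectory.
    l_action = np.mean((action_pred - action_gt) ** 2)

    # Dense supervision: cross-entropy over next-frame visual tokens
    # (numerically stable log-softmax).
    logits = token_logits - token_logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    l_world = -np.mean(log_probs[np.arange(len(token_gt)), token_gt])

    return l_action + w_world * l_world
```

The token term supervises every visual position of the future frame, so the gradient signal is orders of magnitude denser than the handful of action dimensions alone — this is the "supervision deficit" the paradigm targets.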