🤖 AI Summary
VLA models for autonomous driving suffer from sparse, low-dimensional action supervision, which leaves much of their high-capacity representations underutilized. To address this, we propose DriveVLA-W0—a world-model-augmented training paradigm that adds dense self-supervision: an autoregressive model predicts discrete visual tokens, while a diffusion model forecasts continuous latent features, jointly enabling future image generation. A lightweight action expert module is further introduced for efficient inference. Crucially, DriveVLA-W0 is the first framework to enable end-to-end joint training of vision-language-action architectures with a world model. Evaluated on NAVSIM v1/v2 and a large-scale proprietary dataset, it significantly outperforms BEV- and VLA-based baselines, demonstrating dual advantages in generalization capability and data efficiency.
📝 Abstract
Scaling Vision-Language-Action (VLA) models on large-scale data offers a promising path to achieving a more generalized driving intelligence. However, VLA models are limited by a "supervision deficit": the vast model capacity is supervised by sparse, low-dimensional actions, leaving much of their representational power underutilized. To remedy this, we propose **DriveVLA-W0**, a training paradigm that employs world modeling to predict future images. This task generates a dense, self-supervised signal that compels the model to learn the underlying dynamics of the driving environment. We showcase the paradigm's versatility by instantiating it for two dominant VLA archetypes: an autoregressive world model for VLAs that use discrete visual tokens, and a diffusion world model for those operating on continuous visual features. Building on the rich representations learned from world modeling, we introduce a lightweight action expert to address the inference latency for real-time deployment. Extensive experiments on the NAVSIM v1/v2 benchmark and a 680x larger in-house dataset demonstrate that DriveVLA-W0 significantly outperforms BEV and VLA baselines. Crucially, it amplifies the data scaling law, showing that performance gains accelerate as the training dataset size increases.
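The core idea — pairing sparse action supervision with a dense world-modeling signal — can be sketched as a joint objective. The snippet below is a minimal, hypothetical illustration (the function name, shapes, and loss weight are assumptions, not the paper's actual API): a low-dimensional trajectory regression loss is combined with a next-frame visual-token cross-entropy, corresponding to the autoregressive instantiation; the diffusion variant would replace the token term with a denoising loss on continuous latents.

```python
import numpy as np

def joint_loss(action_pred, action_gt, token_logits, token_gt, w_world=1.0):
    """Illustrative joint objective: sparse action loss + dense world-model loss.

    action_pred/action_gt: (D,) low-dimensional trajectory vectors.
    token_logits: (N, V) logits over V discrete visual tokens for N positions.
    token_gt: (N,) ground-truth next-frame token indices.
    w_world: weight balancing the dense world-modeling signal (assumed).
    """
    # Sparse supervision: L2 regression on the planned trajectory.
    l_action = np.mean((action_pred - action_gt) ** 2)

    # Dense supervision: cross-entropy over next-frame visual tokens
    # (numerically stable log-softmax).
    logits = token_logits - token_logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    l_world = -np.mean(log_probs[np.arange(len(token_gt)), token_gt])

    return l_action + w_world * l_world
```

The token term supervises every visual position of the future frame, so the gradient signal is orders of magnitude denser than the handful of action dimensions alone — this is the "supervision deficit" the paradigm targets.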