🤖 AI Summary
This work proposes DriveWAM, an end-to-end world-action model for autonomous driving that leverages video generation priors. By adapting a pretrained video diffusion Transformer into an autoregressive video-action policy, the method models video and action tokens as a single temporal sequence. It further introduces scene-evolving driving guidance, which steers generation with high-level semantic intent, and a selective key-value (KV) memory cache, which keeps memory bounded during long-horizon inference. This is the first study to apply video generation priors to autonomous driving action prediction, achieving state-of-the-art planning performance on the NAVSIM and PhysicalAI-Autonomous-Vehicles benchmarks. Notably, the approach scales well, with consistent performance gains as training data grows from 4k to 100k driving clips.
📝 Abstract
Pretrained foundation models have become an important basis for end-to-end autonomous driving. In contrast to vision-language models (VLMs), which are pretrained primarily on static image-text pairs, video generative models capture temporal dynamics and motion priors that are naturally suited to driving. We present DriveWAM, a driving world-action model that adapts a pretrained video diffusion transformer into an autoregressive video-action policy. DriveWAM organizes video and action streams into a unified temporal token sequence and trains them under a joint flow-matching objective, preserving the pretrained video-generation architecture while adapting its large-scale video priors to action generation. To incorporate high-level scene understanding, we introduce scene-evolving driving guidance, where a frozen VLM produces chunk-specific semantic intent to guide video-action generation. To keep memory growth under control during long-horizon rollout, we further introduce selective KV memory, which maintains bounded, modality-aware video and action memory pools through relevance-redundancy cache selection at inference time. Experiments on NAVSIM and the PhysicalAI-Autonomous-Vehicles benchmark show that DriveWAM achieves strong planning performance, and a data-scaling study from 4k to 100k driving clips further confirms the scaling potential of world-action modeling for end-to-end autonomous driving.
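The abstract does not spell out how relevance-redundancy cache selection is scored, so the following is only a minimal sketch of one plausible reading: a greedy rule in the style of maximal marginal relevance that keeps entries similar to the current query while penalizing entries similar to those already kept. The function names (`select_memory`, `update_pools`), the cosine similarity measure, and the `trade_off` weight are all assumptions made for illustration and are not taken from DriveWAM.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_memory(keys, query, budget, trade_off=0.5):
    """Greedily keep at most `budget` cache entries.

    Assumed scoring rule (not from the paper): each candidate is scored by
    trade_off * relevance-to-query minus (1 - trade_off) * its maximum
    similarity to any entry already kept. Returns kept indices in their
    original (temporal) order.
    """
    remaining = list(range(len(keys)))
    kept = []
    while remaining and len(kept) < budget:
        def score(i):
            relevance = cosine(keys[i], query)
            redundancy = max((cosine(keys[i], keys[j]) for j in kept), default=0.0)
            return trade_off * relevance - (1 - trade_off) * redundancy

        best = max(remaining, key=score)
        kept.append(best)
        remaining.remove(best)
    return sorted(kept)


def update_pools(pools, modality, new_keys, query, budget):
    """Append new entries to one modality's pool, then prune that pool to `budget`.

    Keeping a separate pool per modality (e.g. "video", "action") is how this
    sketch interprets "modality-aware"; each pool is bounded independently.
    """
    merged = pools.get(modality, []) + new_keys
    pools[modality] = [merged[i] for i in select_memory(merged, query, budget)]
    return pools
```

In this sketch, calling `update_pools` once per generated chunk keeps each pool at or below its budget regardless of rollout length, which is the bounded-memory property the abstract describes. A real KV cache would store key and value tensors per attention layer rather than plain lists.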