DriveVA: Video Action Models are Zero-Shot Drivers

📅 2026-04-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited generalization of autonomous driving systems in unseen scenarios and the misalignment between video prediction and trajectory planning in existing world models. To this end, the authors propose DriveVA, a novel world model that unifies future video and action sequence generation within a shared latent generative framework. By jointly decoding visual observations and driving actions, DriveVA ensures consistency between scene evolution and planning decisions. A video continuation strategy is introduced to enhance long-horizon rollout coherence. The DiT-based decoder leverages spatiotemporal priors from large-scale pretrained video generation models, enabling tight alignment between predicted dynamics and planned trajectories. Experiments demonstrate that DriveVA achieves a 90.9 PDM score on the NAVSIM challenge and significantly outperforms prior methods on nuScenes and Bench2Drive (built on CARLA), reducing L2 error by 78.9%/52.5% and collision rates by 83.3%/52.4% respectively, showcasing exceptional zero-shot cross-domain generalization.
📝 Abstract
Generalization is a central challenge in autonomous driving, as real-world deployment requires robust performance under unseen scenarios, sensor domains, and environmental conditions. Recent world-model-based planning methods have shown strong capabilities in scene understanding and multi-modal future prediction, yet their generalization across datasets and sensor configurations remains limited. In addition, their loosely coupled planning paradigm often leads to poor video-trajectory consistency during visual imagination. To overcome these limitations, we propose DriveVA, a novel autonomous driving world model that jointly decodes future visual forecasts and action sequences in a shared latent generative process. DriveVA inherits rich priors on motion dynamics and physical plausibility from well-pretrained large-scale video generation models to capture continuous spatiotemporal evolution and causal interaction patterns. To this end, DriveVA employs a DiT-based decoder to jointly predict future action sequences (trajectories) and videos, enabling tighter alignment between planning and scene evolution. We also introduce a video continuation strategy to strengthen long-duration rollout consistency. DriveVA achieves an impressive closed-loop performance of 90.9 PDM score on the challenging NAVSIM benchmark. Extensive experiments also demonstrate the zero-shot capability and cross-domain generalization of DriveVA, which reduces average L2 error and collision rate by 78.9% and 83.3% on nuScenes and by 52.5% and 52.4% on Bench2Drive, built on CARLA v2, compared with the state-of-the-art world-model-based planner.
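The abstract's core idea is that video tokens and action tokens are denoised together in one shared latent sequence, so the two modalities attend to each other rather than being planned separately. A minimal, shape-level sketch of that joint decoding step is below; all dimensions and the toy attention block are hypothetical illustrations, not the paper's actual DiT implementation:

```python
import numpy as np

# Hypothetical dimensions for illustration only (not from the paper):
T, H, W, D = 8, 4, 4, 64   # future frames, latent grid size, embedding dim
A = 8                      # action tokens, e.g. future trajectory waypoints

def joint_denoise_step(video_tokens, action_tokens, shared_block):
    """One denoising step over the concatenated video+action sequence,
    so scene evolution and planning share a single latent generative process."""
    seq = np.concatenate([video_tokens, action_tokens], axis=0)  # (T*H*W + A, D)
    seq = shared_block(seq)                                      # joint attention
    n_video = video_tokens.shape[0]
    return seq[:n_video], seq[n_video:]                          # split modalities

def toy_attention_block(x):
    """Stand-in for a DiT block: residual self-attention over the full sequence."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return x + weights @ x

rng = np.random.default_rng(0)
video = rng.standard_normal((T * H * W, D))
actions = rng.standard_normal((A, D))
video_out, action_out = joint_denoise_step(video, actions, toy_attention_block)
print(video_out.shape, action_out.shape)  # (128, 64) (8, 64)
```

Because every action token attends to every video token (and vice versa) inside the shared block, the predicted trajectory stays consistent with the imagined scene, which is the video-trajectory alignment the paper targets.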
Problem

Research questions and friction points this paper is trying to address.

generalization
autonomous driving
world model
video-trajectory consistency
zero-shot
Innovation

Methods, ideas, or system contributions that make the work stand out.

joint video-action generation
world model
zero-shot driving
DiT-based decoder
cross-domain generalization