DriveVA: Video Action Models are Zero-Shot Drivers

📅 2026-04-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited generalization of autonomous driving systems in unseen scenarios and the misalignment between video prediction and trajectory planning in existing world models. To this end, the authors propose DriveVA, a novel world model that unifies future video and action sequence generation within a shared latent generative framework. By jointly decoding visual observations and driving actions, DriveVA ensures consistency between scene evolution and planning decisions. A video continuation strategy is introduced to enhance long-horizon rollout coherence. The DiT-based decoder leverages spatiotemporal priors from large-scale pretrained video generation models, enabling tight alignment between predicted dynamics and planned trajectories. Experiments demonstrate that DriveVA achieves a 90.9 PDM score on the NAVSIM challenge and significantly outperforms prior methods on nuScenes and Bench2Drive (built on CARLA), reducing L2 error by 78.9%/52.5% and collision rates by 83.3%/52.4% respectively, showcasing exceptional zero-shot cross-domain generalization.
📝 Abstract
Generalization is a central challenge in autonomous driving, as real-world deployment requires robust performance under unseen scenarios, sensor domains, and environmental conditions. Recent world-model-based planning methods have shown strong capabilities in scene understanding and multi-modal future prediction, yet their generalization across datasets and sensor configurations remains limited. In addition, their loosely coupled planning paradigm often leads to poor video-trajectory consistency during visual imagination. To overcome these limitations, we propose DriveVA, a novel autonomous driving world model that jointly decodes future visual forecasts and action sequences in a shared latent generative process. DriveVA inherits rich priors on motion dynamics and physical plausibility from well-pretrained large-scale video generation models to capture continuous spatiotemporal evolution and causal interaction patterns. To this end, DriveVA employs a DiT-based decoder to jointly predict future action sequences (trajectories) and videos, enabling tighter alignment between planning and scene evolution. We also introduce a video continuation strategy to strengthen long-duration rollout consistency. DriveVA achieves an impressive closed-loop performance of 90.9 PDM score on the challenging NAVSIM benchmark. Extensive experiments also demonstrate the zero-shot capability and cross-domain generalization of DriveVA, which reduces average L2 error and collision rate by 78.9% and 83.3% on nuScenes and by 52.5% and 52.4% on Bench2Drive, built on CARLA v2, compared with the state-of-the-art world-model-based planner.
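The abstract's core idea is that video tokens and action tokens are denoised together in one shared latent sequence, so the two modalities attend to each other rather than being planned separately. A minimal, shape-level sketch of that joint decoding step is below; all dimensions and the toy attention block are hypothetical illustrations, not the paper's actual DiT implementation:

```python
import numpy as np

# Hypothetical dimensions for illustration only (not from the paper):
T, H, W, D = 8, 4, 4, 64   # future frames, latent grid size, embedding dim
A = 8                      # action tokens, e.g. future trajectory waypoints

def joint_denoise_step(video_tokens, action_tokens, shared_block):
    """One denoising step over the concatenated video+action sequence,
    so scene evolution and planning share a single latent generative process."""
    seq = np.concatenate([video_tokens, action_tokens], axis=0)  # (T*H*W + A, D)
    seq = shared_block(seq)                                      # joint attention
    n_video = video_tokens.shape[0]
    return seq[:n_video], seq[n_video:]                          # split modalities

def toy_attention_block(x):
    """Stand-in for a DiT block: residual self-attention over the full sequence."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return x + weights @ x

rng = np.random.default_rng(0)
video = rng.standard_normal((T * H * W, D))
actions = rng.standard_normal((A, D))
video_out, action_out = joint_denoise_step(video, actions, toy_attention_block)
print(video_out.shape, action_out.shape)  # (128, 64) (8, 64)
```

Because every action token attends to every video token (and vice versa) inside the shared block, the predicted trajectory stays consistent with the imagined scene, which is the video-trajectory alignment the paper targets.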
Problem

Research questions and friction points this paper is trying to address.

generalization
autonomous driving
world model
video-trajectory consistency
zero-shot
Innovation

Methods, ideas, or system contributions that make the work stand out.

joint video-action generation
world model
zero-shot driving
DiT-based decoder
cross-domain generalization