🤖 AI Summary
This work addresses the bottleneck of end-to-end autonomous driving—its heavy reliance on large-scale perception annotations—by proposing a perception-supervision-free driving world model framework. Methodologically, it introduces (1) an intention-aware latent-space world model that jointly encodes driving intent and scene semantics using vision foundation models; (2) self-supervised alignment between latent-state predictions and multimodal observations (images, trajectories, control signals) to enable closed-loop planning learning; and (3) a world-model selector coupled with a multimodal trajectory generation-and-evaluation mechanism. Evaluated on nuScenes and NavSim, the framework achieves an 18.1% relative reduction in L2 trajectory error, a 46.7% decrease in collision rate, and 3.75× faster training convergence, significantly outperforming existing unsupervised and weakly supervised approaches.
📝 Abstract
End-to-end autonomous driving directly generates planning trajectories from raw sensor data, yet it typically relies on costly perception supervision to extract scene information. A critical research challenge arises: constructing an informative driving world model to enable perception annotation-free, end-to-end planning via self-supervised learning. In this paper, we present World4Drive, an end-to-end autonomous driving framework that employs vision foundation models to build latent world models for generating and evaluating multi-modal planning trajectories. Specifically, World4Drive first extracts scene features, including driving intention and world latent representations enriched with spatial-semantic priors provided by vision foundation models. It then generates multi-modal planning trajectories based on current scene features and driving intentions, and predicts multiple intention-driven future states within the latent space. Finally, it introduces a world model selector module to evaluate and select the best trajectory. We achieve perception annotation-free, end-to-end planning through self-supervised alignment between actual future observations and predicted observations reconstructed from the latent space. World4Drive achieves state-of-the-art performance without manual perception annotations on both the open-loop nuScenes and closed-loop NavSim benchmarks, demonstrating an 18.1% relative reduction in L2 error, a 46.7% lower collision rate, and 3.75× faster training convergence. Code will be released at https://github.com/ucaszyp/World4Drive.
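The generate-and-select loop described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the per-intention linear "world models", the latent dimension, and the cosine-similarity scoring rule are all hypothetical placeholders standing in for the learned intention-conditioned world model and the world model selector.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 64  # latent dimension (illustrative)
K = 3   # number of candidate intentions / trajectories

# Placeholder world models: one linear map per driving intention,
# taking the current latent state to a predicted future latent.
world_models = [rng.normal(scale=0.1, size=(D, D)) for _ in range(K)]

def predict_future_latents(current_latent):
    """Roll each intention-conditioned world model forward one step."""
    return [W @ current_latent for W in world_models]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def select_trajectory(candidates, predicted_latents, observed_latent):
    """Selector sketch: score each candidate trajectory by how closely
    its predicted future latent aligns with the encoded observation of
    the actual future, then keep the best-scoring one."""
    scores = [cosine(z, observed_latent) for z in predicted_latents]
    best = int(np.argmax(scores))
    return candidates[best], scores

current = rng.normal(size=D)   # encoded current scene (placeholder)
observed = rng.normal(size=D)  # encoded actual future observation
trajectories = [f"traj_{k}" for k in range(K)]

futures = predict_future_latents(current)
best_traj, scores = select_trajectory(trajectories, futures, observed)
print(best_traj, [round(s, 3) for s in scores])
```

At training time, the same prediction-observation alignment (here a cosine score) would serve as the self-supervised signal, so no perception annotations are needed for either learning or trajectory selection.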