Other Vehicle Trajectories Are Also Needed: A Driving World Model Unifies Ego-Other Vehicle Trajectories in Video Latent Space

📅 2025-03-12
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing autonomous driving world models primarily model only ego-vehicle trajectories, neglecting surrounding vehicles’ motion, thereby compromising scene interaction fidelity; moreover, multi-vehicle trajectory matching and controllable generation in video remain challenging. This paper proposes EOT-WM, a unified driving world model that, for the first time, jointly models ego- and other-vehicle trajectories in the video latent space. It achieves high-fidelity, controllable video generation via BEV-to-image coordinate alignment, a spatiotemporal variational autoencoder, and a trajectory-injected diffusion Transformer. Innovatively, it maps multi-vehicle trajectories uniformly into the latent space and introduces a novel controllability metric based on similarity of control latent variables. Evaluated on nuScenes, EOT-WM reduces FID by 30% and FVD by 55% over prior methods, and enables unknown-scenario prediction conditioned on self-generated trajectories.

πŸ“ Abstract
Advanced end-to-end autonomous driving systems predict other vehicles' motions and plan the ego vehicle's trajectory. The world model that can foresee the outcome of the trajectory has been used to evaluate the end-to-end autonomous driving system. However, existing world models predominantly emphasize the trajectory of the ego vehicle and leave other vehicles uncontrollable. This limitation hinders their ability to realistically simulate the interaction between the ego vehicle and the driving scenario. In addition, it remains a challenge to match multiple trajectories with each vehicle in the video to control the video generation. To address the above issues, a driving World Model named EOT-WM is proposed in this paper, unifying Ego-Other vehicle Trajectories in videos. Specifically, we first project ego and other vehicle trajectories in the BEV space into the image coordinate to match each trajectory with its corresponding vehicle in the video. Then, trajectory videos are encoded by the Spatial-Temporal Variational Auto Encoder to align with driving video latents spatially and temporally in the unified visual space. A trajectory-injected diffusion Transformer is further designed to denoise the noisy video latents for video generation with the guidance of ego-other vehicle trajectories. In addition, we propose a metric based on control latent similarity to evaluate the controllability of trajectories. Extensive experiments are conducted on the nuScenes dataset, and the proposed model outperforms the state-of-the-art method by 30% in FID and 55% in FVD. The model can also predict unseen driving scenes with self-produced trajectories.
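The first step the abstract describes, projecting BEV-space trajectories into image coordinates so each trajectory can be matched to its vehicle in the video, can be sketched with a standard pinhole camera model. This is a minimal illustration under assumed intrinsics/extrinsics, not the paper's actual implementation; the function name and the toy camera matrices are hypothetical.

```python
import numpy as np

def project_bev_to_image(traj_bev, T_cam_from_ego, K):
    """Project BEV trajectory points (x, y on the ground plane, z = 0)
    into pixel coordinates with a pinhole camera model (sketch only)."""
    n = traj_bev.shape[0]
    # Lift 2D BEV points to homogeneous 3D points in the ego frame.
    pts_ego = np.concatenate([traj_bev, np.zeros((n, 1)), np.ones((n, 1))], axis=1)
    pts_cam = (T_cam_from_ego @ pts_ego.T).T[:, :3]   # ego frame -> camera frame
    in_front = pts_cam[:, 2] > 1e-6                   # drop points behind the camera
    uvw = (K @ pts_cam[in_front].T).T                 # perspective projection
    return uvw[:, :2] / uvw[:, 2:3]                   # homogeneous -> pixel (u, v)

# Toy calibration: camera 1.5 m above the ground, axes remapped so that
# ego x (forward) becomes camera z (depth). All values are illustrative.
K = np.array([[800.0,   0.0, 640.0],
              [  0.0, 800.0, 360.0],
              [  0.0,   0.0,   1.0]])
T = np.array([[0.0, -1.0,  0.0, 0.0],   # camera x = -ego y (points right)
              [0.0,  0.0, -1.0, 1.5],   # camera y = -ego z + mount height
              [1.0,  0.0,  0.0, 0.0],   # camera z =  ego x (points forward)
              [0.0,  0.0,  0.0, 1.0]])
traj = np.array([[5.0, 0.0], [10.0, 1.0], [20.0, -1.0]])  # (x, y) in meters
uv = project_bev_to_image(traj, T, K)  # one pixel coordinate per waypoint
```

Once every ego and other-vehicle trajectory is rendered into the image plane this way, the resulting trajectory video can be encoded alongside the driving video in the same visual space, as the abstract goes on to describe.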
Problem

Research questions and friction points this paper is trying to address.

Existing world models control only the ego vehicle's trajectory, leaving other vehicles uncontrollable.
This limits how realistically the interaction between the ego vehicle and the driving scene can be simulated.
Matching multiple trajectories to their corresponding vehicles in a video to control generation remains challenging.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unifies ego-other vehicle trajectories in video latent space
Uses Spatial-Temporal Variational Auto Encoder for encoding
Employs trajectory-injected diffusion Transformer for video generation
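The proposed controllability metric is described only as being "based on control latent similarity." One plausible reading, sketched below under that assumption, is a cosine similarity between the control (trajectory) latents recovered from generated videos and those of the ground truth; the function name and the exact formulation are hypothetical, not the paper's definition.

```python
import numpy as np

def control_latent_similarity(z_gen, z_ref):
    """Mean cosine similarity between flattened control latents of generated
    videos and their references. Hypothetical sketch of a 'control latent
    similarity' metric; the paper's exact definition may differ."""
    a = z_gen.reshape(z_gen.shape[0], -1)
    b = z_ref.reshape(z_ref.shape[0], -1)
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8
    return float((num / den).mean())

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8, 16))        # batch of 4 control latents
score = control_latent_similarity(z, z)  # identical latents -> approximately 1.0
```

A metric of this shape rewards generations whose injected trajectory signal survives into the output video, which is the intuition behind measuring controllability in latent space rather than in pixel space.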
Authors
Jian Zhu, Zhengyu Jia, Tian Gao, Jiaxin Deng, Shidi Li, Fu Liu, Xianpeng Lang (Li Auto Inc.)
Xiaolong Sun (Xi'an Jiaotong University)
Topic: multimodal learning