🤖 AI Summary
This work addresses the mode collapse problem and the inherent trade-off between trajectory diversity and accuracy in diffusion-based motion planning for autonomous driving. We propose TransDiffuser, an end-to-end generative model that conditions on multimodal scene inputs (camera images, LiDAR point clouds, and navigation commands) and directly outputs high-quality, diverse candidate trajectories via a Transformer encoder and diffusion decoder architecture. Our key contribution is a multimodal representation decorrelation optimization mechanism: by regularizing the latent feature space toward decorrelated dimensions during training, the model improves trajectory diversity and accuracy jointly, without relying on anchor trajectory priors. Evaluated on the NAVSIM benchmark, TransDiffuser achieves a PDM Score (PDMS) of 94.85, significantly outperforming prior state-of-the-art anchor-free methods.
📝 Abstract
In recent years, diffusion models have shown their potential across diverse domains, from vision generation to language modeling. Transferring these capabilities to modern autonomous driving systems has also emerged as a promising direction. In this work, we propose TransDiffuser, an encoder-decoder generative trajectory planning model for end-to-end autonomous driving. The encoded scene information serves as the multi-modal conditional input of the denoising decoder. To tackle the mode collapse dilemma in generating high-quality, diverse trajectories, we introduce a simple yet effective multi-modal representation decorrelation optimization mechanism during training. TransDiffuser achieves a PDMS of 94.85 on the NAVSIM benchmark, surpassing previous state-of-the-art methods without using any anchor-based prior trajectories.
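The abstract does not spell out the decorrelation objective, but a common way to realize such a mechanism is to penalize the off-diagonal entries of the covariance matrix of the latent features, so that individual feature dimensions carry non-redundant information. The sketch below is an illustrative PyTorch implementation of that generic idea, not the paper's exact loss; the function name `decorrelation_loss` and the weighting are assumptions.

```python
import torch

def decorrelation_loss(features: torch.Tensor) -> torch.Tensor:
    """Generic feature-decorrelation regularizer (illustrative, not the
    paper's exact formulation).

    Penalizes the squared off-diagonal entries of the covariance matrix of
    `features` (shape: batch x dim), encouraging latent dimensions to stay
    decorrelated across the batch.
    """
    # Center the features over the batch dimension
    centered = features - features.mean(dim=0, keepdim=True)
    batch, dim = centered.shape
    # Empirical covariance matrix, shape (dim, dim)
    cov = centered.T @ centered / (batch - 1)
    # Zero out the diagonal so only cross-dimension correlations are penalized
    off_diag = cov - torch.diag(torch.diagonal(cov))
    return (off_diag ** 2).sum() / dim
```

In practice a term like this would be added to the main planning (denoising) loss with a small weight, e.g. `total_loss = denoise_loss + 0.01 * decorrelation_loss(latents)`, where the weight is a tunable hyperparameter.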